Extracting JSON and XML Data for ETL with Python: A Practical Guide

In modern data workflows, JSON and XML have become the backbone formats for structured and semi-structured data. Whether you are integrating APIs, migrating data, or building ETL pipelines, extracting, cleaning, and transforming these data formats is an essential skill for any data engineer or analyst. In this article, we will dive deep into how to handle JSON and XML data in Python and integrate it efficiently into ETL processes.

Understanding JSON and XML Formats

Before writing any code, it is essential to understand the fundamental differences between JSON and XML.

  • JSON (JavaScript Object Notation): Lightweight, easy to parse in Python with the built-in json module, commonly used in web APIs.
  • XML (eXtensible Markup Language): Verbose but highly structured, widely used in enterprise systems and legacy APIs.

Example of JSON:

{
  "employees": [
    {"name": "Alice", "age": 30, "department": "HR"},
    {"name": "Bob", "age": 25, "department": "IT"}
  ]
}

Example of XML:

<employees>
    <employee>
        <name>Alice</name>
        <age>30</age>
        <department>HR</department>
    </employee>
    <employee>
        <name>Bob</name>
        <age>25</age>
        <department>IT</department>
    </employee>
</employees>

Setting Up Python Environment

To extract JSON and XML data for ETL pipelines, you need a Python environment with essential libraries:

pip install pandas requests lxml

We will use:

  • json – native Python module for JSON
  • xml.etree.ElementTree – built-in XML parser
  • pandas – for data manipulation
  • requests – fetching data from APIs
  • lxml – optional, faster XML parser with full XPath support

Extracting JSON Data

JSON data can come from local files or web APIs.

From Local Files

import json

# Specify the encoding explicitly so parsing does not depend on the OS default
with open('data.json', encoding='utf-8') as f:
    data = json.load(f)

print(data)

From Web APIs

import requests

url = 'https://api.example.com/employees'
response = requests.get(url)
data = response.json()

print(data)
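In production pipelines it pays to guard the request with a timeout and an explicit status check, so a slow or failing endpoint surfaces as an error instead of a hang. A minimal sketch (the URL below is the same placeholder endpoint used above):

```python
import requests

def fetch_json(url, timeout=10):
    """Fetch JSON from an API endpoint with basic error handling."""
    # timeout prevents the ETL job from hanging on a slow endpoint
    response = requests.get(url, timeout=timeout)
    # raise_for_status turns HTTP errors (4xx/5xx) into exceptions
    response.raise_for_status()
    return response.json()

# Usage (placeholder endpoint from the example above):
# data = fetch_json('https://api.example.com/employees')
```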

Handling Nested JSON

Nested JSON is common in APIs. Use pandas.json_normalize to flatten:

import pandas as pd

df = pd.json_normalize(data, 'employees')
print(df)
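When the payload also carries top-level fields you want to keep, json_normalize's meta argument copies them onto every flattened row. A small sketch with a made-up payload:

```python
import pandas as pd

# Hypothetical payload: a company record nesting a list of employees
payload = {
    "company": "Acme",
    "employees": [
        {"name": "Alice", "age": 30, "department": "HR"},
        {"name": "Bob", "age": 25, "department": "IT"},
    ],
}

# record_path flattens the nested list; meta carries top-level fields down
df = pd.json_normalize(payload, record_path="employees", meta=["company"])
print(df.columns.tolist())  # ['name', 'age', 'department', 'company']
```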

Extracting XML Data

XML data extraction is similar but requires handling tree structures.

From Local Files

import xml.etree.ElementTree as ET

tree = ET.parse('data.xml')
root = tree.getroot()

for emp in root.findall('employee'):
    name = emp.find('name').text
    age = emp.find('age').text
    dept = emp.find('department').text
    print(name, age, dept)

From APIs

import requests
import xml.etree.ElementTree as ET

url = 'https://api.example.com/employees.xml'
response = requests.get(url)
root = ET.fromstring(response.content)

for emp in root.findall('employee'):
    print(emp.find('name').text, emp.find('age').text, emp.find('department').text)

Handling XML Namespaces

Many XML files use namespaces. Use a namespace dictionary:

ns = {'ns': 'http://example.com/schema'}
for emp in root.findall('ns:employee', ns):
    print(emp.find('ns:name', ns).text)
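Putting the namespace handling together, here is a self-contained sketch (the schema URI is a stand-in, matching the one above):

```python
import xml.etree.ElementTree as ET

# Hypothetical namespaced document; the default namespace applies to every tag
xml_doc = """
<employees xmlns="http://example.com/schema">
    <employee><name>Alice</name></employee>
    <employee><name>Bob</name></employee>
</employees>
"""

root = ET.fromstring(xml_doc)
ns = {'ns': 'http://example.com/schema'}

# Every tag in the default namespace must be qualified with the prefix
names = [emp.find('ns:name', ns).text for emp in root.findall('ns:employee', ns)]
print(names)  # ['Alice', 'Bob']
```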

Cleaning and Transforming Data

After extracting data, the next step is cleaning and transforming for ETL:

import pandas as pd

# For JSON
df_json = pd.json_normalize(data, 'employees')

# Drop duplicates
df_json.drop_duplicates(inplace=True)

# Fill missing values
df_json.fillna({'department': 'Unknown'}, inplace=True)

print(df_json)

# For XML: guard against missing elements, which .find() returns as None
rows = []
for emp in root.findall('employee'):
    dept = emp.find('department')
    rows.append({
        'name': emp.find('name').text,
        'age': emp.find('age').text,
        'department': dept.text if dept is not None and dept.text else 'Unknown'
    })

df_xml = pd.DataFrame(rows)
print(df_xml)
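One cleaning step worth calling out: ElementTree returns every value as a string, so numeric columns extracted from XML need an explicit cast before any aggregation. A sketch, recreated with sample rows so the snippet stands alone:

```python
import pandas as pd

# XML text content is always string-typed, even for numbers
df_demo = pd.DataFrame([
    {'name': 'Alice', 'age': '30', 'department': 'HR'},
    {'name': 'Bob', 'age': '25', 'department': 'IT'},
])

# errors='coerce' turns unparseable values into NaN instead of raising
df_demo['age'] = pd.to_numeric(df_demo['age'], errors='coerce')
print(df_demo['age'].sum())  # 55
```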

Integrating Extracted Data into ETL Pipelines

Finally, we integrate cleaned data into ETL pipelines:

# Export to CSV
df_json.to_csv('employees_json.csv', index=False)
df_xml.to_csv('employees_xml.csv', index=False)

# Or load into a database (example with SQLAlchemy)
from sqlalchemy import create_engine

engine = create_engine('sqlite:///employees.db')
df_json.to_sql('employees_json', con=engine, if_exists='replace', index=False)
df_xml.to_sql('employees_xml', con=engine, if_exists='replace', index=False)
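A quick sanity check after loading helps catch silent failures. A sketch using an in-memory SQLite database so it runs anywhere (table and column names here are illustrative):

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite keeps the check self-contained and leaves no file behind
demo_engine = create_engine('sqlite:///:memory:')
demo_df = pd.DataFrame({'name': ['Alice', 'Bob'], 'department': ['HR', 'IT']})
demo_df.to_sql('employees', con=demo_engine, if_exists='replace', index=False)

# Read the row count back; a mismatch here means the load silently failed
check = pd.read_sql('SELECT COUNT(*) AS n FROM employees', con=demo_engine)
print(int(check['n'].iloc[0]))  # 2
```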

With these steps, you can efficiently extract, clean, transform, and load JSON and XML data into your ETL pipelines using Python.

Conclusion

Extracting JSON and XML data is a core task in ETL workflows. By leveraging Python’s built-in libraries, along with pandas and requests, you can handle a wide range of data sources effectively. Remember to:

  • Understand the data format (JSON vs XML)
  • Use robust methods for parsing and flattening nested structures
  • Clean and transform data before loading
  • Integrate data smoothly into ETL pipelines for analysis or storage

With these skills, you are ready to build scalable ETL pipelines and handle real-world JSON and XML datasets efficiently.
