Extracting JSON and XML Data for ETL with Python: A Practical Guide
In modern data workflows, JSON and XML have become the backbone formats for structured and semi-structured data. Whether you are integrating APIs, migrating data, or building ETL pipelines, extracting, cleaning, and transforming these data formats is an essential skill for any data engineer or analyst. In this article, we will dive deep into how to handle JSON and XML data in Python and integrate it efficiently into ETL processes.
Understanding JSON and XML Formats
Before we dive into coding, it is essential to understand the fundamental differences between JSON and XML.
- JSON (JavaScript Object Notation): Lightweight, easy to parse in Python with the built-in json library, and commonly used in web APIs.
- XML (eXtensible Markup Language): Verbose but highly structured, and widely used in enterprise systems and legacy APIs.
Example of JSON:
{
  "employees": [
    {"name": "Alice", "age": 30, "department": "HR"},
    {"name": "Bob", "age": 25, "department": "IT"}
  ]
}
Example of XML:
<employees>
  <employee>
    <name>Alice</name>
    <age>30</age>
    <department>HR</department>
  </employee>
  <employee>
    <name>Bob</name>
    <age>25</age>
    <department>IT</department>
  </employee>
</employees>
Setting Up Python Environment
To extract JSON and XML data for ETL pipelines, you need a Python environment with essential libraries:
pip install pandas requests lxml
We will use:
- json – the built-in Python module for parsing JSON
- xml.etree.ElementTree – the built-in XML parser
- lxml – a faster XML parser with XPath support
- pandas – for data manipulation
- requests – for fetching data from APIs
Extracting JSON Data
JSON data can come from local files or web APIs.
From Local Files
import json

with open('data.json') as f:
    data = json.load(f)

print(data)
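In real pipelines the input file may be missing or malformed, so it is worth catching parse errors explicitly. A minimal defensive sketch, assuming the same data.json file:

import json

try:
    with open('data.json', encoding='utf-8') as f:
        data = json.load(f)
except FileNotFoundError:
    print('data.json not found')
    data = {}
except json.JSONDecodeError as e:
    # Report where the parser failed instead of crashing the pipeline
    print(f'Invalid JSON at line {e.lineno}, column {e.colno}: {e.msg}')
    data = {}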
From Web APIs
import requests
url = 'https://api.example.com/employees'
response = requests.get(url)
data = response.json()
print(data)
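In production you will also want to guard against slow endpoints and non-200 responses. A minimal sketch, assuming the same placeholder URL:

import requests

url = 'https://api.example.com/employees'  # placeholder endpoint
try:
    # Fail fast on slow endpoints and surface HTTP errors explicitly
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    data = response.json()
except requests.RequestException as e:
    print(f'Request failed: {e}')
    data = {}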
Handling Nested JSON
Nested JSON is common in APIs. Use pandas.json_normalize to flatten:
import pandas as pd
df = pd.json_normalize(data, 'employees')
print(df)
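For more deeply nested payloads, json_normalize also accepts record_path and meta arguments to walk into the structure while keeping parent fields as columns. A sketch with a hypothetical payload that nests employees under departments:

import pandas as pd

# Hypothetical nested payload for illustration
data = {
    "company": "Acme",
    "departments": [
        {"name": "HR", "employees": [{"name": "Alice", "age": 30}]},
        {"name": "IT", "employees": [{"name": "Bob", "age": 25}]},
    ],
}

# Flatten employees while keeping company and department name as columns
df = pd.json_normalize(
    data,
    record_path=["departments", "employees"],
    meta=["company", ["departments", "name"]],
    meta_prefix="meta_",
)
print(df)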
Extracting XML Data
XML data extraction is similar but requires handling tree structures.
From Local Files
import xml.etree.ElementTree as ET

tree = ET.parse('data.xml')
root = tree.getroot()

for emp in root.findall('employee'):
    name = emp.find('name').text
    age = emp.find('age').text
    dept = emp.find('department').text
    print(name, age, dept)
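For XML files too large to load at once, ElementTree's iterparse streams elements so you can process and discard them one at a time. A minimal sketch, assuming the same data.xml structure:

import xml.etree.ElementTree as ET

# Stream <employee> elements instead of building the whole tree in memory
for event, elem in ET.iterparse('data.xml', events=('end',)):
    if elem.tag == 'employee':
        print(elem.find('name').text, elem.find('department').text)
        elem.clear()  # free memory used by the processed element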
From APIs
import requests
import xml.etree.ElementTree as ET

url = 'https://api.example.com/employees.xml'
response = requests.get(url)
root = ET.fromstring(response.content)

for emp in root.findall('employee'):
    print(emp.find('name').text, emp.find('age').text, emp.find('department').text)
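The lxml package installed earlier exposes a similar API with faster parsing and full XPath support. A minimal sketch against the same employee XML:

from lxml import etree

tree = etree.parse('data.xml')
# XPath with text() returns the matching text nodes directly
names = tree.xpath('//employee/name/text()')
ages = tree.xpath('//employee/age/text()')
print(list(zip(names, ages)))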
Handling XML Namespaces
Many XML files use namespaces. Use a namespace dictionary:
ns = {'ns': 'http://example.com/schema'}

for emp in root.findall('ns:employee', ns):
    print(emp.find('ns:name', ns).text)
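To make this concrete, here is a sketch that parses a small namespaced document inline; the schema URL is a placeholder:

import xml.etree.ElementTree as ET

# Hypothetical namespaced document for illustration
xml_data = '''<employees xmlns="http://example.com/schema">
  <employee><name>Alice</name></employee>
</employees>'''

root = ET.fromstring(xml_data)
ns = {'ns': 'http://example.com/schema'}
for emp in root.findall('ns:employee', ns):
    print(emp.find('ns:name', ns).text)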
Cleaning and Transforming Data
After extracting data, the next step is cleaning and transforming for ETL:
import pandas as pd

# For JSON
df_json = pd.json_normalize(data, 'employees')

# Drop duplicates
df_json.drop_duplicates(inplace=True)

# Fill missing values
df_json.fillna({'department': 'Unknown'}, inplace=True)
print(df_json)

# For XML
rows = []
for emp in root.findall('employee'):
    # Guard against a missing <department> element, which would
    # otherwise raise AttributeError on .text
    dept = emp.find('department')
    rows.append({
        'name': emp.find('name').text,
        'age': int(emp.find('age').text),  # cast age from string to integer
        'department': dept.text if dept is not None and dept.text else 'Unknown'
    })

df_xml = pd.DataFrame(rows)
print(df_xml)
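Before loading, it helps to normalize column types so both sources end up with the same schema. A minimal sketch, assuming the df_json and df_xml frames built above:

import pandas as pd

for df in (df_json, df_xml):
    # Coerce age to a nullable integer; bad values become <NA>
    df['age'] = pd.to_numeric(df['age'], errors='coerce').astype('Int64')
    df['name'] = df['name'].str.strip()
    df['department'] = df['department'].fillna('Unknown')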
Integrating Extracted Data into ETL Pipelines
Finally, we integrate cleaned data into ETL pipelines:
# Export to CSV
df_json.to_csv('employees_json.csv', index=False)
df_xml.to_csv('employees_xml.csv', index=False)
# Or load into a database (example with SQLAlchemy)
from sqlalchemy import create_engine
engine = create_engine('sqlite:///employees.db')
df_json.to_sql('employees_json', con=engine, if_exists='replace', index=False)
df_xml.to_sql('employees_xml', con=engine, if_exists='replace', index=False)
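To run these steps on a schedule, you can wrap extract, transform, and load into a single function. A minimal sketch, assuming the placeholder JSON endpoint used earlier:

import pandas as pd
import requests
from sqlalchemy import create_engine

def run_etl(url, table, db_uri='sqlite:///employees.db'):
    # Extract: fetch JSON from the API
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    data = response.json()

    # Transform: flatten, deduplicate, fill gaps
    df = pd.json_normalize(data, 'employees')
    df = df.drop_duplicates()
    df['department'] = df['department'].fillna('Unknown')

    # Load: write to the target table
    engine = create_engine(db_uri)
    df.to_sql(table, con=engine, if_exists='replace', index=False)

# run_etl('https://api.example.com/employees', 'employees_json')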
With these steps, you can efficiently extract, clean, transform, and load JSON and XML data into your ETL pipelines using Python.
Conclusion
Extracting JSON and XML data is a core task in ETL workflows. By leveraging Python’s built-in libraries along with pandas and requests, you can handle a wide range of data sources effectively. Remember to:
- Understand the data format (JSON vs XML)
- Use robust methods for parsing and flattening nested structures
- Clean and transform data before loading
- Integrate data smoothly into ETL pipelines for analysis or storage
With these skills, you are ready to build scalable ETL pipelines and handle real-world JSON and XML datasets efficiently.