Batch Data Processing in Python: Combining and Normalizing Datasets


Table of Contents

  • Why Batch Data Processing Matters
  • Setting Up Your Python Environment
  • Reading Multiple Datasets Efficiently
  • Combining Datasets
  • Normalizing and Cleaning Data
  • Performance Optimization Strategies
  • End-to-End Pipeline Example
  • Advanced Tips and Best Practices

Why Batch Data Processing Matters

In today's data-driven world, organizations generate and store enormous volumes of data from various sources — sales reports, marketing insights, sensor logs, and user behavior metrics, to name a few. However, these datasets are often distributed across multiple files, databases, or systems, making it challenging to analyze them directly.

Batch data processing allows us to handle these datasets efficiently by reading, merging, transforming, and standardizing them into a unified format. This step is critical for business intelligence, predictive modeling, and machine learning pipelines.

Setting Up Your Python Environment

Before diving into batch processing, let's ensure we have the right tools installed. Python's Pandas library is the go-to choice for data manipulation, while NumPy and Scikit-learn help with normalization and transformations.


pip install pandas numpy scikit-learn pyarrow

Why these libraries?

  • Pandas: For reading, merging, and transforming datasets.
  • NumPy: For handling large numerical arrays efficiently.
  • Scikit-learn: For data normalization and preprocessing.
  • PyArrow: For working with fast columnar formats like Parquet.

Reading Multiple Datasets Efficiently

Let's start by reading multiple datasets stored in CSV format. Assume we have monthly sales reports named sales_2023_01.csv, sales_2023_02.csv, and so on.


import pandas as pd
import glob

# Fetch all CSV files matching the pattern (sorted so months load in chronological order)
csv_files = sorted(glob.glob("data/sales_*.csv"))

# Read each CSV into a list of DataFrames
dfs = [pd.read_csv(file) for file in csv_files]

print(f"Successfully loaded {len(dfs)} datasets.")

If your datasets come from different formats, Pandas also supports Excel, JSON, SQL, and Parquet seamlessly.
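
For instance, here is a minimal sketch of pulling a few other formats into the same dfs list; the file names are hypothetical, and pd.read_excel additionally requires the openpyxl package:

excel_df = pd.read_excel("data/sales_2023_01.xlsx")
json_df = pd.read_json("data/sales_2023_01.json")
parquet_df = pd.read_parquet("data/sales_2023_01.parquet")  # uses the pyarrow engine installed earlier

dfs.extend([excel_df, json_df, parquet_df])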

Combining Datasets

After reading multiple datasets, the next step is combining them into a single DataFrame. The approach depends on whether your datasets share the same schema or not.

Vertical Concatenation

If every file shares the same columns, simply stack them row by row:

df = pd.concat(dfs, ignore_index=True)

Handling Inconsistent Column Names

If the files label the same fields differently, rename them to a common schema before concatenating:

rename_map = {
    "Date": "date",
    "Sales": "sales",
    "Product": "product"
}
dfs = [df.rename(columns=rename_map) for df in dfs]
df = pd.concat(dfs, ignore_index=True)

Horizontal Merging

Sometimes, your datasets represent different aspects of the same entity. For example, combining sales and customer information:


df = pd.merge(sales_df, customer_df, on="customer_id", how="left")

Here, how="left" keeps every sales row even when no matching customer record exists; the unmatched customer columns are simply filled with NaN.

Normalizing and Cleaning Data

Data normalization ensures consistency and prepares the dataset for downstream analytics and machine learning workflows.

Handling Missing Values

Fill numeric gaps with a sensible default, and drop rows whose key fields are missing entirely:

df = df.fillna({"sales": 0})
df = df.dropna(subset=["product"])

Standardizing Date Formats

Convert mixed date strings into proper datetimes; errors="coerce" turns unparseable values into NaT instead of raising an exception:

df["date"] = pd.to_datetime(df["date"], errors="coerce")

Cleaning Categorical Variables

Normalize text categories so that variants like "Widget " and "widget" map to the same value:

df["product"] = df["product"].str.lower().str.strip()

Min-Max Normalization

Min-max scaling rescales a column to the [0, 1] range, keeping features on a comparable scale for downstream models:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df["sales_normalized"] = scaler.fit_transform(df[["sales"]])

Performance Optimization Strategies

When processing large-scale datasets, performance becomes crucial. Here are some optimization tips:

  • Use Parquet instead of CSV for faster reads and writes:

df.to_parquet("combined.parquet", engine="pyarrow")

  • Read large CSVs in chunks to keep memory usage bounded:

for chunk in pd.read_csv("bigdata.csv", chunksize=500000):
    process(chunk)  # process() stands in for your own per-chunk logic

  • Leverage alternative libraries: use Dask or Polars for TB-scale datasets (see the sketch after this list).
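
As a minimal sketch of the Dask option, assuming dask[dataframe] is installed and the same data/sales_*.csv layout as above, the read-and-aggregate work can be expressed lazily and computed out of core:

import dask.dataframe as dd

# Dask reads the glob pattern lazily and partitions the files automatically
ddf = dd.read_csv("data/sales_*.csv")

# Operations build a task graph; compute() materializes a regular Pandas result
total_by_product = ddf.groupby("product")["sales"].sum().compute()
print(total_by_product)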

End-to-End Pipeline Example

Let's bring everything together with a complete pipeline:


import pandas as pd
import glob
from sklearn.preprocessing import MinMaxScaler

# 1. Read multiple CSVs
csv_files = sorted(glob.glob("data/sales_*.csv"))
dfs = [pd.read_csv(file) for file in csv_files]

# 2. Combine datasets
df = pd.concat(dfs, ignore_index=True)

# 3. Clean and normalize data
df = df.fillna({"sales": 0})
df["date"] = pd.to_datetime(df["date"], errors="coerce")
df["product"] = df["product"].str.lower().str.strip()

scaler = MinMaxScaler()
df["sales_normalized"] = scaler.fit_transform(df[["sales"]])

# 4. Save processed data
df.to_parquet("processed_sales.parquet", engine="pyarrow")

Advanced Tips and Best Practices

  • Use schema validation libraries like pandera to enforce consistent data structures.
  • Implement fuzzy matching to handle inconsistent column names automatically.
  • Leverage parallel processing with multiprocessing for faster ETL pipelines (a sketch follows this list).
  • Integrate with workflow orchestrators like Apache Airflow or Prefect for production-grade batch pipelines.
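
To illustrate the parallel-processing tip, here is a minimal sketch that reads and cleans the monthly files in worker processes before concatenating them; the load() helper is hypothetical and stands in for whatever per-file logic you need:

import glob
from multiprocessing import Pool

import pandas as pd

def load(path):
    # Runs in a worker process: read one file and apply per-file cleaning
    df = pd.read_csv(path)
    df["product"] = df["product"].str.lower().str.strip()
    return df

if __name__ == "__main__":
    files = sorted(glob.glob("data/sales_*.csv"))
    with Pool() as pool:  # one worker per CPU core by default
        dfs = pool.map(load, files)
    combined = pd.concat(dfs, ignore_index=True)
    print(f"Combined {len(files)} files into {len(combined)} rows.")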

By following these practices, you'll create a robust, efficient, and scalable data processing pipeline ready for analytics and machine learning.
