Data Cleaning is Pipeline Architecture

Before any tool or tutorial: every transformation you apply is a contract β€” it must be traceable, repeatable, and scoped. One undocumented transformation buried in a loop can silently corrupt downstream analysis.

πŸ“

Scale Awareness

Working with 1.8M+ row raw .txt files requires different tools than a 10K CSV. Know when to swap pandas for polars or PySpark.

πŸ”

Reproducibility

Every cleaning function should be idempotent β€” run it 10 times, get the same output. No side effects, no silent mutations.

πŸ“‹

Audit Trails

Document every decision: why a null was dropped, why a column was renamed, why a row was excluded. Future you will thank you.

πŸ”’

Raw Data is Sacred

Never overwrite your source files. The raw zone is read-only. Always. Even if your professor gives you a single messy .txt file.

πŸ’‘ The One Habit That Would Save Most Projects: Write a data quality report before analysis begins β€” not after. Profile every column: null %, unique value counts, type distribution, min/max. Tools like ydata-profiling or sweetviz generate this automatically.

Pipeline Zone Architecture

No matter how long your pipeline is, you must define zones of trust. These zones apply whether you're using Python, SQL, dbt, or a no-code tool.

RAW ZONE

Untouched source data. Never modified. Store read-only. Hash-checksum it.

data/raw/source_2024.txt
β–Ά
STAGING ZONE

Transformations happen here. Every step logged. Types inferred. Nulls handled. Duplicates removed.

data/staging/cleaned_v1.csv
β–Ά
CLEAN ZONE

Validated, typed, deduplicated. Schema assertions pass. Row count reconciled.

data/clean/final.csv
β–Ά
ANALYSIS ZONE

Derived only from the clean zone. Never sourced from staging or raw.

notebooks/analysis.ipynb

Enforcing Zones Without a Database

project/
β”œβ”€β”€ data/
β”‚   β”œβ”€β”€ raw/          ← READ ONLY. Never touch after initial save.
β”‚   β”œβ”€β”€ staging/      ← Working copies, intermediate transforms
β”‚   └── clean/        ← Final validated output
β”œβ”€β”€ notebooks/
β”‚   β”œβ”€β”€ 01_profile.ipynb
β”‚   β”œβ”€β”€ 02_clean.ipynb
β”‚   └── 03_analysis.ipynb
β”œβ”€β”€ scripts/
β”‚   └── clean_pipeline.py
└── logs/
    └── transform_audit.log
⚠️ Rule: If a script writes to data/raw/, it is wrong. Make raw read-only at the OS level: chmod 444 data/raw/* on Mac/Linux.

Data Integrity Practices

These practices apply even when you have no database, no schema enforcement, nothing but Python and a text file.

Hash the raw file before and after loading to confirm you haven't mutated it.

import hashlib

def file_hash(path):
    h = hashlib.md5()
    with open(path, 'rb') as f:
        h.update(f.read())
    return h.hexdigest()

raw_hash = file_hash('data/raw/source.txt')
print(f"Raw hash: {raw_hash}")
# Save this hash. If it ever changes, someone touched raw data.

After every transformation step, assert expected column types and value ranges.

import pandera as pa

schema = pa.DataFrameSchema({
    "age": pa.Column(int, pa.Check.between(0, 120)),
    "email": pa.Column(str, pa.Check(lambda s: s.str.contains('@'))),
    "revenue": pa.Column(float, pa.Check.ge(0), nullable=True),
})

validated_df = schema.validate(df)

Log input row count vs. output row count at every stage. Any unexplained drop is a bug.

def log_transform(df_in, df_out, step_name):
    dropped = len(df_in) - len(df_out)
    print(f"[{step_name}] IN: {len(df_in):,} | OUT: {len(df_out):,} | DROPPED: {dropped:,}")
    if dropped < 0:
        print(f"  ⚠️ Row count INCREASED β€” check for unintended duplications")
    return df_out

df_clean = log_transform(df_raw, df_deduped, "deduplication")

Document why each null exists. There are different types of missing data β€” confusing them leads to wrong imputation.

TypeMeaningCorrect Action
MCARMissing Completely At Random β€” no patternSafe to drop or simple impute
MARMissing At Random β€” pattern in other columnsModel-based imputation (KNN, regression)
MNARMissing Not At Random β€” absence IS the dataFlag it; don't impute blindly
IntroducedCreated by a bad join or parsing errorFix the upstream transform

Every cleaning function must produce the same output no matter how many times it runs on the same input.

# βœ… Idempotent β€” safe to re-run
def normalize_emails(df):
    df = df.copy()  # never mutate in place
    df['email'] = df['email'].str.lower().str.strip()
    return df

# ❌ NOT idempotent β€” appends a suffix every run
def tag_rows(df):
    df['label'] = df['label'] + '_processed'  # stacks on reruns
    return df

Python Deep Dive

You know pandas β€” now level up to the tools that handle 1.8M rows without choking.

Polars β€” 10–50x Faster Than Pandas

Rust-backed DataFrame library with lazy evaluation. For 1M+ row files, this is no longer optional.

import polars as pl

df = (
    pl.scan_csv("data/raw/large_dataset.csv")  # doesn't load into memory yet
    .filter(pl.col("revenue") > 0)
    .with_columns([
        pl.col("email").str.to_lowercase().alias("email_clean"),
        pl.col("date").str.strptime(pl.Date, "%Y-%m-%d").alias("date_parsed"),
    ])
    .drop_nulls(subset=["customer_id"])
    .collect()  # NOW it executes
)

print(df.shape)
Use scan_csv() instead of read_csv() for large files β€” it streams and only loads what the query needs.

pyjanitor β€” Readable Pipeline Chains

Method-chaining API built on pandas. Makes cleaning pipelines read like prose.

import janitor
import pandas as pd

df = (
    pd.read_csv("data/raw/messy.csv")
    .clean_names()
    .remove_empty()
    .rename_column("cust_nm", "customer_name")
    .drop_duplicates(subset="customer_id")
    .filter_on("revenue > 0")
    .fill_empty(column_names=["region"], value="UNKNOWN")
)

ftfy β€” Fix Raw Text Encoding Problems

When your professor gives you a raw .txt file and strings look like Ò€ℒ β€” that's broken Unicode. ftfy fixes it automatically.

import ftfy, chardet

with open("data/raw/source.txt", "rb") as f:
    detected = chardet.detect(f.read(100000))
    print(detected)  # {'encoding': 'Windows-1252', 'confidence': 0.73}

with open("data/raw/source.txt", encoding=detected['encoding'], errors='replace') as f:
    raw_text = f.read()

fixed_text = ftfy.fix_text(raw_text)
df['description'] = df['description'].apply(ftfy.fix_text)
Always detect encoding before loading. UTF-8 is not a safe assumption β€” many professor files are exported from Excel as Windows-1252.

pandera β€” Schema Contracts for DataFrames

Define what your data should look like after cleaning. Run validation at every stage boundary.

import pandera as pa

clean_schema = pa.DataFrameSchema({
    "customer_id": pa.Column(str, pa.Check(lambda s: s.str.len() == 8)),
    "age": pa.Column(int, pa.Check.between(18, 100), nullable=False),
    "revenue": pa.Column(float, pa.Check.ge(0), nullable=True),
})

try:
    validated = clean_schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as e:
    print(e.failure_cases)

Record Linkage β€” Merging When IDs Don't Match

When datasets should join but IDs are inconsistent, record linkage finds probable matches without exact keys.

import recordlinkage

indexer = recordlinkage.Index()
indexer.block(left_on="last_name", right_on="last_name")
candidate_pairs = indexer.index(df_a, df_b)

compare = recordlinkage.Compare()
compare.string("first_name", "first_name", method="jarowinkler", threshold=0.85)
compare.exact("zip_code", "zip_code")

features = compare.compute(candidate_pairs, df_a, df_b)
matches = features[features.sum(axis=1) >= 2]
print(f"Found {len(matches):,} probable matches")
Always manually review a sample. Record linkage produces probable matches, not guaranteed ones.

YouTube Tutorials

Other Tools to Learn

Python is not always the right tool. Know when to reach for something else.

SQL + dbt

Pipeline-stage transformations with version control. Learn this first after Python.

FreeAny scale
β†’ getdbt.com
OpenRefine

GUI tool for clustering similar text entries. Ideal for fixing inconsistent category spellings without writing code.

FreeSmall–Medium
β†’ openrefine.org
Apache Spark / PySpark

Distributed cleaning for truly massive datasets. When polars hits memory limits, PySpark distributes the work.

FreeMassive scale
β†’ PySpark Docs
Alteryx

Enterprise-grade visual pipeline builder. No-code ETL with built-in data quality tools.

Large
β†’ alteryx.com
Tableau Prep

Visual drag-and-drop transformation builder. Great for profiling messy data.

Medium
β†’ Tableau Prep
Power Query

M language transformation engine built into Excel and Power BI. Widely used in business environments.

Small–Medium
β†’ Power Query Docs

Scenario Selector

Filter by your main problem to find the right cleaning approach.

Mixed encoding in raw .txt files

Best approach: chardet β†’ open with detected encoding β†’ ftfy to fix mojibake

Encoding

Inconsistent category spellings ("NY", "new york", "N.Y.")

Best approach: OpenRefine clustering or rapidfuzz for automated fuzzy matching at scale

CategoriesDuplicates

1M+ rows making pandas slow or crashing

Best approach: polars with scan_csv() for lazy evaluation, or PySpark for distributed workloads

Scale

Need to validate data types after each transform

Best approach: pandera schema assertions at every pipeline stage; use lazy=True to collect all errors at once

Validation

Two datasets that won't join cleanly (IDs don't match)

Best approach: recordlinkage for probabilistic matching with blocking; always manually review a sample

Duplicates

Large number of nulls β€” not sure what to do

Best approach: Classify first (MCAR/MAR/MNAR). MCAR β†’ drop/impute. MAR β†’ KNN or regression. MNAR β†’ flag only.

Missing Data

Need to track all transformations for reproducibility

Best approach: dbt for SQL pipelines, or a Python audit log with timestamps and row counts per step

Validation

Date formats inconsistent across rows

Best approach: dateparser handles most formats automatically; fall back to regex for edge cases

Categories

Pipeline needs to run automatically without manual review

Best approach: Idempotent functions + pandera validation + row count logging. Fail loudly on violations.

ScaleMissing Data

Books & Resources

Ordered by where you are in your learning journey.

Live Sandbox

Edit the sample data below or paste your own CSV, toggle operations, and click Run to see results.

Click "Run Cleaning" to see results...