Data Cleaning is Pipeline Architecture
Before any tool or tutorial: every transformation you apply is a contract β it must be traceable, repeatable, and scoped. One undocumented transformation buried in a loop can silently corrupt downstream analysis.
Scale Awareness
Working with 1.8M+ row raw .txt files requires different tools than a 10K CSV. Know when to swap pandas for polars or PySpark.
Reproducibility
Every cleaning function should be idempotent β run it 10 times, get the same output. No side effects, no silent mutations.
Audit Trails
Document every decision: why a null was dropped, why a column was renamed, why a row was excluded. Future you will thank you.
Raw Data is Sacred
Never overwrite your source files. The raw zone is read-only. Always. Even if your professor gives you a single messy .txt file.
ydata-profiling or sweetviz generate this automatically.Pipeline Zone Architecture
No matter how long your pipeline is, you must define zones of trust. These zones apply whether you're using Python, SQL, dbt, or a no-code tool.
Untouched source data. Never modified. Store read-only. Hash-checksum it.
data/raw/source_2024.txtTransformations happen here. Every step logged. Types inferred. Nulls handled. Duplicates removed.
data/staging/cleaned_v1.csvValidated, typed, deduplicated. Schema assertions pass. Row count reconciled.
data/clean/final.csvDerived only from the clean zone. Never sourced from staging or raw.
notebooks/analysis.ipynbEnforcing Zones Without a Database
project/
βββ data/
β βββ raw/ β READ ONLY. Never touch after initial save.
β βββ staging/ β Working copies, intermediate transforms
β βββ clean/ β Final validated output
βββ notebooks/
β βββ 01_profile.ipynb
β βββ 02_clean.ipynb
β βββ 03_analysis.ipynb
βββ scripts/
β βββ clean_pipeline.py
βββ logs/
βββ transform_audit.logdata/raw/, it is wrong. Make raw read-only at the OS level: chmod 444 data/raw/* on Mac/Linux.Data Integrity Practices
These practices apply even when you have no database, no schema enforcement, nothing but Python and a text file.
Hash the raw file before and after loading to confirm you haven't mutated it.
import hashlib
def file_hash(path):
h = hashlib.md5()
with open(path, 'rb') as f:
h.update(f.read())
return h.hexdigest()
raw_hash = file_hash('data/raw/source.txt')
print(f"Raw hash: {raw_hash}")
# Save this hash. If it ever changes, someone touched raw data.After every transformation step, assert expected column types and value ranges.
import pandera as pa
schema = pa.DataFrameSchema({
"age": pa.Column(int, pa.Check.between(0, 120)),
"email": pa.Column(str, pa.Check(lambda s: s.str.contains('@'))),
"revenue": pa.Column(float, pa.Check.ge(0), nullable=True),
})
validated_df = schema.validate(df)Log input row count vs. output row count at every stage. Any unexplained drop is a bug.
def log_transform(df_in, df_out, step_name):
dropped = len(df_in) - len(df_out)
print(f"[{step_name}] IN: {len(df_in):,} | OUT: {len(df_out):,} | DROPPED: {dropped:,}")
if dropped < 0:
print(f" β οΈ Row count INCREASED β check for unintended duplications")
return df_out
df_clean = log_transform(df_raw, df_deduped, "deduplication")Document why each null exists. There are different types of missing data β confusing them leads to wrong imputation.
| Type | Meaning | Correct Action |
|---|---|---|
| MCAR | Missing Completely At Random β no pattern | Safe to drop or simple impute |
| MAR | Missing At Random β pattern in other columns | Model-based imputation (KNN, regression) |
| MNAR | Missing Not At Random β absence IS the data | Flag it; don't impute blindly |
| Introduced | Created by a bad join or parsing error | Fix the upstream transform |
Every cleaning function must produce the same output no matter how many times it runs on the same input.
# β
Idempotent β safe to re-run
def normalize_emails(df):
df = df.copy() # never mutate in place
df['email'] = df['email'].str.lower().str.strip()
return df
# β NOT idempotent β appends a suffix every run
def tag_rows(df):
df['label'] = df['label'] + '_processed' # stacks on reruns
return dfPython Deep Dive
You know pandas β now level up to the tools that handle 1.8M rows without choking.
Polars β 10β50x Faster Than Pandas
Rust-backed DataFrame library with lazy evaluation. For 1M+ row files, this is no longer optional.
import polars as pl
df = (
pl.scan_csv("data/raw/large_dataset.csv") # doesn't load into memory yet
.filter(pl.col("revenue") > 0)
.with_columns([
pl.col("email").str.to_lowercase().alias("email_clean"),
pl.col("date").str.strptime(pl.Date, "%Y-%m-%d").alias("date_parsed"),
])
.drop_nulls(subset=["customer_id"])
.collect() # NOW it executes
)
print(df.shape)scan_csv() instead of read_csv() for large files β it streams and only loads what the query needs.pyjanitor β Readable Pipeline Chains
Method-chaining API built on pandas. Makes cleaning pipelines read like prose.
import janitor
import pandas as pd
df = (
pd.read_csv("data/raw/messy.csv")
.clean_names()
.remove_empty()
.rename_column("cust_nm", "customer_name")
.drop_duplicates(subset="customer_id")
.filter_on("revenue > 0")
.fill_empty(column_names=["region"], value="UNKNOWN")
)ftfy β Fix Raw Text Encoding Problems
When your professor gives you a raw .txt file and strings look like Γ’β¬β’ β that's broken Unicode. ftfy fixes it automatically.
import ftfy, chardet
with open("data/raw/source.txt", "rb") as f:
detected = chardet.detect(f.read(100000))
print(detected) # {'encoding': 'Windows-1252', 'confidence': 0.73}
with open("data/raw/source.txt", encoding=detected['encoding'], errors='replace') as f:
raw_text = f.read()
fixed_text = ftfy.fix_text(raw_text)
df['description'] = df['description'].apply(ftfy.fix_text)Windows-1252.pandera β Schema Contracts for DataFrames
Define what your data should look like after cleaning. Run validation at every stage boundary.
import pandera as pa
clean_schema = pa.DataFrameSchema({
"customer_id": pa.Column(str, pa.Check(lambda s: s.str.len() == 8)),
"age": pa.Column(int, pa.Check.between(18, 100), nullable=False),
"revenue": pa.Column(float, pa.Check.ge(0), nullable=True),
})
try:
validated = clean_schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as e:
print(e.failure_cases)Record Linkage β Merging When IDs Don't Match
When datasets should join but IDs are inconsistent, record linkage finds probable matches without exact keys.
import recordlinkage
indexer = recordlinkage.Index()
indexer.block(left_on="last_name", right_on="last_name")
candidate_pairs = indexer.index(df_a, df_b)
compare = recordlinkage.Compare()
compare.string("first_name", "first_name", method="jarowinkler", threshold=0.85)
compare.exact("zip_code", "zip_code")
features = compare.compute(candidate_pairs, df_a, df_b)
matches = features[features.sum(axis=1) >= 2]
print(f"Found {len(matches):,} probable matches")YouTube Tutorials
Other Tools to Learn
Python is not always the right tool. Know when to reach for something else.
Pipeline-stage transformations with version control. Learn this first after Python.
β getdbt.comGUI tool for clustering similar text entries. Ideal for fixing inconsistent category spellings without writing code.
β openrefine.orgDistributed cleaning for truly massive datasets. When polars hits memory limits, PySpark distributes the work.
β PySpark DocsEnterprise-grade visual pipeline builder. No-code ETL with built-in data quality tools.
β alteryx.comVisual drag-and-drop transformation builder. Great for profiling messy data.
β Tableau PrepM language transformation engine built into Excel and Power BI. Widely used in business environments.
β Power Query DocsScenario Selector
Filter by your main problem to find the right cleaning approach.
Mixed encoding in raw .txt files
Best approach: chardet β open with detected encoding β ftfy to fix mojibake
Inconsistent category spellings ("NY", "new york", "N.Y.")
Best approach: OpenRefine clustering or rapidfuzz for automated fuzzy matching at scale
1M+ rows making pandas slow or crashing
Best approach: polars with scan_csv() for lazy evaluation, or PySpark for distributed workloads
Need to validate data types after each transform
Best approach: pandera schema assertions at every pipeline stage; use lazy=True to collect all errors at once
Two datasets that won't join cleanly (IDs don't match)
Best approach: recordlinkage for probabilistic matching with blocking; always manually review a sample
Large number of nulls β not sure what to do
Best approach: Classify first (MCAR/MAR/MNAR). MCAR β drop/impute. MAR β KNN or regression. MNAR β flag only.
Missing DataNeed to track all transformations for reproducibility
Best approach: dbt for SQL pipelines, or a Python audit log with timestamps and row counts per step
Date formats inconsistent across rows
Best approach: dateparser handles most formats automatically; fall back to regex for edge cases
Pipeline needs to run automatically without manual review
Best approach: Idempotent functions + pandera validation + row count logging. Fail loudly on violations.
Books & Resources
Ordered by where you are in your learning journey.
Covers Python, R, and CLI tools for real-world messy data. Start here.
The pandas bible, written by its creator.
Bridges the gap between cleaning and insight. Full lifecycle coverage.
Read when you're ready to think in proper pipeline architecture at scale.
Best reference for lazy evaluation, expressions, and large-file efficiency.
Schema validation, custom checks, and CI pipeline integration.
Live Sandbox
Edit the sample data below or paste your own CSV, toggle operations, and click Run to see results.