Blog
Designing Reliable Data Pipelines
A pipeline that runs is easy. A pipeline you can trust is the real work. Most of the difference comes down to a handful of habits that you can apply to almost any batch job.
1. Make every step idempotent
Re-running a step should produce the same result, not duplicate it. The simplest way is to write to a partition you can safely overwrite.
def write_day(df, day: str):
# Overwrite just this day's partition — re-runs are safe.
path = f"warehouse/events/date={day}"
df.write.mode("overwrite").parquet(path)
If a job dies halfway, you just run it again. No manual cleanup, no fear.
2. Validate the schema at the door
Bad data is cheapest to catch the moment it arrives. A few assertions at the ingestion boundary save hours of debugging downstream.
EXPECTED = {"user_id": "int64", "event": "object", "ts": "datetime64[ns]"}
def validate(df):
for col, dtype in EXPECTED.items():
assert col in df.columns, f"missing column: {col}"
assert str(df[col].dtype) == dtype, f"bad type for {col}"
assert df["user_id"].notna().all(), "null user_id"
3. Make it observable
You can't fix what you can't see. Every run should leave a trail.
Log the inputs, row counts, and durations of each stage. When something looks off three weeks from now, those numbers are the difference between a five-minute answer and an afternoon of guessing.
A minimal health table goes a long way:
| Signal | Why it matters |
|---|---|
| Rows in / rows out | Catches silent drops and accidental fan-out |
| Run duration | Surfaces slow degradation before it pages you |
| Freshness lag | Tells you the data is current, not just present |
4. Prefer small, recoverable steps
A pipeline of small steps with clear inputs and outputs is far easier to reason about than one giant transform. When a step fails, you re-run that step — not the whole chain.
- Break work along natural boundaries (extract, clean, aggregate, publish)
- Persist between expensive stages so you can resume
- Keep each step doing one thing
The payoff
None of this is glamorous, but together these habits turn a pipeline from "hope it worked" into "I can prove it worked." That confidence is what lets you build the next layer on top without flinching.