Back to blog

Blog

Designing Reliable Data Pipelines

Shivank Poudel2 min read
data-engineeringetlreliability

A pipeline that runs is easy. A pipeline you can trust is the real work. Most of the difference comes down to a handful of habits that you can apply to almost any batch job.

1. Make every step idempotent

Re-running a step should produce the same result, not duplicate it. The simplest way is to write to a partition you can safely overwrite.

def write_day(df, day: str):
    # Overwrite just this day's partition — re-runs are safe.
    path = f"warehouse/events/date={day}"
    df.write.mode("overwrite").parquet(path)

If a job dies halfway, you just run it again. No manual cleanup, no fear.

2. Validate the schema at the door

Bad data is cheapest to catch the moment it arrives. A few assertions at the ingestion boundary save hours of debugging downstream.

EXPECTED = {"user_id": "int64", "event": "object", "ts": "datetime64[ns]"}

def validate(df):
    for col, dtype in EXPECTED.items():
        assert col in df.columns, f"missing column: {col}"
        assert str(df[col].dtype) == dtype, f"bad type for {col}"
    assert df["user_id"].notna().all(), "null user_id"

3. Make it observable

You can't fix what you can't see. Every run should leave a trail.

Log the inputs, row counts, and durations of each stage. When something looks off three weeks from now, those numbers are the difference between a five-minute answer and an afternoon of guessing.

A minimal health table goes a long way:

Signal Why it matters
Rows in / rows out Catches silent drops and accidental fan-out
Run duration Surfaces slow degradation before it pages you
Freshness lag Tells you the data is current, not just present

4. Prefer small, recoverable steps

A pipeline of small steps with clear inputs and outputs is far easier to reason about than one giant transform. When a step fails, you re-run that step — not the whole chain.

  • Break work along natural boundaries (extract, clean, aggregate, publish)
  • Persist between expensive stages so you can resume
  • Keep each step doing one thing

The payoff

None of this is glamorous, but together these habits turn a pipeline from "hope it worked" into "I can prove it worked." That confidence is what lets you build the next layer on top without flinching.