Building a Production-Grade ETL Pipeline in Python
Extract data from a REST API, transform it with Python, load it into PostgreSQL, schedule with cron, and handle errors with retries and dead letter queues.
ETL (Extract, Transform, Load) pipelines are the backbone of data engineering. They pull raw data from sources, clean and reshape it, and load it into a destination where analysts and applications can use it. In this tutorial, you'll build a production-grade ETL pipeline in Python that extracts data from a REST API, transforms it with validation and enrichment, loads it into PostgreSQL, runs on a schedule, and handles failures gracefully with retries and dead letter logging.
What you'll learn:
The pipeline extracts user activity data from a mock API, normalizes timestamps, computes derived metrics, and loads everything into a reporting table.
Create the project structure:
Set up .env:
For cron-based scheduling, add to your crontab:
Run the pipeline manually with python run.py run. Check dead letters with python run.py dead-letters. Start the built-in scheduler with python run.py schedule 300 for a five-minute interval. The pipeline tracks every run, retries transient failures automatically, and preserves unprocessable records for inspection. From here, extend it with parallel extraction, incremental loading using watermarks, or a web dashboard querying the pipeline_runs table.
etl-pipeline/
├── .env
├── config.py
├── extract.py
├── transform.py
├── load.py
├── pipeline.py
├── retry.py
├── dead_letter.py
└── scheduler.py

DATABASE_URL=postgresql://user:password@localhost:5432/etl_db
API_BASE_URL=https://jsonplaceholder.typicode.com
BATCH_SIZE=100
MAX_RETRIES=3

mkdir etl-pipeline && cd etl-pipeline
python -m venv venv
source venv/bin/activate
pip install psycopg2-binary requests python-dotenv

-- schema.sql
-- Target reporting table; the UNIQUE constraint backs the loader's upsert
-- (ON CONFLICT (user_id, post_id, activity_type) in load.py).
CREATE TABLE IF NOT EXISTS user_activities (
    id SERIAL PRIMARY KEY,
    user_id INTEGER NOT NULL,
    post_id INTEGER NOT NULL,
    activity_type VARCHAR(50) NOT NULL,
    title TEXT,
    body TEXT,
    word_count INTEGER,
    extracted_at TIMESTAMP NOT NULL,
    loaded_at TIMESTAMP DEFAULT NOW(),
    UNIQUE(user_id, post_id, activity_type)
);

-- One row per pipeline execution, for observability / dashboards.
CREATE TABLE IF NOT EXISTS pipeline_runs (
    id SERIAL PRIMARY KEY,
    started_at TIMESTAMP NOT NULL,
    completed_at TIMESTAMP,
    status VARCHAR(20) NOT NULL,
    records_extracted INTEGER DEFAULT 0,
    records_loaded INTEGER DEFAULT 0,
    errors INTEGER DEFAULT 0,
    error_message TEXT
);

-- Records that failed transformation, kept verbatim (JSONB) for inspection.
CREATE TABLE IF NOT EXISTS dead_letters (
    id SERIAL PRIMARY KEY,
    pipeline_run_id INTEGER REFERENCES pipeline_runs(id),
    raw_data JSONB NOT NULL,
    error_message TEXT NOT NULL,
    created_at TIMESTAMP DEFAULT NOW()
);
-- IF NOT EXISTS so ensure_schema() can re-run this file on every pipeline
-- start: a bare CREATE INDEX would error out on the second run.
CREATE INDEX IF NOT EXISTS idx_activities_user ON user_activities(user_id);
CREATE INDEX IF NOT EXISTS idx_pipeline_status ON pipeline_runs(status);

# Run ETL every hour
0 * * * * cd /path/to/etl-pipeline && /path/to/venv/bin/python run.py run >> /var/log/etl.log 2>&1

# config.py
# Central configuration, loaded once from the environment at import time.
import os
from dotenv import load_dotenv

# Pull variables from a local .env file into os.environ (no-op if absent).
load_dotenv()

# Required settings: fail fast with KeyError if they are missing.
DATABASE_URL = os.environ["DATABASE_URL"]
API_BASE_URL = os.environ["API_BASE_URL"]
# Optional settings with defaults; values arrive as strings, so convert.
BATCH_SIZE = int(os.environ.get("BATCH_SIZE", "100"))
MAX_RETRIES = int(os.environ.get("MAX_RETRIES", "3"))
# extract.py
import time
import requests
from typing import Generator
from config import API_BASE_URL
def extract_posts(page_size: int = 20) -> Generator[dict, None, None]:
    """Extract posts from the API with pagination and rate limiting.

    Yields raw post dicts page by page. Pagination stops when the server
    returns an empty page, or when the ``x-total-count`` header shows that
    every record has been fetched.

    Raises:
        requests.HTTPError: on non-2xx responses (via raise_for_status).
    """
    page = 1
    while True:
        url = f"{API_BASE_URL}/posts"
        params = {"_page": page, "_limit": page_size}
        response = requests.get(url, params=params, timeout=30)
        response.raise_for_status()
        data = response.json()
        if not data:
            break
        for record in data:
            yield record
        # Stop early only when the server actually tells us the total.
        # The previous code defaulted a missing x-total-count header to 0,
        # which made `page * page_size >= 0` always true and silently
        # truncated extraction to a single page.
        total_header = response.headers.get("x-total-count")
        try:
            total = int(total_header) if total_header is not None else None
        except ValueError:
            total = None  # unparseable header: fall back to empty-page detection
        if total is not None and page * page_size >= total:
            break
        page += 1
        time.sleep(0.5)  # Respect rate limits between page fetches
def extract_comments_for_post(post_id: int) -> list[dict]:
    """Fetch every comment attached to a single post from the API."""
    endpoint = f"{API_BASE_URL}/posts/{post_id}/comments"
    resp = requests.get(endpoint, timeout=30)
    resp.raise_for_status()
    return resp.json()
def extract_all() -> list[dict]:
    """Extract all posts, enriching each with its comment count."""
    enriched: list[dict] = []
    for post in extract_posts():
        # One extra API call per post; sleep briefly to stay polite.
        post["comment_count"] = len(extract_comments_for_post(post["id"]))
        enriched.append(post)
        time.sleep(0.2)
    return enriched
# transform.py
from datetime import datetime, timezone
from typing import Optional
class TransformError(Exception):
    """Raised when a raw record cannot be transformed.

    Carries the offending record so callers can dead-letter it.
    """
    def __init__(self, message: str, record: dict):
        super().__init__(message)
        # Keep the raw record for later inspection (dead-letter table).
        self.record = record
def validate_record(record: dict) -> bool:
    """Return True when every field the transform step needs is present."""
    for field in ("id", "userId", "title", "body"):
        if field not in record:
            return False
    return True
def compute_word_count(text: str) -> int:
    """Return the number of whitespace-separated words; 0 for empty input."""
    if not text:
        return 0
    return len(text.split())
def normalize_text(text: str) -> str:
    """Trim the ends and collapse internal whitespace runs to single spaces."""
    # str.split() with no argument already drops leading/trailing whitespace.
    return " ".join(text.split())
def transform_record(record: dict) -> Optional[dict]:
    """Transform a raw API record into the target schema.

    Returns None if the record should be skipped.
    Raises TransformError for invalid records.
    """
    if not validate_record(record):
        raise TransformError(f"Missing required fields in record {record.get('id')}", record)
    title = normalize_text(record["title"])
    body = normalize_text(record["body"])
    if len(title) < 3:
        raise TransformError(f"Title too short: '{title}'", record)
    # Wrap the id conversions: a non-numeric userId/id previously raised a
    # raw ValueError/TypeError, which escaped transform_batch's
    # `except TransformError` and crashed the whole run instead of
    # dead-lettering just this record.
    try:
        user_id = int(record["userId"])
        post_id = int(record["id"])
    except (TypeError, ValueError) as exc:
        raise TransformError(f"Non-numeric id fields in record {record.get('id')}", record) from exc
    return {
        "user_id": user_id,
        "post_id": post_id,
        "activity_type": "post",
        "title": title.title(),  # normalize casing for the reporting table
        "body": body,
        "word_count": compute_word_count(body),
        "extracted_at": datetime.now(timezone.utc).isoformat(),
    }
def transform_batch(records: list[dict]) -> tuple[list[dict], list[tuple[dict, str]]]:
    """Transform a batch of records.

    Returns (transformed_records, failed_records_with_errors); failures are
    (original_record, error_message) pairs destined for the dead-letter table.
    """
    ok: list[dict] = []
    bad: list[tuple[dict, str]] = []
    for raw in records:
        try:
            row = transform_record(raw)
        except TransformError as exc:
            bad.append((exc.record, str(exc)))
        else:
            if row is not None:
                ok.append(row)
    return ok, bad
# load.py
import psycopg2
from psycopg2.extras import execute_values
from config import DATABASE_URL, BATCH_SIZE
def get_connection():
    """Open a new connection to the configured PostgreSQL database."""
    return psycopg2.connect(DATABASE_URL)
def ensure_schema(conn) -> None:
    """Create the pipeline tables if they don't exist by executing schema.sql.

    NOTE(review): the path is relative to the working directory — the cron
    entry cd's into the project first; confirm for other launchers.
    """
    with open("schema.sql", "r") as schema_file:
        ddl = schema_file.read()
    with conn.cursor() as cur:
        cur.execute(ddl)
    conn.commit()
def load_records(conn, records: list[dict]) -> int:
    """Bulk upsert records into the target table.

    Rows are written in BATCH_SIZE chunks; the whole load is committed once
    at the end. Returns the number of rows affected.
    """
    if not records:
        return 0
    sql = """
    INSERT INTO user_activities (user_id, post_id, activity_type, title, body, word_count, extracted_at)
    VALUES %s
    ON CONFLICT (user_id, post_id, activity_type)
    DO UPDATE SET
    title = EXCLUDED.title,
    body = EXCLUDED.body,
    word_count = EXCLUDED.word_count,
    extracted_at = EXCLUDED.extracted_at,
    loaded_at = NOW()
    """
    columns = ("user_id", "post_id", "activity_type", "title", "body", "word_count", "extracted_at")
    loaded_count = 0
    with conn.cursor() as cur:
        for start in range(0, len(records), BATCH_SIZE):
            chunk = records[start : start + BATCH_SIZE]
            rows = [tuple(rec[col] for col in columns) for rec in chunk]
            execute_values(cur, sql, rows)
            loaded_count += len(chunk)
    conn.commit()
    return loaded_count
# retry.py
import time
import functools
from typing import TypeVar, Callable
T = TypeVar("T")

def with_retry(
    max_attempts: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    retryable_exceptions: tuple = (Exception,),
) -> Callable:
    """Decorator factory: retry the wrapped callable with exponential backoff.

    The delay doubles each attempt (base_delay * 2**(attempt-1)), capped at
    max_delay. After max_attempts failures the last exception is re-raised.
    Only exceptions listed in retryable_exceptions trigger a retry.
    """
    def decorator(func: Callable[..., T]) -> Callable[..., T]:
        @functools.wraps(func)
        def wrapper(*args, **kwargs) -> T:
            attempt = 0
            while True:
                attempt += 1
                try:
                    return func(*args, **kwargs)
                except retryable_exceptions as e:
                    if attempt >= max_attempts:
                        raise  # out of attempts: surface the failure
                    delay = min(base_delay * (2 ** (attempt - 1)), max_delay)
                    print(f"[retry] Attempt {attempt}/{max_attempts} failed: {e}")
                    print(f"[retry] Retrying in {delay:.1f}s...")
                    time.sleep(delay)
        return wrapper
    return decorator
# dead_letter.py
import json
from load import get_connection
def write_dead_letters(pipeline_run_id: int, failures: list[tuple[dict, str]]) -> int:
    """Persist unprocessable records so they can be inspected later.

    Each failure is stored as (run id, raw JSON payload, error message).
    Returns the number of rows written.
    """
    if not failures:
        return 0
    insert_sql = "INSERT INTO dead_letters (pipeline_run_id, raw_data, error_message) VALUES (%s, %s, %s)"
    conn = get_connection()
    try:
        with conn.cursor() as cur:
            for failed_record, reason in failures:
                cur.execute(insert_sql, (pipeline_run_id, json.dumps(failed_record), reason))
        conn.commit()
        return len(failures)
    finally:
        conn.close()
def get_dead_letters(limit: int = 50) -> list[dict]:
    """Fetch the most recent dead-letter rows as plain dicts for debugging."""
    query = "SELECT id, pipeline_run_id, raw_data, error_message, created_at FROM dead_letters ORDER BY created_at DESC LIMIT %s"
    conn = get_connection()
    try:
        with conn.cursor() as cur:
            cur.execute(query, (limit,))
            rows = cur.fetchall()
        columns = ("id", "pipeline_run_id", "raw_data", "error_message", "created_at")
        result = []
        for row in rows:
            entry = dict(zip(columns, row))
            entry["created_at"] = str(entry["created_at"])  # JSON/print friendly
            result.append(entry)
        return result
    finally:
        conn.close()
# pipeline.py
from datetime import datetime, timezone

import requests

from config import MAX_RETRIES
from dead_letter import write_dead_letters
from extract import extract_all
from load import get_connection, ensure_schema, load_records
from retry import with_retry
from transform import transform_batch
@with_retry(
    max_attempts=MAX_RETRIES,
    # requests failures derive from requests.exceptions.RequestException
    # (an IOError subclass), NOT the builtin ConnectionError/TimeoutError,
    # so the previous tuple never actually triggered a retry for network
    # errors raised by extract_all().
    retryable_exceptions=(requests.exceptions.RequestException, ConnectionError, TimeoutError),
)
def run_extraction() -> list[dict]:
    """Extract all records, retrying transient network failures with backoff."""
    return extract_all()
def run_pipeline() -> dict:
    """Execute the full ETL pipeline with run tracking.

    Returns a summary dict: status, extracted, loaded, errors,
    duration_seconds. Any failure is recorded on the pipeline_runs row
    and then re-raised to the caller.
    """
    conn = get_connection()
    ensure_schema(conn)
    # Create pipeline run record
    run_id = create_run(conn)
    started_at = datetime.now(timezone.utc)
    try:
        # Extract
        print("[extract] Starting extraction...")
        raw_records = run_extraction()
        print(f"[extract] Got {len(raw_records)} records")
        update_run(conn, run_id, records_extracted=len(raw_records))
        # Transform
        print("[transform] Transforming records...")
        transformed, failures = transform_batch(raw_records)
        print(f"[transform] Transformed: {len(transformed)}, Failed: {len(failures)}")
        # Load
        print("[load] Loading into database...")
        loaded = load_records(conn, transformed)
        print(f"[load] Loaded {loaded} records")
        # Dead letters: preserve unprocessable records for inspection
        if failures:
            dl_count = write_dead_letters(run_id, failures)
            print(f"[dead-letter] Wrote {dl_count} failed records")
        # Finalize
        complete_run(conn, run_id, "success", loaded, len(failures))
        return {
            "status": "success",
            "extracted": len(raw_records),
            "loaded": loaded,
            "errors": len(failures),
            "duration_seconds": (datetime.now(timezone.utc) - started_at).total_seconds(),
        }
    except Exception as e:
        # A failed SQL statement leaves the psycopg2 connection in an
        # aborted transaction; without a rollback the bookkeeping UPDATE in
        # complete_run would itself fail (InFailedSqlTransaction) and mask
        # the original error.
        try:
            conn.rollback()
        except Exception:
            pass  # best effort — the connection itself may be gone
        complete_run(conn, run_id, "failed", 0, 1, str(e))
        raise
    finally:
        conn.close()
def create_run(conn) -> int:
    """Insert a new 'running' pipeline_runs row and return its id."""
    insert_sql = "INSERT INTO pipeline_runs (started_at, status) VALUES (%s, %s) RETURNING id"
    with conn.cursor() as cur:
        cur.execute(insert_sql, (datetime.now(timezone.utc), "running"))
        new_id = cur.fetchone()[0]
    conn.commit()
    return new_id
def update_run(conn, run_id: int, **kwargs) -> None:
    """Apply arbitrary column updates to one pipeline_runs row.

    Column names come from code (kwargs keys), not user input, so the
    f-string SET clause is safe here; values still go through parameters.
    """
    assignments = ", ".join(f"{column} = %s" for column in kwargs)
    params = list(kwargs.values()) + [run_id]
    with conn.cursor() as cur:
        cur.execute(f"UPDATE pipeline_runs SET {assignments} WHERE id = %s", params)
    conn.commit()
def complete_run(conn, run_id: int, status: str, loaded: int, errors: int, error_msg: str = None) -> None:
    """Mark a pipeline run finished, recording status, counts and any error."""
    update_sql = "UPDATE pipeline_runs SET completed_at = %s, status = %s, records_loaded = %s, errors = %s, error_message = %s WHERE id = %s"
    params = (datetime.now(timezone.utc), status, loaded, errors, error_msg, run_id)
    with conn.cursor() as cur:
        cur.execute(update_sql, params)
    conn.commit()
# scheduler.py
import time
import signal
import sys
from datetime import datetime
from pipeline import run_pipeline
class PipelineScheduler:
    """Run the ETL pipeline forever at a fixed interval.

    SIGINT/SIGTERM set a flag instead of killing the process, so an
    in-flight run finishes before the loop exits.
    """

    def __init__(self, interval_seconds: int = 3600):
        self.interval = interval_seconds
        self.running = True
        # Install handlers so Ctrl-C / kill request a graceful stop.
        for sig in (signal.SIGINT, signal.SIGTERM):
            signal.signal(sig, self._shutdown)

    def _shutdown(self, signum, frame):
        """Signal handler: flag the loop to stop after the current run."""
        print(f"\n[scheduler] Received signal {signum}, shutting down...")
        self.running = False

    def start(self) -> None:
        """Loop: run the pipeline, then sleep until the next interval."""
        print(f"[scheduler] Starting with {self.interval}s interval")
        while self.running:
            print(f"\n[scheduler] Pipeline run starting at {datetime.now().isoformat()}")
            try:
                result = run_pipeline()
                print(f"[scheduler] Complete: {result}")
            except Exception as e:
                print(f"[scheduler] Pipeline failed: {e}")
            # Sleep one second at a time so a shutdown signal is honored quickly.
            remaining = self.interval
            while remaining > 0 and self.running:
                time.sleep(1)
                remaining -= 1
        print("[scheduler] Stopped")
# Allow running the scheduler directly: python scheduler.py [interval_seconds]
if __name__ == "__main__":
    # Optional first CLI argument overrides the default hourly interval.
    interval = int(sys.argv[1]) if len(sys.argv) > 1 else 3600
    scheduler = PipelineScheduler(interval)
    scheduler.start()
# run.py
import sys
from pipeline import run_pipeline
from dead_letter import get_dead_letters
def main() -> None:
    """CLI dispatcher: run | dead-letters | schedule [interval_seconds]."""
    command = sys.argv[1] if len(sys.argv) > 1 else "run"
    if command == "run":
        result = run_pipeline()
        print(f"\nPipeline result: {result}")
    elif command == "dead-letters":
        for dl in get_dead_letters(limit=20):
            print(f"[{dl['created_at']}] Run #{dl['pipeline_run_id']}: {dl['error_message']}")
            print(f" Data: {dl['raw_data']}\n")
    elif command == "schedule":
        # Imported lazily so plain runs don't register signal handlers.
        from scheduler import PipelineScheduler
        interval = int(sys.argv[2]) if len(sys.argv) > 2 else 3600
        PipelineScheduler(interval).start()
    else:
        print("Usage: python run.py [run|dead-letters|schedule [interval_seconds]]")
# Script entry point: dispatch on the first CLI argument.
if __name__ == "__main__":
    main()