Implement metrics, structured logging, tracing, and alerting so you know when things break before your users tell you.
Monitoring answers "is the system working?" Observability answers "why isn't it working?" You need three types of telemetry: logs (what happened), metrics (how much and how fast), and traces (where a request spent its time).
You don't need all three on day one. Start with structured logging and basic metrics. Add tracing when you have multiple services or complex request flows.
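The core of tracing is simple even without a tracing library: mint a trace ID at the edge (or reuse one handed to you by a proxy) and attach it to every log line for that request. A minimal sketch, where the `x-trace-id` header name and `getTraceId` helper are illustrative rather than from any specific library:

```typescript
import { randomUUID } from "crypto";

// Reuse a trace ID supplied by an upstream proxy, or mint a new one.
function getTraceId(headers: Record<string, string | undefined>): string {
  return headers["x-trace-id"] ?? randomUUID();
}

// Every log line for this request carries the same traceId, so
// "show me everything for request X" becomes a single filter.
const traceId = getTraceId({});
console.log(JSON.stringify({ level: "info", message: "lesson loaded", traceId }));
```

Propagate the same ID in outgoing requests to downstream services, and their logs become searchable under the same key.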
Unstructured logs (`console.log("user login failed")`) are nearly useless at scale. You can't filter, search, or aggregate them. Structured logs are JSON objects with consistent fields.
Structured logging rules:

- Use consistent levels: `debug` for development, `info` for normal operations, `warn` for recoverable issues, `error` for failures

You don't need hundreds of metrics. Start with the four "golden signals": latency, traffic, errors, and saturation.
For a web application, track these at minimum: request count, error rate (the share of 5xx responses), and per-endpoint latency percentiles (p50 and p95).
In production, use a proper metrics library or service (Prometheus, Datadog, Vercel Analytics) rather than rolling your own. The pattern above illustrates what to track.
A health check endpoint tells you (and your load balancer) whether the application is functioning. It should verify critical dependencies.
Hit this endpoint from your monitoring service every 30-60 seconds. If it returns 503, investigate.
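The polling loop itself is trivial; a real monitoring service replaces it, but the logic is the same. A sketch (the URL and interval are placeholders) that treats any network failure or non-200 response as unhealthy:

```typescript
// Returns true only when the health endpoint answers 200.
// A thrown fetch (DNS failure, refused connection) also counts as unhealthy.
async function pollHealth(url: string): Promise<boolean> {
  try {
    const res = await fetch(url);
    return res.status === 200;
  } catch {
    return false;
  }
}

// Check every 45 seconds and log failures:
// setInterval(async () => {
//   if (!(await pollHealth("https://example.com/api/health"))) {
//     console.error("health check failed", new Date().toISOString());
//   }
// }, 45_000);
```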
Bad alerting is worse than no alerting. If every alert is a false alarm, you'll ignore the real ones.
Alert on symptoms, not causes. Don't alert on "CPU is at 80%." Alert on "error rate exceeded 5% for 5 minutes" or "p95 latency exceeded 2 seconds for 10 minutes."
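Mechanically, "error rate exceeded 5% for 5 minutes" means evaluating the rate over a sliding window of recent request outcomes, never on a single data point. A minimal sketch; the `Sample` type and thresholds are illustrative:

```typescript
interface Sample {
  timestamp: number; // ms since epoch
  isError: boolean;
}

const WINDOW_MS = 5 * 60 * 1000;
const ERROR_RATE_THRESHOLD = 0.05; // 5%

// Fire only when the error rate across the whole window crosses the threshold.
function shouldAlert(samples: Sample[], now: number): boolean {
  const recent = samples.filter((s) => now - s.timestamp <= WINDOW_MS);
  if (recent.length === 0) return false;
  const errors = recent.filter((s) => s.isError).length;
  return errors / recent.length > ERROR_RATE_THRESHOLD;
}
```

A brief blip of errors inside an otherwise healthy window stays below the threshold, which is exactly the point: sustained symptoms page people, transient spikes don't.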
Alert tiers:
| Severity | Condition           | Response                 | Example                            |
| -------- | ------------------- | ------------------------ | ---------------------------------- |
| Critical | Service is down     | Page on-call immediately | Health check failing for 3 minutes |
| Warning  | Service is degraded | Investigate within hours | Error rate above 2% for 15 minutes |
| Info     | Anomaly detected    | Review next business day | Traffic 50% above normal           |
Rules for good alerts:

- Every alert must be actionable; if there is nothing for the responder to do, it is noise
- Alert on symptoms (error rate, latency) rather than causes (CPU, memory)
- Require the condition to hold for a duration, so transient spikes don't page anyone
- Delete alerts that never fire or that you routinely ignore
When something breaks, you need a systematic approach: start with what changed most recently, checking the deploy history and `git log` for the timeframe of the failure.

Create a runbook for your most common incidents. A runbook is a step-by-step guide for diagnosing and fixing a specific type of failure. When you're paged at 2 AM, you don't want to think; you want to follow a checklist.
Monitoring is not a feature you ship once. It evolves with your application. Start simple, add instrumentation when you get bitten by something you couldn't diagnose, and delete alerts that never fire.
```typescript
interface LogEntry {
  level: "debug" | "info" | "warn" | "error";
  message: string;
  timestamp: string;
  service: string;
  traceId?: string;
  userId?: string;
  [key: string]: unknown;
}

function createLogger(service: string) {
  function log(level: LogEntry["level"], message: string, context?: Record<string, unknown>): void {
    const entry: LogEntry = {
      level,
      message,
      timestamp: new Date().toISOString(),
      service,
      ...context,
    };
    // In production, emit one JSON object per line for the log aggregator
    // In development, pretty-print to the console
    if (process.env.NODE_ENV === "production") {
      process.stdout.write(JSON.stringify(entry) + "\n");
    } else {
      console[level](message, context ?? "");
    }
  }
  return {
    debug: (msg: string, ctx?: Record<string, unknown>) => log("debug", msg, ctx),
    info: (msg: string, ctx?: Record<string, unknown>) => log("info", msg, ctx),
    warn: (msg: string, ctx?: Record<string, unknown>) => log("warn", msg, ctx),
    error: (msg: string, ctx?: Record<string, unknown>) => log("error", msg, ctx),
  };
}

const logger = createLogger("api");

// Usage
logger.info("Lesson completed", {
  userId: "user-123",
  lessonId: "lesson-45",
  durationMs: 1250,
  phaseId: 3,
});
```

Middleware to track request metrics:
```typescript
const requestMetrics = {
  total: 0,
  errors: 0,
  latencyBuckets: new Map<string, number[]>(),
};

function trackRequest(path: string, statusCode: number, durationMs: number): void {
  requestMetrics.total++;
  if (statusCode >= 500) {
    requestMetrics.errors++;
  }
  const bucket = requestMetrics.latencyBuckets.get(path) ?? [];
  bucket.push(durationMs);
  requestMetrics.latencyBuckets.set(path, bucket);
}

// Nearest-rank percentile over an unsorted sample
function percentile(values: number[], p: number): number {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.max(0, Math.ceil((p / 100) * sorted.length) - 1));
  return sorted[idx];
}

// Expose a metrics summary endpoint
function getMetricsSummary(): Record<string, unknown> {
  const errorRate = requestMetrics.total > 0 ? requestMetrics.errors / requestMetrics.total : 0;
  return {
    totalRequests: requestMetrics.total,
    errorRate: Math.round(errorRate * 10000) / 100, // percentage, two decimals
    topEndpoints: Array.from(requestMetrics.latencyBuckets.entries())
      .map(([path, durations]) => ({
        path,
        count: durations.length,
        p50: percentile(durations, 50),
        p95: percentile(durations, 95),
      }))
      .sort((a, b) => b.count - a.count)
      .slice(0, 10),
  };
}
```

Health check endpoint (`app/api/health/route.ts`):
```typescript
import { NextResponse } from "next/server";

interface HealthStatus {
  status: "healthy" | "degraded" | "unhealthy";
  checks: Record<string, { status: string; latencyMs: number }>;
  timestamp: string;
}

export async function GET(): Promise<NextResponse<HealthStatus>> {
  const checks: HealthStatus["checks"] = {};

  // Check database connectivity
  const dbStart = performance.now();
  try {
    // Replace with your actual DB check. fetch only rejects on network
    // errors, so treat a non-2xx response as a failure explicitly.
    const res = await fetch(process.env.NEXT_PUBLIC_SUPABASE_URL + "/rest/v1/", {
      headers: { apikey: process.env.NEXT_PUBLIC_SUPABASE_ANON_KEY ?? "" },
    });
    if (!res.ok) throw new Error(`status ${res.status}`);
    checks.database = { status: "ok", latencyMs: Math.round(performance.now() - dbStart) };
  } catch {
    checks.database = { status: "error", latencyMs: Math.round(performance.now() - dbStart) };
  }

  const isHealthy = Object.values(checks).every((c) => c.status === "ok");
  return NextResponse.json(
    {
      status: isHealthy ? "healthy" : "degraded",
      checks,
      timestamp: new Date().toISOString(),
    },
    { status: isHealthy ? 200 : 503 }
  );
}
```

## Runbook: High Error Rate on /api/lessons
1. Check Vercel function logs for the error message
2. If "connection refused" → Check Supabase status page
3. If "timeout" → Check if a slow query was recently deployed
4. If "validation error" → Check if the client is sending malformed data
5. If none of the above → Escalate to #engineering with the trace ID