Keep the lights on at internet scale.
SREs apply software engineering to operations problems. You define SLOs, build automation to maintain reliability, manage incident response, and eliminate toil. When systems go down at 3 AM, you are the one people call, and you build the tools that make it happen less often.
“You start by checking the SLO dashboard and reviewing last night's incidents. Mid-morning, you write a postmortem for yesterday's outage and implement automation to prevent recurrence. After lunch, you run a chaos engineering experiment, tune alert thresholds to reduce noise, and mentor a junior engineer on on-call best practices.”
10 required
16 required
23 required
When you complete this track, you'll have built:
The foundational text on Site Reliability Engineering practices and principles.
Google Cloud / DORA
Industry-standard metrics for measuring software delivery and operational performance.
CNCF
Vendor-neutral standard for distributed tracing, metrics, and logs.
Roles you can grow into from here.