Career path
Site Reliability Engineer
A site reliability engineer (SRE) keeps large systems up, fast, and recoverable by applying software engineering to the problem of running services at scale. The role overlaps with DevOps but leans harder into reliability, measurement, and automating away toil.
What the job actually is
You make services dependable and measure whether they actually are. That means setting reliability targets, building monitoring and alerting, automating the manual work that doesn't scale, and leading a calm response when something breaks. A defining part of the job is the blameless post-mortem: learning from failure, without pinning it on individuals, so the same incident is far less likely to happen twice.
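To make "setting reliability targets" concrete, here is a minimal sketch of one common way teams turn an availability target into a number they can track, often called an error budget. The function names, the 30-day window, and the 99.9% target are illustrative assumptions, not something this guide prescribes.

```python
def error_budget_minutes(target: float, window_days: int = 30) -> float:
    """Minutes of downtime an availability target allows over a window.

    `target` is a fraction such as 0.999 (hypothetical example value).
    """
    if not 0.0 < target < 1.0:
        raise ValueError("target must be a fraction strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - target) * total_minutes


def budget_remaining(target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(target, window_days)
    return (budget - downtime_minutes) / budget


if __name__ == "__main__":
    # A 99.9% target over 30 days allows about 43.2 minutes of downtime.
    print(f"{error_budget_minutes(0.999):.1f} minutes allowed")
    # After 10.8 minutes of downtime, about 75% of the budget is left.
    print(f"{budget_remaining(0.999, 10.8):.0%} of budget remaining")
```

The point of the arithmetic is that a target like "99.9%" stops being a slogan and becomes a measurable allowance, which is what lets a team check whether a service is actually as dependable as intended.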
Skills that matter
- Strong coding — SRE is engineering, not just operations.
- Linux, networking, and distributed systems fundamentals.
- Monitoring and observability — metrics, logs, and tracing.
- Incident response — staying clear-headed under pressure.
- Automation — replacing repetitive toil with code.
How to switch in
SRE is almost always a move from elsewhere in tech — typically software engineering, DevOps, or systems administration. The path is to deepen both your coding and your operational instincts: learn how distributed systems fail, get hands-on with observability tooling, and seek out on-call experience where you handle real incidents.