🚨You're coming into work stressed about (yet another) production incident. People have questions, but the answers focus on yesterday's drama rather than the pattern of declining reliability. 🥛Glass half-full: at least your alerting, on-call, and incident management processes are working well. Sound familiar? Well done: you're passing stage 1 of 3 on software reliability, and it's time for stage 2.
Stage 1: 🔦Lights on, get organised
Stage 2: 🎄Managing Reliability
Stage 3: 🩺Paying micro attention
Stage 1: 🔦Lights on, get organised 🔔
Find out when a system fails before the customer does. Heavy focus on observability, tooling, and process (checks, alerts, on-call process, incident management).
Stage 2: 🎄Managing Reliability
Dashboards lit up like a Christmas tree. Alert fatigue. Begin managing reliability targets alongside delivery, quality and security. Focus on incremental improvement through SLIs, SLOs, and SLAs.
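To make the SLI and SLO idea concrete, here is a minimal sketch of tracking an availability SLI against an SLO target and seeing how much error budget is left. The function names, the request counts, and the 99.9% target are all illustrative assumptions, not values from any particular system or tool.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Fraction of requests that succeeded (a simple availability SLI)."""
    if total_requests == 0:
        return 1.0  # assumption: no traffic counts as fully available
    return good_requests / total_requests


def error_budget_remaining(good_requests: int, total_requests: int,
                           slo_target: float) -> float:
    """Share of the error budget still unspent for the period.

    1.0 means nothing spent, 0.0 means exactly used up,
    and a negative value means the SLO has been breached.
    """
    allowed_failures = (1.0 - slo_target) * total_requests
    actual_failures = total_requests - good_requests
    if allowed_failures == 0:
        return 1.0 if actual_failures == 0 else float("-inf")
    return 1.0 - actual_failures / allowed_failures


# Hypothetical month: 1,000,000 requests, 600 failures, 99.9% SLO target.
sli = availability_sli(999_400, 1_000_000)
budget = error_budget_remaining(999_400, 1_000_000, slo_target=0.999)
print(f"SLI: {sli:.4%}, error budget remaining: {budget:.0%}")
```

With a number like this on the table, reliability work can be weighed against delivery: plenty of budget left suggests room to ship, while a shrinking budget is a reason to slow down and fix things.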
Stage 3: 🩺Paying micro attention ⚠️
Past performance is not an indicator of future performance. A service with 99.999% uptime can and will still fail. Signals of tomorrow's outage are available today. Focus on daily log hygiene, anomalies, learning from other teams' incidents, and game days for preparation.
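One simple way to spot those signals is to compare today's count of a log event against its recent baseline. The sketch below assumes you already have daily error counts from your logs; the function name, the sample numbers, and the threshold of 3 standard deviations are hypothetical choices, and real traffic usually needs something more seasonality-aware.

```python
from statistics import mean, stdev


def is_anomalous(history: list[int], today: int, threshold: float = 3.0) -> bool:
    """Flag today's count if it sits more than `threshold` standard
    deviations away from the mean of the recent history."""
    if len(history) < 2:
        return False  # not enough data to estimate a baseline
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # Perfectly flat history: any change at all is worth a look.
        return today != baseline
    return abs(today - baseline) / spread > threshold


# Hypothetical daily counts of one warning message over the past week.
last_week = [10, 12, 11, 9, 13, 10, 11]
print(is_anomalous(last_week, today=12))
print(is_anomalous(last_week, today=40))
```

A check like this will not predict an outage on its own, but it turns "paying attention" into a daily habit rather than something that only happens after an incident.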
Originally published on LinkedIn.