Myles Henaghan

Outages > Incidents > Resilience

🚨 You're coming into work stressed about (yet another) production incident. People have questions, but the answers focus on yesterday's drama rather than the pattern of declining reliability. 🥛 Glass half-full: at least your alerting, on-call and incident management processes are working well. Sound familiar? Well done, you're passing stage 1 of 3 on software reliability. Time for stage 2.


  • Stage 1: 🔦Lights on, get organised 

  • Stage 2: 🎄Managing Reliability

  • Stage 3: 🩺Paying attention



Stage 1: 🔦Lights on, get organised 🔔


Find out that a system has failed before the customer does. Heavy focus on observability, tooling, and process (checks, alerts, on-call process, incident management).


Stage 2: 🎄Managing Reliability 


Dashboards lit up like a Christmas tree. Alert fatigue. Begin managing reliability targets alongside delivery, quality and security. Focus on incremental improvement through SLIs, SLOs, and SLAs. 
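
To make the SLI/SLO idea concrete, here is a minimal sketch of how an availability SLI and the remaining error budget might be calculated. The function names and the request counts are hypothetical, chosen only for illustration; a real setup would pull these numbers from your monitoring system.

```python
def availability_sli(good_events: int, total_events: int) -> float:
    """Fraction of events that were 'good' (e.g. successful requests)."""
    if total_events == 0:
        # Assumption: with no traffic, treat the service as fully available.
        return 1.0
    return good_events / total_events


def error_budget_remaining(sli: float, slo_target: float) -> float:
    """Share of the error budget still unspent (1.0 = untouched, < 0 = overspent)."""
    allowed_failure = 1.0 - slo_target  # budget granted by the SLO
    if allowed_failure <= 0:
        raise ValueError("An SLO target of 100% leaves no error budget")
    actual_failure = 1.0 - sli  # budget consumed so far
    return 1.0 - actual_failure / allowed_failure


# Hypothetical month: 1,000,000 requests, 500 of them failed, SLO of 99.9%.
sli = availability_sli(good_events=999_500, total_events=1_000_000)
remaining = error_budget_remaining(sli, slo_target=0.999)
print(f"SLI: {sli:.4%}, error budget remaining: {remaining:.0%}")
```

Tracking the remaining budget, rather than individual alerts, gives delivery and reliability work a shared number to trade off against.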


Stage 3: 🩺Paying micro attention ⚠️


Past performance is not an indicator of future performance. A service with 99.999% uptime can and will still fail. Signals of tomorrow's outage are available today. Focus on daily log hygiene, anomalies, learning from other teams' incidents, and game days for preparation.
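
As a rough illustration of how little room "five nines" leaves, the sketch below converts an availability percentage into allowed downtime. The function name and the 365-day period are assumptions made for the example.

```python
def allowed_downtime_minutes(availability_percent: float, period_days: float = 365.0) -> float:
    """Minutes of downtime permitted over the period at a given availability."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1.0 - availability_percent / 100.0)


for target in (99.9, 99.99, 99.999):
    print(f"{target}% -> {allowed_downtime_minutes(target):.2f} minutes per year")
```

At 99.999% that works out to about 5.26 minutes per year, so a single slow-to-detect incident can use up the whole allowance.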



Originally published on LinkedIn.
