Customers assume your system will just work: to them, 100% availability is reasonable, no questions asked. But when something breaks, there can be organisational panic, finger pointing and demands that it gets fixed ASAP. Your customers are frustrated, pressure builds from the product teams and your tech teams are stressed. You are not meeting your customers’ unspoken expectations. It’s important to get ahead of the game.
The Unspoken Expectation
Just because customers, internal teams and senior leadership might not explicitly ask for reliability doesn’t mean they don’t expect it. Whether they are using a SaaS product, a website, or some cloud service, they just assume it’ll be there, working perfectly, all the time.
For those of us on the tech side of things, that expectation can be a bit of a double-edged sword. We’re the ones who have to make sure everything runs smoothly, but no one notices until something goes wrong. And when it does, well, that’s when the pressure starts to mount.
The Sudden Demand and Internal Pressure
The moment something goes down, everything changes. Customers make noise on your social media platforms, the product and customer teams are breathing down your neck, and suddenly the whole business is looking at you to sort it out. It’s not just the customers you’ve got to worry about - internally, the pressure can be even harder to deal with. The product teams might see their plans go sideways, customer service is dealing with a flood of complaints, and the executives are freaking out about the bottom line and the company’s reputation.
In those moments, it’s easy to feel like everyone’s pointing the finger at the tech team (resist the urge to say, ‘I told you so’). Crisis is an opportunity! Don’t panic in the moment; use the moment to educate or remind your organisation that reliability needs to be a tenet of all executive roles and front of mind for everyone.
The real trick isn’t just about fixing the problem quickly, it’s about getting your organisation to understand the real value of your SRE team and why investing in good engineering design is so important.
In real terms, it can come down to the cost of earlier trade-offs. When people decide to cut corners or skip proper design and testing, they gamble with reliability. And when that bet doesn’t pay off, it’s the tech team left holding the bag. So, part of your job is ensuring everyone understands that putting in the effort upfront saves a lot of headaches down the line.
Staying Ahead
If you want to keep everyone off your back and keep things running smoothly, you’ve got to be proactive about reliability. Here’s how to do it:
Learn from Mistakes - Post-Mortem Analysis: After things have settled down, take a good look at what happened and why. Figure out how to stop it from happening again. And remember, it’s about learning, not blaming (difficult questions can still be asked and answered, but a safe environment is critical).
Plan for the Worst - Design for Failure: Assume things are going to break at some point. Build your systems so they can handle failure without everything going wrong. Think about redundancies, load balancers, and backup plans.
Make Sure You Can See What’s Going On - Observability: You can’t fix what you can’t see. Observability is key. Set up your systems so you’ve got a clear view of what’s happening under the hood - logs, metrics, traces. It helps you understand how your system is behaving, and lets you stop fires before they start.
Keep an Eye on Things - Monitoring and Alerting: Get some solid monitoring and alerting systems in place. That way, you can spot issues before they turn into full-blown disasters, and you’re not always playing catch-up.
Automate Where You Can - Automated Recovery: If you can automate recoveries, do it. Things like restarts, failovers, and scaling should happen automatically so downtime is minimal, and you can keep everything ticking along nicely.
Be Ready for Anything - Incident Response Planning & Practice: Have a plan for when things do go wrong. Make sure everyone practises the drill and knows how to keep the rest of the business in the loop.
Be Proactive Instead of Reactive - Reduce Risk of Failure: Incorporate proactive actions into your engineering strategy and principles. Establish clear and attainable engineering standards. Provide clear architectural guidelines and invest in templates that streamline these to minimise the mental workload of engineers. Formalise and automate processes to prevent human error.
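To make "Design for Failure" and "Automated Recovery" a bit more concrete, here is a minimal sketch in Python. It assumes a primary operation that can fail temporarily and a fallback you are happy to serve instead (a cache, a secondary region, a degraded response). The function name call_with_recovery and its parameters are made up for illustration, not taken from any particular library.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_recovery(
    primary: Callable[[], T],
    fallback: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try `primary` up to `attempts` times, then use `fallback`.

    Hypothetical helper: waits base_delay, then 2x, then 4x... between tries.
    """
    for attempt in range(attempts):
        try:
            return primary()
        except Exception:
            # Assumption: any exception is treated as a temporary failure.
            # Real code should catch only the errors it knows are retryable.
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    # Every attempt failed, so degrade gracefully instead of falling over.
    return fallback()


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky_service() -> str:
        # Simulated dependency that fails twice before recovering.
        state["calls"] += 1
        if state["calls"] < 3:
            raise ConnectionError("service unavailable")
        return "live data"

    print(call_with_recovery(flaky_service, lambda: "cached data"))
```

The point is less the code than the habit: decide in advance what "failing well" looks like, then let the system do it without waiting for a human to be paged.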
Wrapping It Up
Reliability might not be something people talk about or fund appropriately, but it’s always expected, and if you are responsible for it, you’ll know about it when it’s missing. The pressure from customers and teams within the business can hit hard when things go wrong. But if you’re proactive with your approach, you can avoid much of that stress and keep everyone happy, inside and outside the company. In the end, reliability isn’t just about keeping things running smoothly; it’s about making sure people can trust you, whether they’re using your service or working alongside you.
Check out the Hierarchy of Engineering Needs to see how it can help you plan for reliable engineering systems.