It’s close to midnight and you are about to wrap up your day. Suddenly you get paged to resolve a critical bug that is breaking some of the automated reporting emails.
You go on to check the logs in the log management tool. This is not the ideal time to find out that logs are not getting streamed to the log management service properly.
Next, you decide to check the performance metrics of the email API and you realize that you don’t know the new monitoring tool well enough to get the right metrics quickly.
That sets the theme for why, as effective engineers, we should fail fast and hone our ability to respond to failures and recover from them quickly.
Another post I wrote on failing fast:
“The best defense against major unexpected failures is to fail often.”
Netflix knows its way around building reliable systems. What its engineers have done may sound counter-intuitive: they built a tool called Chaos Monkey that randomly kills services in their own infrastructure.
It turns out that this strategy helps Netflix increase its site’s reliability. Failing services during office hours, when all the engineers are available, lets them run recovery drills effectively and prepares them for actual emergencies.
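To make the idea concrete, here is a minimal sketch of a Chaos Monkey-style drill in Python. It is not Netflix’s implementation: the service names, the office-hours window, and the `kill` callback are all assumptions made for illustration. The only ideas taken from the description above are "pick a service at random" and "only do it while engineers are around".

```python
import random
from datetime import datetime, time

# Hypothetical service names; in practice these would come from your own inventory.
SERVICES = ["email-api", "report-generator", "log-shipper"]

# Assumed office-hours window (weekdays, 10:00 to 17:00).
OFFICE_START = time(10, 0)
OFFICE_END = time(17, 0)


def in_office_hours(now: datetime) -> bool:
    """Return True on weekdays between OFFICE_START and OFFICE_END."""
    return now.weekday() < 5 and OFFICE_START <= now.time() < OFFICE_END


def pick_victim(services, now, rng=random):
    """Return one randomly chosen service, or None outside office hours."""
    if not services or not in_office_hours(now):
        return None
    return rng.choice(services)


def run_drill(services, now, kill, rng=random):
    """Kill one random service through the caller-supplied `kill` callback.

    `kill` is a placeholder for whatever actually stops a service in your
    environment; this sketch never touches real infrastructure itself.
    """
    victim = pick_victim(services, now, rng)
    if victim is not None:
        kill(victim)
    return victim
```

Calling `run_drill(SERVICES, datetime.now(), kill=print)` is a harmless dry run: it only prints the name of the service that would have been killed, and does nothing outside the assumed office hours.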
Why is it so important to prepare for failures?
As software engineers, we know our systems are bound to fail at some point and some releases will certainly have bugs. In such scenarios, learning and investing time in the ability to recover quickly becomes a high-leverage activity. It gives you the confidence to move fast with your product, with the peace of mind that you are ready to tackle problems if they arise.
A few reasons to invest time in recovering from failures:
- Prepares the team to script its response ahead of time through mock drills.
- Surfaces gaping holes in the systems used for monitoring and debugging.
- Helps develop better tools and processes to handle emergencies.
- Helps control stress and panic during actual failures.
Write your contingency plans
Ask yourself “what if” questions and work through contingency plans:
- What if a critical bug gets deployed with a release?
- What if a user raises an urgent ticket?
- What if my message broker goes down?
- What if my systems face a spike in usage?
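As one example of turning a “what if” into a plan, the sketch below handles the message broker question: if publishing fails, the message is appended to a local spool file and replayed later. Everything here is hypothetical, including the `BrokerDown` exception, the `publish` callable, and the spool file format; a real broker client raises its own exception types and may already offer retries.

```python
import json
from pathlib import Path


class BrokerDown(Exception):
    """Hypothetical error a publisher raises when the broker is unreachable."""


def publish_with_fallback(message, publish, spool_path):
    """Try the broker first; on failure, append the message to a local spool.

    Returns "published" or "spooled" so callers and drills can assert on it.
    """
    try:
        publish(message)
        return "published"
    except BrokerDown:
        with Path(spool_path).open("a", encoding="utf-8") as spool:
            spool.write(json.dumps(message) + "\n")
        return "spooled"


def replay_spool(publish, spool_path):
    """Re-send spooled messages once the broker is back; return the count sent.

    If `publish` fails partway through, the spool file is left in place, so
    messages that were already re-sent may be sent again on the next attempt.
    """
    path = Path(spool_path)
    if not path.exists():
        return 0
    lines = path.read_text(encoding="utf-8").splitlines()
    messages = [json.loads(line) for line in lines if line]
    for message in messages:
        publish(message)
    path.unlink()
    return len(messages)
```

The useful part is less the code than the drill it enables: pass in a `publish` function that always raises `BrokerDown`, confirm that messages land in the spool, then swap in a working one and confirm that `replay_spool` delivers them.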
The same “what if” questions apply to other aspects of software engineering:
- What if the due date for a feature gets moved up?
- What if a critical team member falls sick?
- What if there is a dilemma over the product plan and its prioritization?
No matter how careful we are and what we are working on, things will go wrong some of the time.
The better our tools and processes for recovering quickly from failures, and the more we practice using them, the higher our confidence and the lower our stress levels will be. This allows us to move forward much more quickly.
That’s all, folks!