Fail Fast: Hone Your Ability to Recover and Respond Quickly


It’s close to midnight and you are about to wrap up your day. Suddenly you get a PagerDuty alert to resolve a critical bug that’s breaking some of the automated reporting emails.

You go on to check the logs in the log management tool. This is not the ideal time to find out that logs are not getting streamed to the log management service properly.

Next, you decide to check the performance metrics of the email API and you realize that you don’t know the new monitoring tool well enough to get the right metrics quickly.

That sets the theme for this post: why, as effective engineers, we should fail fast and hone our ability to recover from and respond to failures quickly.

Another post I wrote on failing fast:

https://priyankvex.wordpress.com/2017/07/08/philosophy-behind-the-offensive-programming/

“The best defense against major unexpected failures is to fail often.”

Netflix knows its way around building reliable systems. What its engineers have done may sound counter-intuitive: they built a tool called Chaos Monkey, which randomly kills services in their own infrastructure.

It turns out that this strategy helps Netflix increase the site’s reliability. Failing services during office hours, when all the engineers are available, helps them perform recovery drills effectively and prepares them for actual emergencies.
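To make the idea concrete, here is a minimal sketch of an office-hours failure drill. It is not how Chaos Monkey itself is implemented; the function names, the 9-to-5 weekday window, and the `kill` callback are all assumptions made up for illustration.

```python
import random
from datetime import datetime, time

# Assumed office hours for the drill; adjust to your team's schedule.
OFFICE_START = time(9, 0)
OFFICE_END = time(17, 0)


def within_office_hours(now: datetime) -> bool:
    """Return True on weekdays between OFFICE_START and OFFICE_END."""
    return now.weekday() < 5 and OFFICE_START <= now.time() < OFFICE_END


def run_drill(services, now, kill, rng=random):
    """Kill one randomly chosen service, but only during office hours.

    `kill` is a hypothetical callback that stops the named service.
    Returns the name of the killed service, or None if nothing was killed.
    """
    if not services or not within_office_hours(now):
        return None
    # Sort first so the choice is reproducible when a seeded rng is passed in.
    victim = rng.choice(sorted(services))
    kill(victim)
    return victim
```

Passing the current time and the `kill` callback in as arguments keeps the drill itself easy to test: you can check that nothing gets killed at midnight without touching a real service.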

Why is it so important to prepare for failures?

As software engineers, we know our systems are bound to fail at some point and some releases will certainly have bugs. In such scenarios, investing time in the ability to recover quickly becomes a high-leverage activity. It gives you the confidence to move fast with your product, with the peace of mind that you are ready to tackle problems if they arise.

A few reasons to invest time in recovering from failures:

  1. Prepares the team to write scripts for success via mock drills.
  2. Surfaces gaping holes in the systems used for monitoring and debugging.
  3. Helps develop better tools and processes to handle emergencies.
  4. Helps control stress and panic in the cases of actual failures.

Write your contingency plans

Ask yourself “what if” questions and work through contingency plans:

  1. What if a critical bug gets deployed with a release?
  2. What if a user raises an urgent ticket?
  3. What if my message broker goes down?
  4. What if my systems face a spike in usage?
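A contingency plan for one of these questions can often be written down as code. As a rough sketch for the message broker case, the snippet below retries a publish with exponential backoff and, if the broker stays down, keeps the message in a local spool to be replayed later. `BrokerUnavailable`, `publish`, and `spool` are hypothetical names for this example, not part of any real broker client.

```python
import time


class BrokerUnavailable(Exception):
    """Raised by the hypothetical publish function when the broker is down."""


def publish_with_fallback(message, publish, spool,
                          retries=3, base_delay=0.5, sleep=time.sleep):
    """Try to publish; on repeated failure, park the message in `spool`.

    Returns True if the message was published, False if it was spooled.
    """
    for attempt in range(retries):
        try:
            publish(message)
            return True
        except BrokerUnavailable:
            # Back off before the next attempt: 0.5s, 1s, 2s, ...
            if attempt < retries - 1:
                sleep(base_delay * 2 ** attempt)
    # Broker is still down: keep the message so it can be replayed later.
    spool.append(message)
    return False
```

Whether spooling is the right fallback depends on the system; the point is that the plan gets decided and rehearsed before the broker actually goes down.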

This can be applied to even more aspects of software engineering:

  1. What if the due date for a feature gets moved up?
  2. What if a critical team member falls sick?
  3. What if there is a dilemma in the product plan and prioritization?

Conclusion

No matter how careful we are and what we are working on, things will go wrong some of the time.

The better our tools and processes for recovering quickly from failures, and the more we practice using them, the higher our confidence and the lower our stress levels will be. This allows us to move forward much more quickly.

That’s all, folks!

 
