Why software resilience should be the real goal of DevOps

To improve your software development process, use DevOps to create a resilient system. Expert Matthew Heusser explains why reliability is no longer the goal.

Years ago, I worked for an organization that prided itself on its internal controls for software. Using a work order, developers had to get every box checked and then outline the steps to move the code to production. If there was a problem, systems would be down, and any "fix" would require a similarly rigorous -- and documented -- process. That single-minded focus on reliability meant we had to batch changes together into projects and roll out less often. Bigger batches carried more risk, which in turn seemed to justify even more control. It became a vicious cycle.

But then came DevOps. I don't mean DevOps in the fancy CI/CD way, though. In this case, it's about dev and ops focused together on what matters, which, as is clear from the above example, cannot be reliability. It has to be software resilience.

A few years ago, my friend Noah Sussman suggested that instead of reliability, software systems should focus on resilience. Where reliability is focused on preventing failure, software resilience is concerned with making sure that a single failure does not destroy the system.

Resilience in an e-commerce application

To understand this, look at Amazon's front page. Have you noticed it seems a bit like a group of boxes put together like Legos? There is a title bar with links to your profile and your orders. Underneath that, there are lists: "Fun gift ideas under $10," "Things you browsed recently," "Your recommendations" and so on.

Each of those containers is a combination of display code and a web service call. If the web service is down -- perhaps because a new change is rolling out this very second -- then the box does not display. The rest of the Amazon homepage continues to function perfectly well, and the customer likely does not even notice the absence of the box. That is a classic example of software resilience.
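
To make the idea concrete, here is a minimal sketch of a page built from independent boxes. It is not Amazon's implementation; the function names (`render_page`, `recently_browsed`, `recommendations`) and the simulated outage are assumptions made up for illustration. Each box pairs a title with a callable that stands in for a web service call, and a failing call removes only its own box.

```python
from typing import Callable, Dict, List


def render_page(sections: Dict[str, Callable[[], List[str]]]) -> str:
    """Render each box independently; skip any box whose service call fails."""
    rendered = []
    for title, fetch_items in sections.items():
        try:
            items = fetch_items()  # stands in for a web service call
        except Exception:
            # The backing service is down (or mid-deploy): leave this box
            # out and keep rendering the rest of the page.
            continue
        rendered.append(f"[{title}] " + ", ".join(items))
    return "\n".join(rendered)


def recently_browsed() -> List[str]:
    # Hypothetical healthy service.
    return ["headphones", "desk lamp"]


def recommendations() -> List[str]:
    # Hypothetical service that is unavailable right now.
    raise ConnectionError("recommendation service unavailable")


page = render_page({
    "Things you browsed recently": recently_browsed,
    "Your recommendations": recommendations,
})
print(page)
```

Only the "Things you browsed recently" box is printed. In a real system the skipped failure would also be logged or counted, so that monitoring can pick it up even though the customer never sees it.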

Combine that with an effort to reduce mean time to recovery, and two industry trends start to make sense: we invest more energy in monitoring so we can find problems quickly, and we put a new emphasis on smaller changes.
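
Mean time to recovery is simply the average time between a failure being detected and service being restored. The sketch below shows the arithmetic; the function name and the incident timestamps are invented for illustration and are not data from any real system.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def mean_time_to_recovery(
    incidents: List[Tuple[datetime, datetime]]
) -> timedelta:
    """Average of (recovered - detected) over all incidents."""
    total = sum(
        (recovered - detected for detected, recovered in incidents),
        timedelta(),
    )
    return total / len(incidents)


# Hypothetical incidents: (detected, recovered).
incidents = [
    (datetime(2018, 4, 2, 9, 0), datetime(2018, 4, 2, 9, 30)),     # 30 minutes
    (datetime(2018, 4, 10, 14, 0), datetime(2018, 4, 10, 14, 10)),  # 10 minutes
]

mttr = mean_time_to_recovery(incidents)
print(mttr)
```

With these two made-up incidents the result is 20 minutes. Faster detection through monitoring and smaller, easier-to-revert changes both shrink the individual durations, which is what pulls the average down.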

Let's put that together to define software resilience:

  1. We need to stop treating reliability as the goal.
  2. We must build systems that are resilient.
  3. To do that, our focus should be on smaller changes, which are more easily rolled back.
  4. When we deploy, we must roll out only a tiny part of a system at any given time.
  5. Our system must be designed so the failure of one component does not bring everything to a halt.
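
One common way to keep a change small and easy to roll back is to expose it to only a fraction of users at first. The article does not prescribe this specific technique; what follows is a minimal sketch of a percentage-based rollout, and the names `in_rollout` and `handle_request` are hypothetical.

```python
import hashlib


def in_rollout(user_id: str, percent: int) -> bool:
    """Deterministically place a user in one of 100 buckets.

    The same user always lands in the same bucket, so raising `percent`
    only ever adds users, and setting it to 0 rolls the change back for
    everyone.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


def handle_request(user_id: str, percent: int) -> str:
    # Route the request to the new code path only for users in the rollout.
    return "new" if in_rollout(user_id, percent) else "stable"


users = [f"user-{n}" for n in range(1000)]
exposed = sum(handle_request(u, 5) == "new" for u in users)
print(f"{exposed} of {len(users)} users see the new version")
```

If monitoring shows a problem, dropping the percentage to 0 sends everyone back to the stable path without a redeploy, which keeps recovery time short.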

If we can accomplish that, we've found the prescription for software resilience. And now we can get to work -- without a work order.

This was last published in April 2018
