Building to Fail: A Success Story

The days of assuming that your systems are reliable are over.

It was always a fallacy to believe it was possible to build something perfectly reliable in the first place. Instead, you need to design to recover from failure, and you have to build that recovery capability into every layer of your system so it can handle something breaking unexpectedly and carry on gracefully.

 

Cloud Architecture = Elastic Resources

One of the things we can do with a cloud infrastructure is dynamically add and remove resources as needed. You don’t have the limitation of only having [X] physical machines to fit everything into. If the system needs to process an unusually large chunk of data, you can instantly spin up as many servers as it takes to handle the load, and then spin them back down again when you’re done.
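To make that concrete, here's a rough sketch of what bursting up and back down might look like, assuming an AWS-style Auto Scaling group managed with boto3. The group name and capacities are made up for illustration; this isn't PureCloud's actual tooling.

# Illustrative sketch: grow a hypothetical worker fleet for a burst of work,
# then shrink it again, using an AWS-style Auto Scaling group via boto3.
import boto3

autoscaling = boto3.client("autoscaling")

def scale_workers(group_name: str, desired: int) -> None:
    # Ask the cloud for `desired` instances; it adds or removes servers to match.
    autoscaling.set_desired_capacity(
        AutoScalingGroupName=group_name,
        DesiredCapacity=desired,
        HonorCooldown=False,
    )

scale_workers("event-processing-workers", desired=40)  # spin up for the big job
scale_workers("event-processing-workers", desired=4)   # spin back down when done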

Another benefit of the cloud that we take full advantage of is the ability to have a distributed architecture where all data is replicated across multiple data centers. This gives the system several options for grabbing whatever it needs to bring everything back to the state it was in before a failure. We’re never dependent on a single hard drive or processor or connection.

The recovery process requires a spike in resources, so when we have this kind of scalable environment, we can handle recoveries smoothly without slowing everything else down.

With these two features together, we can fail over to alternate sources of data very quickly with seemingly no pause in service from the user’s point of view. Even when things fall down, our services remain uninterrupted.
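In code, the failover idea is as simple as trying another copy when one doesn't answer. Here's a minimal sketch, assuming every record lives on several replicas and `fetch` stands in for whatever client call actually reads it; the endpoint names are hypothetical.

# Illustrative sketch: read from whichever replica answers, so one failed
# node or data center never blocks the request.
import random

REPLICAS = [  # hypothetical replica locations
    "https://dc-east.example.com",
    "https://dc-west.example.com",
    "https://dc-central.example.com",
]

def fetch_with_failover(fetch, key):
    # Try the replicas in random order; return the first successful answer.
    last_error = None
    for endpoint in random.sample(REPLICAS, len(REPLICAS)):
        try:
            return fetch(endpoint, key)
        except ConnectionError as err:
            last_error = err  # this copy is unreachable; try the next one
    raise last_error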

An important point here: not all cloud solutions are built like this. Do your due diligence. Look under the hood. Make sure it’s not just another monolithic set of software applications hosted in a data center.

 

Chaos Monkey

In a development environment, you’re always committing new code and you should have lots of tests — unit tests, integration tests, etc. — but what you can’t test very well is the randomness of things breaking in the real world. One of the tools we use to make sure things recover is a service that runs through our infrastructure and randomly shuts things down to see what happens (an open-source tool called Chaos Monkey).
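The core of the idea fits in a few lines. This is not Chaos Monkey's actual code (the real tool is open source), just a sketch of what "randomly shut things down and check that we recovered" means, with the terminate and health-check calls left as stand-ins.

# Illustrative sketch of the Chaos Monkey idea, not the real implementation:
# pick a random instance from each group, kill it, and confirm the group heals.
import random

def unleash(groups, terminate, is_healthy):
    # groups maps a service name to its list of running instance ids;
    # terminate and is_healthy are stand-ins for real infrastructure calls.
    for name, instances in groups.items():
        if not instances:
            continue
        victim = random.choice(instances)
        terminate(victim)  # simulate an unexpected, real-world failure
        assert is_healthy(name), f"{name} did not recover after losing {victim}"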

Netflix created Chaos Monkey because they needed a way to make sure they could continue streaming a movie to a customer no matter what happens. If a server goes down in the middle of streaming and that server has the only copy of that movie, they're going to have a very unhappy cinema junkie. Netflix uses a distributed architecture where each movie is stored in more than one location, along with records of who's watching which movie and how far through it they are.

If the server streaming a movie falls down, a process detects what happened and how far through the movie you were, then it fails over to another server with the same movie. That server then picks up the movie at the same place where the failure occurred and starts streaming those packets before the buffering runs out. Customers watching the movie never even know that a service failure happened.
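Here's a rough sketch of that resume step, assuming each title is replicated and the watcher has the last playback position on record. The object methods are placeholders, not Netflix's or PureCloud's actual interfaces.

# Illustrative sketch: when the streaming server dies, hand the session to
# another replica and pick up at the position recorded before the failure.
def resume_session(session, replicas_for, start_stream):
    # session holds the title and the last playback position we recorded.
    for server in replicas_for(session["title"]):  # every replica has the movie
        if server.is_alive():
            return start_stream(server, session["title"], session["position"])
    raise RuntimeError("no healthy replica holds " + session["title"])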

Netflix has Chaos Monkey running (and causing failures) all the time in all their environments.

When you run failure tests over time, you’re going to find a lot of weird cases. This process helps our people sleep at night because they’ve seen all the worst problems they can face if things fail, and they’ve had a chance to fix them before a customer ever even encounters them. And more importantly, the recovery process gets exercised all the time so we know it works very well.

 

Continuous Development, Continuous Deployment

Another way that we build to withstand failure is with our continuous development and deployment methodology. If you wait for big software releases to deliver new code to customers, you’re not going to be agile about pushing fixes. We’re pushing new code into PureCloud all the time, so if something small comes up, we can just fix it and push out a patch.

If something big comes up, we can roll back to the last working build — which is the build we pushed out yesterday or the day before, not four months ago — until we can make the fix and push a new version. Since we push builds so often, the differences between them are nearly imperceptible to the user, so the inconvenience of rolling back a version for a day or two is minor.
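The rollback logic itself is deliberately boring. Here's a sketch of the idea, with the activate and health-check steps as stand-ins for the real deployment machinery.

# Illustrative sketch: if a new build misbehaves, fall back to the last known
# good build, which is only a day or two of changes away.
def deploy(new_build, known_good, activate, is_healthy):
    # known_good is a list of recent build ids, newest last.
    activate(new_build)
    if is_healthy(new_build):
        known_good.append(new_build)  # this becomes the new rollback target
        return new_build
    previous = known_good[-1]
    activate(previous)  # roll back until a fixed build is ready to push
    return previous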

Combined with our distributed architecture, this means we can do rolling updates without taking the whole system down. We can update 10 percent of the servers at a time, for instance. As the updated servers come online, we use load balancing to move customer events onto them, and then update the servers we just moved traffic off of. It's basically the same mechanism as a failover.
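A sketch of that rolling update, assuming a load balancer that can drain and restore individual servers; the method names here are placeholders, not a specific product's API.

# Illustrative sketch of a rolling update: upgrade roughly 10 percent of the
# fleet at a time, draining traffic off each batch so customers never notice.
def rolling_update(servers, lb, upgrade, batch_fraction=0.10):
    batch_size = max(1, int(len(servers) * batch_fraction))
    for start in range(0, len(servers), batch_size):
        batch = servers[start:start + batch_size]
        for server in batch:
            lb.drain(server)    # shift customer events to other servers
        for server in batch:
            upgrade(server)     # apply the new build to the idle batch
        for server in batch:
            lb.restore(server)  # put the updated servers back in rotation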

 

Success through Failing

Human beings hate to fail, but that’s how we learn. It’s the lessons we learn the hard way that really stick with us, and it’s those lessons that allowed Interactive Intelligence to develop PureCloud.

“The days of assuming that your systems are reliable are over.
Instead, you need to design to recover from failure,
and you have to build that recovery capability
into every layer of your system.”

We bet the entire company on it. We spent a crap-load of money to build it from scratch. But that’s because we knew we would either fail with our existing architecture, or potentially succeed by designing something totally new – something that was built to fail.

 

Interactive Intelligence PureCloud℠ is our latest microservice-based customer engagement cloud platform, a subscription-based, scalable system that also helps the rest of your business stay in touch through a rich corporate directory, 'big data' search, ad hoc and rules-based groups, chat, video chat, and document sharing features.

Randolph Carter

An industrial designer gone bad from years of UX architecture wrangling, Randy Carter is senior marketing content architect for Interactive Intelligence. He never stops thinking about how to help customers make their systems more understandable, more polite, and more useful.