On February 28, 2017, Amazon experienced a major S3 outage in its Northern Virginia (US-EAST-1) region, which ripped the backend storage out from under many online applications. As you might expect, the S3 outage also affected EC2 customers. Some of the companies reported to have been affected include Expedia, Yahoo, Autodesk, Citrix, GitHub, Imgur, Twitch, and many others.
Many of you watching this video may have had your own systems affected, or you may have been locked out of some online services that you use every day.
Now that the dust has settled, there are a few things that we can all take away from this incident.
The first one is that you should never have a single point of failure within your IT infrastructure. With a cloud platform like AWS, this can be a bit tricky. But ideally, you should have a redundant emergency option for all your IT systems: separate infrastructure, at a separate location, managed by a separate organization.
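To make the idea of a redundant emergency option concrete, here is a minimal sketch of a read that falls back to a second, independent storage location when the first one fails. Everything in it is hypothetical: `read_with_failover`, `primary`, and `secondary` are illustrative names, and the two fetch functions are stand-ins for real storage clients run by separate providers.

```python
from typing import Callable, List, Sequence


class AllBackendsFailed(Exception):
    """Raised when every configured storage backend has failed."""


def read_with_failover(key: str, backends: Sequence[Callable[[str], bytes]]) -> bytes:
    """Try each backend in order and return the first successful read."""
    errors: List[Exception] = []
    for fetch in backends:
        try:
            return fetch(key)
        except Exception as exc:  # a real system would catch narrower error types
            errors.append(exc)
    raise AllBackendsFailed(f"all {len(backends)} backends failed for {key!r}: {errors}")


# Hypothetical stand-ins: the primary simulates an outage,
# the secondary simulates an independent copy held elsewhere.
def primary(key: str) -> bytes:
    raise ConnectionError("primary storage unavailable")


def secondary(key: str) -> bytes:
    return b"copy of " + key.encode()


data = read_with_failover("report.pdf", [primary, secondary])
```

A fallback like this only helps if the second copy is kept up to date and really does live on separate infrastructure, which is the harder part of the job.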
Another important takeaway is how fast news of this outage spread. Within minutes, word was all over social media, and it was quickly picked up by the press. And the entire conversation took place outside of Amazon’s control.
When your systems go down, the same thing will happen to your company. Gone are the days when people would wait until tomorrow morning to read about what happened today. Your entire reputation can be permanently destroyed in the time it takes to write a press release.
And when it comes to downtime, your clients are more impatient than ever. Twenty years ago, your clients would gladly call a 1-800 number and wait on hold for half an hour to get service. Today, they demand immediate access to self-service portals 24 hours a day.
As soon as your systems go offline, your most loyal clients can lose confidence and start shopping around for new options. When a data disaster occurs, being able to recover your data is no longer enough. Now, you also need to recover quickly.
Now is the time to take action and make some important decisions about your IT management. How much downtime are you willing to tolerate, and what measures are you willing to take to ensure the proper levels of resiliency?
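One way to put a number on how much downtime you can tolerate is to convert an availability target into a yearly downtime budget. The sketch below does that arithmetic. The function name and the targets shown are illustrative assumptions, not figures from any provider's service agreement.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


# Illustrative targets only.
for target in (99.0, 99.9, 99.99):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% availability allows about {minutes:.1f} minutes of downtime per year")
```

At 99.9 percent, the budget comes to about 525.6 minutes, or a little under nine hours a year. A single multi-hour outage can use up a large share of that.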
Remember. When your provider goes down, customers will blame YOU. Not your provider. Ultimately, it’s up to YOU to make sure you have a proper recovery plan in place.
What’s your failover plan?