Amazon EC2: Recent Morning Outage Reflection

Recently there was an outage on the Amazon EC2 platform.  Amazon EC2 began munching on AMIs for breakfast early one morning like they were tasty waffles with bananas and 100% pure maple syrup.  It raised the question: what would you do if your non-persistent AMIs just disappeared into /dev/null, never to be seen again?

First, I want to remind people of something.  Amazon EC2 is still in "limited beta" mode.  Some people might be forgetting this in all the excitement.  Others take it to heart, make a different set of decisions, and plan for things to fail.  I recommend the latter path when working with EC2.

Second, let's look at what happened, how people reacted, and what Amazon had to say about it.

Around 4:30 a.m. on September 29, 2007, Amazon posted:
(AWS) The EC2 control and monitoring API is currently experiencing an outage.

API calls are currently failing, but the majority of running instances appear unaffected.

We are investigating and will post more details as soon as we have them.

At this point, instances were being terminated and the API tools were also apparently not working properly, so people were having quite a few problems getting their services back online.  It's important to remember here that if you "lose" an AMI that isn't backed up somehow to a persistent data store like S3, then it's just *poof*.
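As a rough sketch of what "backed up to S3" meant in practice at the time: Amazon's AMI tools could bundle a running instance's root volume and push the bundle to S3 so it survived a termination.  The little wrapper below is purely illustrative, not anything Amazon ships — the bucket name, key paths, and user ID are placeholders, it assumes the ec2-bundle-vol / ec2-upload-bundle / ec2-register tools are on your PATH, and a dry_run flag lets you inspect the commands without touching AWS.

```python
import os
import subprocess

def backup_ami_to_s3(bucket, manifest="/mnt/image.manifest.xml",
                     key="pk.pem", cert="cert.pem", user="123456789012",
                     dry_run=False):
    """Bundle this instance's root volume and store it in S3.

    All arguments are placeholders for illustration; swap in your own
    credentials, bucket, and user ID.  Returns the command list so a
    dry run can be inspected without running anything.
    """
    cmds = [
        # 1) Bundle the root filesystem into encrypted image parts under /mnt.
        ["ec2-bundle-vol", "-d", "/mnt", "-k", key, "-c", cert, "-u", user],
        # 2) Upload the bundle to S3 so the image survives instance loss.
        ["ec2-upload-bundle", "-b", bucket, "-m", manifest,
         "-a", os.environ.get("AWS_ACCESS_KEY", ""),
         "-s", os.environ.get("AWS_SECRET_KEY", "")],
        # 3) Register the uploaded manifest so it is launchable as an AMI.
        ["ec2-register", "%s/image.manifest.xml" % bucket],
    ]
    if not dry_run:
        for cmd in cmds:
            subprocess.check_call(cmd)
    return cmds
```

Run it from cron on the instance and a terminated box is an inconvenience instead of a disaster.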


Long story short, the instances that were erroneously terminated were not recoverable.  If you did not have either backups to restore from or some other form of redundancy, then you might have found yourself in a bad position.  There is a thread on the AWS forums about all of this that is worth a read.  Here are Amazon's final statements:

This is an update on the EC2 issues experienced today.  A software deployment caused our management software to erroneously terminate a small number of user’s instances.  When our monitoring detected this issue, the EC2 management software and APIs were disabled to prevent further terminations.  Once we corrected the problem, we restored the management software.

We will contact users that lost instances directly by email.  At this point,  the service is fully functional, and you should be able to launch replacement instances immediately.

While we have corrected the immediate bug, we are also adding additional checks to prevent this sort of issue from recurring in the future.

We are aware of the following outstanding issues which we are working to resolve now:
1/ Some instances may get stuck in the "shutting down" state until we have completed our clean-up.  These instances will not be billed and will be fully terminated shortly.
2/ Some instances will not show their launch indexes in describe-instances API.
We will keep you posted as we resolve these remaining issues.

To address a few of the questions posed on this thread:

The availability of the EC2 APIs is very important and it remains our goal to keep them highly available.  We believe disabling the management software was the correct decision because of the risk to running instances.  This is not a decision we take lightly, and we will work to avoid having to make this choice in the future.

There was no correlation between the instance terminations, so users with redundancy built into their instance deployments would have been better able to deal with the terminations.  We also understand that failure isolation is very important and we are hard at work on additional functionality to help with this.

Please let us know if you experience any unexpected behavior.
The Amazon EC2 Team


There is also a follow-up post by the folks at WeoCeo on their blog about how they handled the outage.  A very nice solution, I think.

There is also Amazon's official explanation, from the same forum thread, on how to make sure bad things don't happen to people you love if their AMI *poofs*:

How you make your instance redundant or "ready to be replaced" is highly dependent on the services it provides and the facilities it uses. From a high-level the process could look something like:

1) Create an AMI that automatically acquires and configures its resources upon instance creation (boot up).

2) Monitor those instances or their services for availability.

3) If the service or function that the instance provides becomes unavailable, or worse, the instance becomes unavailable, then launch a new instance of the same AMI.

There are numerous different use-cases that would require tons of additional detail for each of these steps in the process. There are also use-cases that would dictate active redundancy as opposed to the standby redundancy outlined here.
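The standby-redundancy loop in step 3 can be sketched in a few lines.  This is a hand-rolled example, not anything from Amazon: is_healthy and launch_replacement are assumed callables that you would wire up to your own service check and to an ec2-run-instances call (or its API equivalent).

```python
import time

def monitor(is_healthy, launch_replacement,
            check_interval=30, max_failures=3, cycles=None):
    """Standby-redundancy watchdog.

    is_healthy:         callable returning True while the service responds
    launch_replacement: callable that boots a fresh instance of the same AMI
    max_failures:       consecutive failed checks before we relaunch
    cycles:             stop after this many checks (None = run forever)
    Returns the number of replacement launches performed.
    """
    failures = 0
    launches = 0
    checked = 0
    while cycles is None or checked < cycles:
        checked += 1
        if is_healthy():
            failures = 0
        else:
            failures += 1
            if failures >= max_failures:
                # Service (or instance) is gone: launch a replacement
                # and start counting fresh against the new instance.
                launch_replacement()
                launches += 1
                failures = 0
        if cycles is None or checked < cycles:
            time.sleep(check_interval)
    return launches
```

An active-redundancy variant would instead keep N instances running at all times and replace any that drop out, rather than waiting for a lone instance to fail before launching its stand-in.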

Something to point out: the security of your data is always up to you, no matter what.  You always have to ask, "What would happen if I unplugged this part?"

It's very clear now that Amazon has decided not to invest in a customer-facing AMI redundancy feature at this point in the beta.  I really think they should consider it as a premium option, however, because it just makes sense.  I think people would be willing to pay a premium to KNOW that their AMI was highly available.  The technology for this should already exist in their system, but I suspect it might be pretty resource intensive.

In summary, I still think Amazon EC2 is a cutting-edge implementation of on-demand computing.  But remember, it's in beta, so if you intend to use it, make heavy use of persistent storage, backups, and high-availability techniques to ensure that your data remains safe.  Possibly consider some of the 3rd-party add-on services like WeoCeo, RightScale, and ElasticDrive that will make a lot of this easier for you.