Uptime:  sovereign of the SLA, king of the data center, eater of weekends, destroyer of budgets.

Traditional IT architectures have been dominated by the overwhelming need to provide highly available, highly resilient infrastructures for our current crop of applications. Fault tolerance is sought at every level possible, from hardware with backup components, to multiple paths for network and data, to multiple cooperating compute platforms. High-availability (HA) cluster software, load balancing, and component redundancy all work to eliminate any single point of failure in the infrastructure's critical path, ensuring our application can survive the inevitable failure.

And failure is inevitable; the very availability measures we add increase the likelihood of failure since we've increased the number of components that could fail.  That conundrum is the root of a truism I heard in my years of designing HA systems:  Every 9 you add also increases the cost by an order of magnitude.


Let me back up a bit to make sure that statement is clear.  The standard measure of SLA uptime is a percentage, usually starting at 99%. That would be two 9s, and it translates to being able to tolerate about 3.5 days of downtime for the application over the course of a year.  Each additional 9 comes after the decimal point, reducing the previous downtime figure by an order of magnitude. So three 9s is 99.9%, or a little under nine hours, and five 9s is 99.999%, or about five minutes. Again, I want to stress this is the amount of time during a year of 24x7 operations (8,760 hours) that your application can be unavailable.
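If you want to play with the nines math yourself, here's a minimal sketch — plain Python, nothing provider-specific — that turns a number of 9s into allowed downtime per year:

```python
# Allowed downtime per year for a given number of 9s,
# assuming 24x7 operation (8,760 hours per year).

HOURS_PER_YEAR = 24 * 365  # 8,760

def allowed_downtime_hours(nines: int) -> float:
    """Hours of downtime a year that an SLA with this many 9s can tolerate."""
    availability = 1 - 10 ** (-nines)   # 2 nines -> 0.99, 3 nines -> 0.999, ...
    return HOURS_PER_YEAR * (1 - availability)

for nines in range(2, 6):
    hours = allowed_downtime_hours(nines)
    print(f"{nines} nines: {hours:8.3f} hours of allowed downtime per year")

# 2 nines -> 87.600 h (about 3.5 days)
# 3 nines ->  8.760 h (a little under nine hours)
# 5 nines ->  0.088 h (about five minutes)
```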

Downtime can have a critical impact on a business: imagine how much money a major online retailer could lose if it were down for three days, or how much a major financial institution could lose in trades over nine hours. So it makes sense to weigh the impact of downtime to determine how much risk you are willing to mitigate in the infrastructure with HA strategies.

Now that we are looking to move these applications away from traditional IT environments to cloud-style environments (public or private), we are dragging along the HA baggage that comes with them. We like the Infrastructure-as-a-Service (IaaS) idea because our major public cloud provider tells us about their three 9s SLA, which means we only need to pay attention to certain components and we are sitting pretty.

Unfortunately, when you replicate your traditional HA architecture in your cloud IaaS, it fails miserably.  Why?


The answer is simple:  the cloud isn't architected for your infrastructure components to be three 9s or better; it's the provider's infrastructure that will be three 9s or better.  Your individual components are disposable and fragile.  And there's not a lot you can do about it at that level.

But Matt, I've installed all the components and everything is working, so once again, you are so very wrong.

I'm glad it's working; enjoy it while it lasts.  Because it won't.  And when it doesn't, you won't have a clean path to fix it. The problem comes from the underlying design of clouds versus traditional servers, and even virtual guests.  And the fact that you are doing it wrong as a result.

Let's take a tour of three core components that are provided for you by a cloud IaaS: compute, network, and persistent block storage.  These just happen to be the touch points for HA configurations.

Cloud compute is designed to be ephemeral.  Terminating and relaunching an instance can change where it resides in the infrastructure, its IP address and hostname, and who its neighbors are.  HA cluster nodes are designed to be permanent.  It is a major problem if a node changes IP or hostname on restart after a termination.  It's a big problem if a node is suddenly competing for resources with other processes.
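To see how far the ground can shift under a node, here's a minimal sketch assuming an EC2-style instance metadata service (the 169.254.169.254 endpoint; other providers expose similar data differently) and a hypothetical file recording the identity the cluster was originally configured with:

```python
# Sketch: detect that "this node" is no longer the node your HA config thinks it is.
# Assumes an EC2-style metadata service; EXPECTED_FILE is a hypothetical record
# written when the cluster was first configured.
import json
import urllib.request

METADATA = "http://169.254.169.254/latest/meta-data/"
EXPECTED_FILE = "/etc/cluster/expected_identity.json"  # hypothetical path

def metadata(path: str) -> str:
    with urllib.request.urlopen(METADATA + path, timeout=2) as resp:
        return resp.read().decode().strip()

current = {
    "instance-id": metadata("instance-id"),
    "local-ipv4": metadata("local-ipv4"),
    "hostname": metadata("hostname"),
}

with open(EXPECTED_FILE) as f:
    expected = json.load(f)

drift = {k: (expected.get(k), v) for k, v in current.items() if expected.get(k) != v}
if drift:
    # A traditional HA stack with hard-coded node IPs and hostnames is now broken.
    print("Node identity changed since the cluster was configured:", drift)
```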

Cloud network is designed to be transparent but available.  Few IaaS providers guarantee bandwidth between the various segregation points in their infrastructure.  There are several layers of networking potentially hidden from view: inter-compute-instance, inter-hypervisor-host, and intra- and inter-datacenter at a minimum.  At any given point in time, your cloud compute instances will see reasonable bandwidth and latency.  However, noisy neighbors, path hops, and other unseen factors can drive short-term latency extremely high.  For general usage, this doesn't pose a problem.  In HA heartbeat networks, latency fluctuations are deadly.  And going to a quorum disk won't help, because most IaaS storage is network-based as well.  Oh, and multicast won't save you, because most providers only allow unicast UDP traffic to keep things simpler and quieter in the network.
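To make the heartbeat problem concrete, here's a toy sketch with a hypothetical peer address and thresholds rather than any particular cluster stack: a peer whose reply is merely stuck behind a latency spike longer than the deadline is indistinguishable from a dead peer, and a few of those in a row triggers a failover you didn't need.

```python
# Toy heartbeat monitor showing why latency spikes look like death.
# Peer address, port, and thresholds are hypothetical.
import socket
import time

PEER = ("10.0.1.12", 5405)       # hypothetical heartbeat address on the other node
DEADLINE = 1.0                   # seconds before a reply is "missed"
MISSES_BEFORE_FAILOVER = 3

def heartbeat_once(sock: socket.socket) -> bool:
    sock.sendto(b"ping", PEER)   # unicast UDP: all most providers will give you
    sock.settimeout(DEADLINE)
    try:
        sock.recvfrom(64)        # a reply delayed past the deadline times out here...
        return True
    except socket.timeout:
        return False             # ...and is indistinguishable from a dead peer

misses = 0
with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
    while True:
        misses = 0 if heartbeat_once(sock) else misses + 1
        if misses >= MISSES_BEFORE_FAILOVER:
            print("Peer presumed dead: starting failover (possibly a split brain)")
            break
        time.sleep(1)
```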

Persistent block storage for most IaaS environments is really the least of our problems.  These stores tend to be architected for reliability. Since they show up as block devices, we can add all the standard HA goodies we normally would, like a clustered filesystem across multiple devices in an LVM mirror.  We can also dial in I/O performance with dollars.  So here, not so much of an issue, except that we usually can't attach block storage to more than one compute instance, so there go quorum disks again.
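If you want to see that limitation up close, here's a hedged sketch using boto3 against an EC2-style API, with placeholder volume and instance IDs: attaching a standard block volume that is already attached elsewhere gets refused, which is exactly the shared quorum disk pattern failing.

```python
# Hedged sketch (boto3, EC2-style API, placeholder IDs): standard block volumes
# attach to a single instance at a time, so a shared quorum disk isn't an option.
import boto3
from botocore.exceptions import ClientError

ec2 = boto3.client("ec2", region_name="us-east-1")   # region is illustrative

VOLUME_ID = "vol-0123456789abcdef0"                  # placeholder IDs
NODE_A = "i-0aaaaaaaaaaaaaaaa"
NODE_B = "i-0bbbbbbbbbbbbbbbb"

ec2.attach_volume(VolumeId=VOLUME_ID, InstanceId=NODE_A, Device="/dev/sdf")

try:
    # Trying to attach the same volume to the second cluster node...
    ec2.attach_volume(VolumeId=VOLUME_ID, InstanceId=NODE_B, Device="/dev/sdf")
except ClientError as err:
    # ...is rejected while it's attached elsewhere (typically a VolumeInUse error),
    # so there is no direct equivalent of the traditional shared quorum disk.
    print("Second attach refused:", err.response["Error"]["Code"])
```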

Well then, with the doom and gloom complete, am I suggesting that to go IaaS is to assume massive amounts of downtime risk?  No.

My point is that you need to change your focus to get application uptime in a cloud environment.  In fact, you need to realize that your fancy and expensive HA solution was nothing but a band-aid.

Multi-million-dollar band-aids?

Matt, you've obviously gone off the rails.  Go have a cup of coffee and come back when you are thinking right.

Yes, band-aids.  Traditional HA addresses the infrastructure with the assumption that the application is fragile and can't do the right thing to be available.  We have to make the underlying components resilient because there aren't any development requirements for resilient applications.  Unless you are dealing with massively expensive systems, most traditional HA solutions consist of highly redundant physical components with a software layer that can detect that an application has stopped responding and restart it really fast.  That's it. Quick restarts.  Oh, and some intelligence to route around a lower-level failure.

So Matt, my doubly redundant HA cluster just "turns it off and on again"?
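Essentially, yes. Strip away the vendor packaging and the control logic boils down to something like this deliberately naive sketch (hypothetical health endpoint and service name, not any vendor's actual agent):

```python
# A deliberately naive sketch of what a restart-based HA agent boils down to.
# Real cluster stacks add fencing, quorum, and resource dependencies, but the
# core loop is still "probe and restart". Names below are placeholders.
import subprocess
import time
import urllib.request

HEALTH_URL = "http://127.0.0.1:8080/health"   # hypothetical app health endpoint
SERVICE = "myapp.service"                     # hypothetical systemd unit
CHECK_INTERVAL = 5                            # seconds between probes

def healthy() -> bool:
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=2) as resp:
            return resp.status == 200
    except OSError:
        return False

while True:
    if not healthy():
        # The "high availability" part: turn it off and on again, quickly.
        subprocess.run(["systemctl", "restart", SERVICE], check=False)
    time.sleep(CHECK_INTERVAL)
```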

In order to get any other level of HA, you need to address the application, not the container.  Session state, data coherency caching, multi-access data tables: all of these are attempts to get more application-level redundancy in place.  The realm of active-active HA is often glossed over by describing it as "all traffic to all nodes," but the need for the application to be able to function in that manner is often left unspoken.
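For a concrete flavor of addressing the application rather than the container, here's a minimal sketch of pushing session state into a shared store so any node can serve any request. Flask and Redis are purely illustrative choices, and the hostname, keys, and TTL are placeholders.

```python
# Application-level redundancy: session state lives in a shared store instead of
# node-local memory, so any node can serve any request and a node disappearing
# loses no sessions. Hostname, key scheme, and TTL are placeholders.
import json
import uuid

import redis
from flask import Flask, jsonify, request

app = Flask(__name__)
sessions = redis.Redis(host="sessions.internal", port=6379)  # hypothetical shared endpoint
SESSION_TTL = 1800  # seconds

@app.post("/login")
def login():
    session_id = str(uuid.uuid4())
    state = {"user": request.json.get("user"), "cart": []}
    sessions.setex(f"session:{session_id}", SESSION_TTL, json.dumps(state))
    return jsonify({"session_id": session_id})

@app.get("/whoami")
def whoami():
    session_id = request.args.get("session_id", "")
    raw = sessions.get(f"session:{session_id}")
    if raw is None:
        return jsonify({"error": "no such session"}), 401
    # Any node in the pool can answer this: no affinity, no failover dance.
    return jsonify(json.loads(raw))
```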

In the cloud, you can't ignore availability as a requirement of your application.  It becomes a basic development requirement, just like the UI and the business logic.  "Active-active HA" is the normative model in IaaS environments.  Your application has to be able to tolerate a node disappearing or a network partition arising.  Your application has to do the right thing based on your tolerance for downtime and lost transactions.  And this is, by the way, the only way you get to take advantage of the elastic nature of cloud resources.  Auto-scaling and fault tolerance go hand in hand.
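And here's a sketch of what "do the right thing" can look like from the calling side of your own services, with hypothetical endpoints and an idempotency-key convention your server would have to honor: a vanished node costs a retry instead of a lost or doubled transaction.

```python
# Tolerating a disappearing node: try replicas in turn, and send an idempotency
# key so a retry after an ambiguous failure can't double-apply the write.
# Endpoints and the header name are hypothetical conventions for this sketch.
import uuid

import requests

REPLICAS = [
    "http://orders-a.internal:8080",   # placeholder service endpoints
    "http://orders-b.internal:8080",
    "http://orders-c.internal:8080",
]

def place_order(payload: dict) -> dict:
    idempotency_key = str(uuid.uuid4())
    last_error = None
    for base in REPLICAS:
        try:
            resp = requests.post(
                f"{base}/orders",
                json=payload,
                headers={"Idempotency-Key": idempotency_key},
                timeout=2,
            )
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException as err:
            last_error = err           # node gone or partitioned: move on
    raise RuntimeError(f"all replicas failed: {last_error}")
```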

So look to the cloud for your applications, but understand that your traditional IT architectures won't necessarily apply.  It's time to rip off the "Operations will handle it" band-aid.