Architectural proposal for highly resilient cloud infrastructures on Azure

Amazon’s AWS outage taught us that running mission-critical business systems in the cloud doesn’t mean you can skip planning for high availability. Learn what you need to consider in this blog post.

One week ago, Amazon’s AWS customers experienced first-hand what it means to depend on a cloud provider’s infrastructure when the US-EAST-1 region went down due to an employee’s error. Since then, I’ve had lots of discussions about the resiliency, reliability and redundancy of cloud infrastructures, especially of Microsoft’s Azure and Office 365 datacenters. But what happened? And is there any way to keep your business from going down if there is a hardware outage in one of the Azure/Office 365 regions?

What happened?

During a maintenance window, Amazon’s Simple Storage Service (S3) team inadvertently removed a larger set of servers than intended, which caused a chain of reactions and finally led to an outage in the storage system. At the end of the day, the outage only affected one (!) datacenter region but caused tremendous damage. Now what can you, the customers, do in order to keep your business running even if one datacenter region crashes? Well, there are several options.

Common sense and hybrid BCDR concepts

As I already explained in a blog post from early summer 2016, common sense is one of the most important qualities an admin must have in order to be allowed to administer your IT infrastructure. The same is true for planning cloud architectures. What I often do in my projects is use a variety of business continuity and disaster recovery (BCDR) mechanisms. First of all: going hybrid is a good way of reaching high availability.

Azure BCDR mechanisms for hybrid use
  1. Using Azure Backup with a centralized backup service like System Center Data Protection Manager (SCDPM) or Microsoft Azure Backup Server (MABS) helps you to get rid of your backup tapes. Short-term backup is stored locally on disks, long-term archiving goes to Azure storage. With instant file restore you can restore files from recovery points in virtually no time.
  2. Unstructured data, e.g. the files on your file servers, is stored on StorSimple appliances that protect your data directly to Azure without using local space on your production SAN or your backup storage system. Furthermore, with Meta Restore your file servers are back online within a few minutes after an outage. I’m going to provide further information on this in one of my next blog posts.
  3. Guest clusters that have to be site resilient are built geo-redundantly. Exchange DAGs, for example, can span several datacenters, enabling you to fail over automatically to the second datacenter if the first site crashes. Or you go hybrid with Exchange Online/Office 365, as this service is geo-redundant and thus highly available by itself. SQL databases can replicate to Azure using SQL Always On.
  4. VMs and physical servers that are not part of a failover cluster but that have to come back online with a low recovery time objective (RTO) are replicated to Azure compute using Azure Site Recovery (ASR).

With those four methods you can go quite far in terms of availability and reliability, as the chance of your datacenter and the Azure region of your choice going down at the same time is fairly slim.
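To put a rough number on "fairly slim", here is a small back-of-the-envelope sketch in Python. The availability figures are placeholders I made up for illustration, not SLA values, and the calculation assumes that your datacenter and the Azure region fail independently of each other, which is a simplification. It also ignores the time a failover itself takes.

```python
# Back-of-the-envelope availability math for two redundant sites.
# All numbers are illustrative assumptions, not measured or contractual values.

HOURS_PER_YEAR = 8760.0


def availability_to_downtime_hours(availability: float) -> float:
    """Expected downtime per year for a given availability (0.0 to 1.0)."""
    return (1.0 - availability) * HOURS_PER_YEAR


def combined_availability(*availabilities: float) -> float:
    """Availability of redundant sites, assuming independent failures.

    The service is only down when every site is down at the same time.
    """
    probability_all_down = 1.0
    for availability in availabilities:
        probability_all_down *= 1.0 - availability
    return 1.0 - probability_all_down


on_premises = 0.99    # hypothetical own datacenter
azure_region = 0.999  # hypothetical single Azure region

both = combined_availability(on_premises, azure_region)
print(f"on-premises only: {availability_to_downtime_hours(on_premises):.1f} h/year")
print(f"hybrid:           {availability_to_downtime_hours(both):.2f} h/year")
```

The point is not the exact figures but the multiplication: two independent sites that are each down rarely are down together far more rarely.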

But what do you have to consider if Azure is not only your disaster recovery site but also your production environment?

VMs are placed in Availability Sets

Single-instance VMs are not highly available, neither on-premises nor in Azure. Yes, there are people who state that a single Exchange server running on a Hyper-V failover cluster is good enough, but a failover from one cluster node to another is never free of interruption and shouldn’t be done with database servers. In Azure you put VMs of the same type or function into Availability Sets. By using update and fault domains in an availability set you make sure that at least one VM remains running in case of an upgrade event or a hardware failure in an Azure datacenter.
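The idea behind fault and update domains can be sketched in a few lines of Python. This is a toy model, not how the Azure fabric actually places VMs: the round-robin assignment, the domain counts and the VM names are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Placement:
    vm: str
    fault_domain: int
    update_domain: int


def place_vms(vm_names: List[str], fault_domains: int = 2,
              update_domains: int = 5) -> List[Placement]:
    """Spread VMs round-robin across fault and update domains (simplified)."""
    return [
        Placement(name, index % fault_domains, index % update_domains)
        for index, name in enumerate(vm_names)
    ]


def survivors_after_fault(placements: List[Placement], failed_domain: int) -> List[str]:
    """VMs still running when one fault domain loses power or network."""
    return [p.vm for p in placements if p.fault_domain != failed_domain]


def survivors_during_update(placements: List[Placement], updating_domain: int) -> List[str]:
    """VMs still running while one update domain is being rebooted."""
    return [p.vm for p in placements if p.update_domain != updating_domain]


# Hypothetical web tier with three VMs in one availability set.
placements = place_vms(["web-0", "web-1", "web-2"])
for failed in range(2):
    print(f"fault domain {failed} down:", survivors_after_fault(placements, failed))

# A single VM has nowhere else to go.
single = place_vms(["exchange-0"])
print("single VM, fault domain 0 down:", survivors_after_fault(single, 0))
```

With two or more VMs in the set, every single-domain failure leaves at least one VM running; with one VM the list of survivors is empty, which is exactly the single-instance problem described above.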

Geo-redundancy might be important

The AWS outage taught us that using local redundancy doesn’t make you invincible! So put your VMs and services onto geo-redundant storage (GRS).

GRS replicates your data to a secondary region that is hundreds of miles away from the primary region. If your storage account has GRS enabled, then your data is durable even in the case of a complete regional outage or a disaster in which the primary region is not recoverable.

For a storage account with GRS enabled, an update is first committed to the primary region, where it is replicated three times. Then the update is replicated asynchronously to the secondary region, where it is also replicated three times.

With GRS, both the primary and secondary regions manage replicas across separate fault domains and upgrade domains within a storage scale unit, just as locally redundant storage (LRS) does within a single region.
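To make the write path a bit more tangible, here is a toy model in Python. It only mirrors the behaviour described above (three copies in the primary region first, then asynchronous replication to three copies in the secondary). The class and method names are invented for this sketch and have nothing to do with the real Azure storage implementation or SDK.

```python
from collections import deque


class Region:
    """A region that keeps a fixed number of copies of every blob."""

    def __init__(self, name: str, replica_count: int = 3):
        self.name = name
        self.replicas = [dict() for _ in range(replica_count)]

    def commit(self, key: str, value: str) -> None:
        # Write the update to every local replica.
        for replica in self.replicas:
            replica[key] = value

    def read(self, key: str):
        return self.replicas[0].get(key)


class GeoReplicatedAccount:
    """Toy GRS account: synchronous in the primary, queued for the secondary."""

    def __init__(self):
        self.primary = Region("primary")
        self.secondary = Region("secondary")
        self._pending = deque()

    def write(self, key: str, value: str) -> None:
        # The write is acknowledged once the primary region has its copies.
        self.primary.commit(key, value)
        self._pending.append((key, value))

    def replicate_pending(self) -> int:
        """Stand-in for the asynchronous geo-replication; returns items shipped."""
        shipped = 0
        while self._pending:
            key, value = self._pending.popleft()
            self.secondary.commit(key, value)
            shipped += 1
        return shipped


account = GeoReplicatedAccount()
account.write("web-0.vhd", "version 1")
print("secondary before replication:", account.secondary.read("web-0.vhd"))
account.replicate_pending()
print("secondary after replication: ", account.secondary.read("web-0.vhd"))
```

The gap between the two print statements is the important part: because replication is asynchronous, the secondary can lag behind the primary, so the most recent writes may be missing there when a region goes down.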

The GRS replica is not directly accessible unless Microsoft initiates a failover to the secondary datacenter region. So if the primary site cannot be brought back online, you will be redirected to the secondary site so you can continue to access your data and services. But what if a datacenter can be brought back online after several hours, just as was the case with AWS last week? Do you have to sit through an outage of several hours? Does Microsoft initiate a failover? Can you react at short notice? Yes, you can!

Jump from one datacenter to another one

Read-access geo-redundant storage (RA-GRS) is essentially the same as GRS, plus read-only access to the secondary site. There is a lot of fancy stuff you can do with the secondary storage, and I’d like to give you some ideas:

  1. Access secondary data for analysis: RA-GRS replicates your production blobs to a secondary site. As you have read-only access to the secondary site, you can read the data there and analyze it.
  2. Use RA-GRS to copy blobs from your primary Azure region to the secondary: Outgoing traffic costs! Well, at least if you manually copy data from one Azure region to another one. RA-GRS automatically replicates data between two (pre-defined) Azure regions.
  3. Use RA-GRS to jump from your primary Azure region to the secondary with your whole infrastructure: Well, that’s the supreme discipline, and you can master it if you know what to do. You definitely need to plan before the outage occurs, because once it has happened you’re only a passenger on a journey you don’t want to be on. So first of all, make sure you use RA-GRS. What you also need is knowledge of how to use ARM templates. With ARM templates you can build your ARM infrastructure based on pre-defined parameters. VMs mainly consist of virtual hard disk files that are saved as page blobs in a storage account. Everything around them, such as VM size, NICs, IP addresses, VNETs and so on, can be rebuilt using ARM templates. So using RA-GRS in addition to ARM templates enables you to move your production infrastructure to a secondary Azure region.
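The read-only access to the secondary can be sketched in Python as well. With RA-GRS the secondary blob endpoint is the account name with a "-secondary" suffix. Everything else here is an assumption for illustration: the account name contosovhds, the blob path and the injected fetch function are hypothetical, and a real request would also need authentication (for example a SAS token), which I leave out.

```python
from typing import Callable, Tuple


def primary_blob_endpoint(account: str) -> str:
    return f"https://{account}.blob.core.windows.net"


def secondary_blob_endpoint(account: str) -> str:
    # RA-GRS exposes the read-only replica under the "-secondary" host name.
    return f"https://{account}-secondary.blob.core.windows.net"


def read_with_fallback(account: str, blob_path: str,
                       fetch: Callable[[str], bytes]) -> Tuple[bytes, str]:
    """Try the primary endpoint first, then the read-only secondary.

    `fetch` is any function that downloads a URL and raises OSError on
    network failure. Returns the data and the endpoint that served it.
    """
    last_error = None
    for endpoint in (primary_blob_endpoint(account), secondary_blob_endpoint(account)):
        try:
            return fetch(f"{endpoint}/{blob_path}"), endpoint
        except OSError as error:
            last_error = error
    raise last_error


def fake_fetch_primary_down(url: str) -> bytes:
    """Stand-in for a real HTTP client while the primary region is offline."""
    if "-secondary." not in url:
        raise ConnectionError("primary region unreachable")
    return b"replica bytes"


data, served_by = read_with_fallback("contosovhds", "vhds/web-0.vhd",
                                     fake_fetch_primary_down)
print("served by:", served_by)
```

Keep in mind that this only covers reading. To actually run VMs in the secondary region you still need somewhere writable for the disks plus the ARM templates that rebuild everything around them, which is why the planning has to happen before the outage.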

You see, there are several things to consider in terms of BCDR. On the one hand, you can go hybrid using Azure as your BCDR site; on the other hand, you can use Azure as your production environment and make it geo-redundant and highly available as well. Furthermore, you can use any combination of both. And those are only the infrastructure considerations we’ve covered so far.

Kind regards and see you soon,
Tom

Author: Tom Janetscheck

Cloud Security Enthusiast | Security Advocate
