The systemic risk of public cloud

Posted by Geoff Davies on Jan 6, 2020 3:21:00 PM

When UK mobile operator O2’s data network went down for a whole day in December 2018, it brought home to many people just how interconnected the services we take for granted really are. The inability to get email on the move or to use Google Maps to navigate to a meeting was an annoyance, but for thousands of Uber drivers it was more serious, leaving them with little choice but to buy Pay as You Go SIMs on other networks in order to continue making their living.

In fact, Uber drivers were fortunate: the other networks were operating normally, and switching to another one took a matter of minutes. Things were not so easy for the tens of millions of people who rely on Microsoft’s cloud-based Office365 service when its Multi-factor Authentication suffered prolonged downtime in November. Not only were they unable to access their Office365 apps and data, but many also discovered that they were effectively locked out of other applications such as Smartsheet, Xero, and Insightly, which can share Office365 authentication. With a cellular network outage it’s easy to switch SIMs, but when parts of Office365 go down, those who rely on it have no option but to wait for Microsoft to fix it.

This highlights two potential problems for organizations which rely on the availability of applications and services in the cloud. Firstly, cloud services do go down, and it’s not easy to switch to another cloud provider when that happens.

But perhaps more importantly, a huge number of companies are relying on the availability of a very small number of public clouds. (Of course, to access them they also rely on a very small number of telecoms networks, but that’s another story.)

According to the Cloud Security Alliance, about 42% of application workloads run on Amazon Web Services, and a further 29% run on Microsoft’s Azure. The fact that AWS and Azure together account for well over two-thirds of cloud workloads has become a cause for concern to regulatory authorities in many industries. For example, last July the European Banking Authority issued a report warning of the systemic risk arising from the international banking system’s concentration into such a small number of public clouds.

So what can be done to mitigate the risks presented by this concentration of computing resources in a tiny number of huge public clouds, which will, from time to time, encounter availability problems?

A lot can certainly be done at the application level, by architecting applications for the cloud so that they meet specific resilience requirements, including specific RPOs (Recovery Point Objectives, the maximum tolerable data loss) and RTOs (Recovery Time Objectives, the maximum tolerable downtime).
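To make the RPO and RTO idea concrete, here is a minimal Python sketch of how a recovery design might be checked against an application's targets. It is an illustration only: the class names, the `find_gaps` function and the figures are hypothetical, and it assumes that worst-case data loss equals the interval between backups and that the restore time comes from an actual DR test.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List


@dataclass(frozen=True)
class ResilienceTarget:
    """What the business says it can tolerate for one application."""
    rpo: timedelta  # Recovery Point Objective: maximum tolerable data loss
    rto: timedelta  # Recovery Time Objective: maximum tolerable downtime


@dataclass(frozen=True)
class RecoveryDesign:
    """What the current design delivers (hypothetical, measured values)."""
    backup_interval: timedelta   # time between backups or replicated snapshots
    restore_duration: timedelta  # time taken to restore service in a DR test


def find_gaps(design: RecoveryDesign, target: ResilienceTarget) -> List[str]:
    """Return a list of shortfalls; an empty list means the targets are met."""
    gaps = []
    # Assumption: a failure just before the next backup loses one full interval.
    if design.backup_interval > target.rpo:
        gaps.append(
            f"RPO missed: up to {design.backup_interval} of data could be lost, "
            f"target is {target.rpo}"
        )
    if design.restore_duration > target.rto:
        gaps.append(
            f"RTO missed: restore takes {design.restore_duration}, "
            f"target is {target.rto}"
        )
    return gaps


if __name__ == "__main__":
    # Illustrative numbers only, not taken from any real application.
    target = ResilienceTarget(rpo=timedelta(minutes=15), rto=timedelta(hours=1))
    design = RecoveryDesign(
        backup_interval=timedelta(hours=1),
        restore_duration=timedelta(minutes=30),
    )
    for gap in find_gaps(design, target):
        print(gap)
```

In this example the design restores quickly enough but backs up too rarely, so only the RPO shortfall is reported. The point is that both numbers need to be stated per application before an architecture can be judged against them.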

That means it’s important, when moving to the cloud, to conduct application-centric migrations and transformations that capture these requirements and determine other key information, such as each application’s business criticality.

Indeed, it is in part for this reason that we use the AppScore platform to capture, assess and plan at the application level in order to ensure successful cloud adoptions.

It’s also important to realize that putting an application or service in the cloud doesn’t free you from the responsibility of keeping it running: standard, well-established principles of redundancy and resilience still need to be applied. That means you need a disaster recovery plan in place that’s tested and proven.

The good news is that DR from one cloud location (region) to another can be far easier than switching from an on-premises data centre to an alternative site. It’s important to use a multi-region strategy and, where the cloud provider supports it (as AWS does), it’s wise to reserve capacity in a specific availability zone. A disaster in one region could mean a large number of companies trying to recover to another cloud data centre simultaneously, and without reserved capacity you could get locked out.
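As a rough illustration of why reserved capacity matters in a multi-region plan, the Python sketch below picks a recovery region from an ordered preference list. Everything in it is hypothetical: the `Region` class, the `pick_recovery_region` function, the region labels and the capacity figures are made up for the example, and a real plan would take health and reservation data from the cloud provider rather than from hard-coded values.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Region:
    name: str
    healthy: bool
    reserved_instances: int  # capacity reserved ahead of time for DR


def pick_recovery_region(
    regions: Sequence[Region], failed: str, needed: int
) -> Optional[Region]:
    """Return the first preferred region that is healthy and has enough
    reserved capacity, or None if no region qualifies."""
    for region in regions:
        if region.name == failed:
            continue  # never fail over to the region that is down
        if region.healthy and region.reserved_instances >= needed:
            return region
    return None


if __name__ == "__main__":
    # Illustrative preference order and capacity figures.
    regions = [
        Region("eu-west-1", healthy=False, reserved_instances=20),
        Region("eu-west-2", healthy=True, reserved_instances=0),
        Region("eu-central-1", healthy=True, reserved_instances=20),
    ]
    choice = pick_recovery_region(regions, failed="eu-west-1", needed=12)
    print(choice.name if choice else "no region with reserved capacity")
```

Here the second region is healthy but has nothing reserved, so it is skipped in favour of the third. Without a reservation, recovery would depend on whatever on-demand capacity is left while everyone else is failing over too.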

The bottom line is this: the cloud may be a different world, but tried and tested resilience and redundancy principles still apply. When used effectively, the cloud provides greater resilience options at a better price point than on-premises or co-located data centres ever can.

Just remember to factor this into your cloud migrations, and to understand how critical each application is to the business and what its resilience requirements are. That means running cloud adoption at the application level rather than at the server level.