If you ask most IT Directors what single system they support that carries with it the highest level of stress, you’ll likely get a pretty uniform response – their e-mail system.
This point was driven home like a wooden stake in the heart of the vampire last week when Microsoft’s Cloud e-mail offering, Exchange Online – part of their Office 365 suite – decided to go tango uniform for more than 5 hours.
Most people who decide to move their e-mail operations to the cloud do so for one of two reasons. The statistically more likely reason is that properly implementing and operating e-mail systems involves a great deal of both capital investment and operating expense. Like it or not, e-mail is mission critical – the vast majority of business transactions today take place in, and are documented by, the organization’s e-mail system. E-mail downtime equals paralysis, pure and simple.
Designing in and then implementing the proper degree of hardware and software fault tolerance is staggeringly expensive, and employing the people who understand and can effectively manage that complex infrastructure is only slightly less so. In startups or small organizations, the necessary expenditures can be prohibitive from both the capital investment and operating expense perspectives.
The second reason is almost a logical outgrowth of the first – uptime and availability are a function of both investment and expertise – and smaller organizations should be able to exploit that advanced expertise and those larger, leveraged investments. At least in theory, the major providers of cloud e-mail services – Microsoft, Google, et al. – should be masters of that discipline. Looked at dispassionately, Microsoft should be able to design and operate a multi-site fault-tolerant service that can take a direct hit, lose a site, and have no one ever notice.
Operative concept here is should.
One of my favorite writers on the subject of engineering is Kevin Cameron, who is a mechanical engineer by trade. Kevin has a large, versatile mind, and his thoughts on design principles and practice have applicability far beyond the internal combustion subjects about which he writes.
One of his recent columns examined the systems integration issues encountered by new forms of multinational production, especially in the commercial aircraft industry. Kevin concluded that the systems now being designed are so staggeringly complex, and their components interact with each other in so many ways, that engineers can no longer predict the kinds of failures that can occur. Once one component failure starts to cascade to other components, the collapse of the system is a completely unpredicted event, one that may force large portions of the system to be redesigned to accommodate it.
Kevin might have been talking about the new Boeing Dreamliner, but he might just as well have been talking about any complex cloud infrastructure – Exchange Online and Amazon Web Services have both experienced widespread system outages that their architects would likely have told you were impossible before they occurred.
Everything built by the hand of man can and will break. The odds of such failures – if proper precautions have been taken – may be infinitesimally low, but they will occur.
Good IT service providers are fully prepared to communicate definitive information to their customers when stuff under their control, and on which their customers depend, inevitably breaks.
My employer’s IT Service contracts define extremely specific notification procedures and methods to be used in the event of a failure with the potential to impact the customer’s business. Those commitments list specific individuals to be contacted, the times they will be provided with initial notification and updates, and the specific communications methods that will be used to make those notifications.
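To make that concrete, here is a minimal sketch of how such a commitment could be written down as data rather than left as good intentions. Everything in it is hypothetical and is not taken from any real contract: the names `Channel`, `Contact` and `NotificationPlan`, the `depends_on` field, the 15 and 30 minute intervals, and the example date are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Channel:
    name: str        # hypothetical label, e.g. "email" or "sms"
    depends_on: str  # the service this channel needs in order to work


@dataclass(frozen=True)
class Contact:
    name: str
    channels: tuple[Channel, ...]  # listed in order of preference


@dataclass(frozen=True)
class NotificationPlan:
    contacts: tuple[Contact, ...]
    initial_notice: timedelta   # deadline for the first notification
    update_interval: timedelta  # time between follow-up updates

    def usable_channels(self, contact: Contact, failed_service: str) -> list[Channel]:
        """Channels for this contact that do not rely on the failed service."""
        return [c for c in contact.channels if c.depends_on != failed_service]

    def schedule(self, outage_start: datetime, updates: int) -> list[datetime]:
        """Time of the initial notice followed by `updates` follow-ups."""
        first = outage_start + self.initial_notice
        return [first + i * self.update_interval for i in range(updates + 1)]


# Hypothetical example values, for illustration only.
ops_lead = Contact(
    name="Operations lead",
    channels=(Channel("email", "hosted-email"), Channel("sms", "cell-network")),
)
plan = NotificationPlan(
    contacts=(ops_lead,),
    initial_notice=timedelta(minutes=15),
    update_interval=timedelta(minutes=30),
)
outage_start = datetime(2024, 1, 1, 9, 0)

reachable = plan.usable_channels(ops_lead, failed_service="hosted-email")
notice_times = plan.schedule(outage_start, updates=2)
```

The one design choice worth noting is that each channel records what it depends on. If the thing that just broke is the hosted e-mail service, the e-mail channel drops out of the list automatically and the plan falls back to whatever is left.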
When people entrust you with their business, the most important thing they want to know is that you, as an IT Service Provider, understand the disruption and problems that a service outage is causing. They want to know that you are aware of the scope of the problem, that you have a plan to address it and are working that plan, and when you expect service to be restored.
Contrast that best-practices approach with what occurred during the Microsoft Exchange Online outage. Astoundingly, with untold tens of thousands of folks depending on the service, Microsoft didn’t appear to have an out-of-band method already defined to proactively communicate with their users.
“I know… we’ll just send out an e-mail…..oh….waitasecond…..um….”
When they did finally settle on a method to communicate with their customers – several hours into the event – we were treated to the unseemly sight of Microsoft resorting to Twitter: one of the biggest technology companies in the world using someone else’s application to get the word out. This event, and the way that Microsoft shared information with their mission-critical customers, made them look unprepared and, frankly, like rank amateurs.
Everyone makes mistakes. It’s how one responds to those mistakes that demonstrates the quality and character of an organization.
If you go to the market to purchase any sort of IT Service, you should rightfully expect that your service provider will have predetermined and failsafe methods already defined to notify you of any issue that has the potential to impact your business. You should also expect those methods and escalation paths to be documented and made part of your agreement with them.
A service provider that doesn’t think that proactive communication around outages and return to service is a critical part of what they should be doing for you is either guilty of extreme hubris, delusional, or maybe both.