I guess I am somewhat OCD when it comes to the uptime of our servers. At first I was very excited when I realised that selling online meant that customers could buy from us at any time of night or day. We were always open even when we weren’t there. However, I soon realised that for that to be truly the case you actually need resilient technologies, 100% uptime and excellent out-of-hours support. So began a long quest to keep our sites online and open for as much time as possible.

The solution has been to partner with top-quality service providers, develop our own web technologies and always follow up anything that looks odd (more on that strange-sounding final point another time). And so, as a technology-led retailer, you consider that you have done all you need to do to keep your site running.

And yet almost every week we have issues to investigate. The upside is that because we have a great team the issues are generally resolved quickly. After each incident we conduct a root cause analysis to identify what exactly went wrong and what we can do to stop it happening again. Yet the root cause analysis keeps coming up with the same source of problems time after time – a slapdash and cavalier approach to uptime from many of the people who we didn’t even realise were our suppliers.

For example, we had a complaint from a customer that there was a security warning when she checked out on our site. She rightly wanted to know whether everything was OK and whether she should be worried about her details being compromised. Root cause analysis found that one of our marketing partners, Mythings, was allowing all of its clients to put down an anonymous tracking cookie after someone had checked out successfully. One of these clients had an out-of-date security certificate and so was generating the error. When, a couple of weeks later, Mythings started serving up a JavaScript error in our checkout tunnel, we decided the problems outweighed the benefits and ceased working with them.

However, it did get me thinking about uptime and depending on other partners. We are currently able to offer around 99.9% uptime including planned maintenance. That’s around 10 minutes of downtime a week and, if it’s planned, that will be at 3am. Now if every partner we work with thinks that having their service available 99.9% of the time is all they need too, and we have 30 dependencies, then in the worst case that works out at around 300 minutes – 5 hours – of degraded experience or downtime every week!
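To sanity-check those numbers, here is a quick back-of-envelope calculation. The 30-dependency count and the worst case where outages never overlap are assumptions carried over from the paragraph above, not measured figures:

```python
# Back-of-envelope check of the uptime arithmetic above.
# Assumption from the text: 30 third-party dependencies, each at 99.9% uptime.

WEEK_MINUTES = 7 * 24 * 60            # 10,080 minutes in a week

uptime = 0.999                        # 99.9% availability per supplier
downtime_per_week = WEEK_MINUTES * (1 - uptime)    # roughly 10 minutes each

dependencies = 30

# Worst case: supplier outages never overlap, so the degraded time adds up.
total_degraded = dependencies * downtime_per_week  # roughly 300 minutes, ~5 hours

# If each supplier fails independently, the chance that everything is up
# at the same moment is the product of the individual availabilities.
combined_uptime = uptime ** dependencies           # roughly 97% effective uptime
```

The last line is the gentler, probabilistic view of the same problem: even if outages overlap randomly rather than stacking end to end, thirty 99.9% suppliers still drag the combined figure down to around 97%.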

So I have begun asking suppliers what we can expect by way of uptime – most are not very forthcoming and in some cases we don’t even know who they are!

So apart from firing everyone who lets us down, what can we do to mitigate the risks of downtime further?

Currently we are prioritising methods of working around suppliers being down. For example, we display independently collected review information on our homepage in a widget supplied by Ekomi. We believe this provides credible third-party validation of our excellent service. But when that widget failed for several hours, it took 30 seconds to load the home page – and loading a page slowly is a classic reason why potential customers bounce off to someone else’s site. But we can mitigate this problem: by putting the widget into an iframe, the page will load at full speed irrespective of how slow the contents of the widget are. A neat tactic for mitigating our risk with supplier downtime.
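As a rough sketch of that iframe tactic – note the widget URL here is a made-up placeholder, not Ekomi’s real endpoint:

```html
<!-- Hypothetical example: the src is a placeholder, not a real widget URL. -->
<!-- The iframe has its own browsing context, so a slow third-party response
     delays only the widget's own content, not the rest of the page. -->
<iframe src="https://widget.example.com/reviews"
        title="Customer reviews"
        width="300" height="250"
        loading="lazy"
        style="border:0;"></iframe>
```

The `loading="lazy"` attribute goes a step further and defers fetching the widget until it is about to scroll into view, so an unresponsive supplier costs the visitor nothing at all above the fold.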