Category Archives: Outages

How systems outages affect business and the real world

When critical systems fail

There are some interesting things coming out of the bushfires royal commission. The last couple of days have highlighted the limitations of the emergency Triple-0 system: surges in the number of calls outstripped available capacity, and overflow calls were put on hold, got recorded messages or were diverted.

The first half-hour of Jon Faine’s show on 774 is worth a listen for those interested, particularly the section from about 10 minutes in, with Garth Head, a former adviser to the Minister for Police and Emergency Services. For geeks, it’s a reminder that the systems we design, implement and manage are sometimes critically important to those who rely on them.

Just in case you need to know

My main web provider logs all their problems onto a fault-tracking database, and publishes them onto the Web, including via RSS, to make sure their customers are kept informed, and can work around things where necessary.

Even down to the most trivial thing.

We are currently experiencing issues with the on hold music on our telephone system. This is causing customers to receive silence when placed on hold. Periodic messages are still being played.

This will be rectified tomorrow morning.

Maybe I don’t need to know that, but it’s reassuring to know they’re being open and honest about any faults that occur. If only all companies were this open.

Citylink goes down

A power outage resulted in a shutdown of Melbourne’s CityLink tollway tunnels today around 9am, for several hours. Apart from the obvious electronic signs that rely on power, I assume it affected lighting and exhaust pumps.

According to the Herald-Sun, Citylink spokeswoman Jean Ker Walsh said: “We have rebooted the systems that allow our operators to manage the tunnels safely.” So there you go. They rebooted the tunnel. Ms Ker Walsh also mentioned on the evening TV news that they’d be upgrading their UPS!

Interestingly on the Herald-Sun’s RSS feed, this story came through in the early afternoon. The feed claimed there was an attached picture, but it turned out not to be a picture of gridlocked cars or an empty motorway — rather it seemed to be a picture of cricket players.

The other effect of the shutdown was the Citylink web site also appeared to lose power… or perhaps it was just snowed under by the traffic. Like some other transport providers, they didn’t cope well under stress.

The Vicroads web site kept running under the load, though apart from showing slow traffic in the area, it didn’t contain specific information relevant to motorists who might be caught there. I assume the information for radio reports and the like is gathered by phone, not off the web sites.

Too scared to wipe your machine just to improve performance?

If you’re too scared to wipe your machine just to improve performance, follow these instructions for keeping your old installation in a virtual machine.

Seriously, you’ve got to check out the screenshot of this guy’s Start Menu. Don’t believe them when they say size isn’t everything!

Outages and response times

Cam ponders web hosting SLAs and wonders what’s reasonable. His host guarantees 99.99% uptime, which works out to about 52 minutes of allowable downtime per year. (His outage was about 9 hours, or about ten years’ worth.)
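For anyone who wants to check the sums, here’s a quick sketch in Python. The function name is my own invention, and it assumes a 365-day year and that the guarantee is measured over a full year rather than per month.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def allowed_downtime_minutes(uptime_percent: float) -> float:
    """Minutes of downtime per year permitted by an uptime guarantee."""
    return MINUTES_PER_YEAR * (1 - uptime_percent / 100)


budget = allowed_downtime_minutes(99.99)  # the 99.99% figure from the SLA
outage_minutes = 9 * 60                   # the roughly 9-hour outage
years_of_budget = outage_minutes / budget

print(f"Allowed downtime: {budget:.1f} minutes per year")
print(f"A 9-hour outage uses about {years_of_budget:.1f} years of that allowance")
```

That gives roughly 52.6 minutes a year, so a 9-hour outage burns through a bit over ten years of allowance.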

Bad stuff happens. We all know that, even with the most reliable setup ever. But there are some major factors in determining what’s acceptable:

Frequency: If it’s happening too regularly, then there’s a reliability problem. Whether it’s better hardware, better software or something else, whatever is causing it needs to be fixed. Cam reckons it’s the second time in a few months.

Response: Obviously, you want a quick response, and a quick (and reliable) solution. There are all sorts of monitoring tools out there these days. Typically anything like a full outage should be known about within minutes. A reputable web host will have substitute hardware ready to switch on and go just as soon as that nice recent backup is restored.

Communications: Any third party like this has to keep the customer informed. There’s no excuse for not doing so. SMS alarms, emails, phone calls, whatever. (I wrote about alarms recently for my work blog.)
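To give a feel for how “known about within minutes” might work, here’s a rough sketch of the logic a monitoring tool could use. It isn’t any particular product’s method. The function name, the one-check-a-minute interval and the three-failures threshold are all assumptions of mine, and the check history is made up.

```python
from typing import Iterable, Optional


def first_alert_index(results: Iterable[bool], threshold: int = 3) -> Optional[int]:
    """Return the index of the check at which an alert would fire, or None.

    `results` holds one boolean per health check (True = site responded).
    An alert fires once `threshold` checks in a row have failed, so a
    single blip doesn't wake anyone up.
    """
    consecutive_failures = 0
    for index, ok in enumerate(results):
        consecutive_failures = 0 if ok else consecutive_failures + 1
        if consecutive_failures >= threshold:
            return index
    return None


# Hypothetical history: one check a minute, site goes down at the fourth check.
history = [True, True, True, False, False, False, False]
alert_at = first_alert_index(history)
print(alert_at)  # 5: the third failed check in a row
```

With checks a minute apart, the alarm goes off on the third failed check, a couple of minutes after the first sign of trouble. A lower threshold means faster alarms but more false ones.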

BTW, Cam’s also having troubles with his iPod… or more accurately, Apple’s 90 day warranty on replacement units.

I reckon he’s jinxed, myself.