The Network is Down! — Avoiding Network Outages

Author
Terry Slattery
Principal Architect

How can you avoid the words that no CEO wants to hear: “The network is down!”? The most important step is a regular network infrastructure review.

It Can Happen To You

How do you know that your network is not an accident waiting to happen? Just because your existing network has never gone down doesn’t mean that it won’t in the future. Computer networks are complex entities in which multiple protocols and many network elements need to function correctly.

John Halamka, CIO of Beth Israel Deaconess Medical Center in Boston, wasn’t aware of any problems in his network until a spanning tree problem took out the network for four days. The story was chronicled in an award-winning article, “All Systems Down,” which appeared in CIO magazine in 2003. I encourage you to read it to learn what happened and how he handled it.

Ah, you say, “That was 2003! That was 12 years ago! It can’t happen anymore.” I suggest that you now go read “Our bullet-proof LAN failed. Here’s what we learned.” Paul Whimpenny, Senior Officer for IT Architecture in the IT Division of the Food and Agriculture Organization of the United Nations, describes a network outage similar to Beth Israel’s that happened very recently. Fortunately, Paul’s outage was only four hours long.

A known type of failure in a common network protocol caused both outages. Could they have been prevented? Sure. Could the outage time have been reduced? Absolutely. Both outages could have been avoided by doing a periodic network review. Think of it as similar to an audit of the financial systems. There are designs and operational best practices that lead to improved network performance and a reduction in potential network failures. Why wouldn’t you want to use them since they lead to better results? Note that implementing best practices often doesn’t incur a substantially greater cost.

Steps of a Network Infrastructure Review

What can prevent future outages like those described in the articles above? The first step is to do periodic network infrastructure reviews. They are like an annual health checkup or an audit. A review should reduce the business risk of a major network failure.

The second step is to implement the recommendations of a review — or at least the most significant findings that create risk of a major failure. At NetCraftsmen, we’ve done a number of reviews where the client then didn’t follow up to correct the most significant problems. Sometimes, the view from the technical staff is “it hasn’t happened yet.” That’s like not carrying automobile insurance because you haven’t yet been in an accident.

Many of the problems we identify in a network review are latent faults that will cause problems only when certain conditions occur. Those conditions will eventually occur. Even after a failure has happened once, as Paul noted in his article about his “bullet-proof” network, the tech staff may say that it can’t happen again. I wouldn’t bet my job on it.

Sometimes we find that the network technical staff feels threatened by an outside review. Here, company management needs to present the review as a regular event, much like a financial review or any other regulatory review. In fact, Sarbanes-Oxley compliance is often justification enough to conduct a network infrastructure review, since the federal law requires companies to maintain and assess internal controls over financial reporting, which extends to the systems that support that reporting.

What are the Costs?

What does it cost to have a network infrastructure review? It depends on the size and complexity of the network. A review of a small network might cost $50,000, while a review of a large, complex network might cost upwards of $150,000. An experienced network review team will be comparable in cost to a financial audit team. The result is a comprehensive report that describes the current state of the network, any vulnerabilities that were found, and the risks associated with each. Obviously, the most serious or highest-risk problems should be corrected as soon as possible.

Another perspective on the cost is the value of avoiding a failure. Examine how much each hour of outage costs the company and estimate how long an outage may last, based on stories like those above. Multiplying the two provides an approximate cost of a single outage, which you can compare against the price of a review.
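As a rough illustration of that arithmetic, here is a minimal Python sketch. The hourly outage cost is a purely hypothetical placeholder that you would replace with your own company's figure; the outage durations (four hours and four days) and the review price come from the examples in this article. The function names are my own and are only there to make the calculation readable.

```python
def outage_cost(hourly_cost: float, outage_hours: float) -> float:
    """Estimated business cost of one outage: cost per hour times duration."""
    return hourly_cost * outage_hours


def break_even_hours(review_cost: float, hourly_cost: float) -> float:
    """Hours of outage whose cost equals the price of a review."""
    return review_cost / hourly_cost


# ASSUMPTION: hypothetical cost per hour of downtime; use your own number.
HOURLY_COST = 25_000.0

# Upper end of the review price range quoted earlier in this section.
REVIEW_COST = 150_000.0

# Outage durations from the two stories described above.
durations_hours = {"four-hour outage": 4, "four-day outage": 4 * 24}

for label, hours in durations_hours.items():
    cost = outage_cost(HOURLY_COST, hours)
    print(f"{label}: ${cost:,.0f}")

hours_needed = break_even_hours(REVIEW_COST, HOURLY_COST)
print(f"A review pays for itself if it prevents {hours_needed:.1f} hours of outage")
```

The point of the exercise is the comparison, not the precision. Even a conservative hourly figure usually shows that one multi-day outage costs far more than a review.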

Summary

Significant network outages are preventable. Network infrastructure reviews, just like fire code reviews and financial reviews, provide the information that allows company management to understand the risks to the business. For a deeper conversation about what would be involved in a review for your organization, feel free to reach out.

A review is just the first step. It needs to be followed by a program to correct the highest-risk findings. The result is greater confidence in the ongoing health of the network and a better chance of never hearing the words you dread: The network is down!
