Network Availability and High Availability Networks

Author
Terry Slattery
Principal Architect

Scott Hogg of Global Technology Resources (GTRI) wrote a nice blog post for Network World back in April 2009 about High Expectations of Network Availability (http://www.networkworld.com/community/node/40827) and a slightly more recent one in May, “Forget Five-9s — Go for 100%!” (http://www.networkworld.com/community/node/42281).  Scott spends a lot of time working with customers and has a good perspective on the requirements for a smoothly running network.

Related to the topic of smoothly running networks is the CiscoLive presentation on High Availability (HA) networks, High Availability Networking (>5-nines), which offered specific advice for Cisco-based HA networks but can also apply to networks built with other products.

I mention these blog posts and the presentation because more networks are being designed and run with high availability goals.  It is important to have realistic design goals when you are designing a network for high reliability.  Too much redundancy can actually make it more difficult to know what is going on within the network and to know how the network will react to specific failures.

I once did a consulting job where an organization had more than two paths from each site.  The problem was that the network was not engineered to handle failures.  The assumption was that failures would be infrequent and that the network would operate with slightly degraded performance when one occurred.  In practice, the remaining links became overloaded and had so much packet loss that the primary business application wouldn’t run correctly.  The result was that the network oscillated as traffic switched from the primary path to a backup path.  The backup path subsequently became overloaded, causing traffic to then switch to an alternate backup path.  While the network continued to operate, application delays became significant due to the overloaded paths and subsequent packet loss.  The point of this example is that an overly redundant network that is not well designed can have no downtime, yet the applications act as if the network were down.
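
The kind of check that was missing in that design can be sketched in a few lines.  This is only a rough illustration, not the method used on that job: the link names, capacities, loads, the 80% utilization threshold, and the assumption that traffic from a failed link spreads evenly over the survivors are all made up for the example (real routing protocols will distribute traffic according to their own metrics).

```python
def overloaded_after_failure(links, failed, threshold=0.8):
    """Return the surviving links whose utilization would exceed `threshold`
    if the load of the `failed` link were split evenly across the others.

    `links` maps a link name to {"capacity_mbps": ..., "load_mbps": ...}.
    The even split is a simplifying assumption for illustration only.
    """
    survivors = {name: link for name, link in links.items() if name != failed}
    if not survivors:
        raise ValueError("no surviving links to carry the traffic")

    extra = links[failed]["load_mbps"] / len(survivors)
    overloaded = {}
    for name, link in survivors.items():
        utilization = (link["load_mbps"] + extra) / link["capacity_mbps"]
        if utilization > threshold:
            overloaded[name] = round(utilization, 2)
    return overloaded


# Hypothetical site with one primary and two smaller backup links.
site_links = {
    "primary": {"capacity_mbps": 100, "load_mbps": 70},
    "backup1": {"capacity_mbps": 45, "load_mbps": 20},
    "backup2": {"capacity_mbps": 45, "load_mbps": 10},
}

for failed_link in site_links:
    print(failed_link, overloaded_after_failure(site_links, failed_link))
```

With these made-up numbers, losing the primary link pushes both backup links past their capacity, which is the situation described above: the network stays up, but the application does not.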

Maintaining an HA network becomes a problem because, as Scott notes, maintenance windows are becoming more difficult to obtain.  With applications running non-stop in VM environments, I expect network maintenance windows to shrink even more.  One way to address the shrinking maintenance windows was described in the HA Network Design presentation (mentioned above) at CiscoLive by John Cavanaugh and his team.  They explained how a dual core network where the two cores are cross-connected can have nearly 100% reliability because each core can be taken down independently of the other core to perform hardware and software maintenance.  Designing a network that will operate correctly with this level of redundancy can be tricky.  You need the right levels of redundancy, appropriate bandwidth at the right places, and the proper configuration of the routing and switching protocols to make it function correctly when a failure occurs or when you need to take down one of the cores for maintenance.
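
A quick back-of-the-envelope calculation shows why two cores help so much.  This sketch uses the standard formula for components in parallel and assumes the cores fail independently and that either core can carry the full load on its own; the 99.9% per-core figure is an illustrative number, not one taken from the presentation.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def parallel_availability(single, copies=2):
    """Availability of `copies` redundant components, assuming independent
    failures and that any one component can carry the full load."""
    return 1 - (1 - single) ** copies


def downtime_minutes_per_year(availability):
    """Expected downtime per year implied by an availability figure."""
    return (1 - availability) * MINUTES_PER_YEAR


one_core = 0.999  # illustrative "three nines" per core
two_cores = parallel_availability(one_core)

print(f"one core:  {one_core:.6f}  {downtime_minutes_per_year(one_core):.1f} min/year")
print(f"two cores: {two_cores:.6f}  {downtime_minutes_per_year(two_cores):.2f} min/year")
```

The independence assumption is the weak point: a shared software bug, a common configuration error, or an undersized cross-connect makes the cores fail together, and the real number will be worse than the formula suggests.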

My favorite topic with HA networks is network management.  A good NMS must tell you when a network failure occurs or when devices or links become overloaded.  Thanks to a good design, an HA network may experience one or more failures without an outage, so you need something that alerts you to the first failure before a second failure causes an outage.  Unfortunately, while there are many NMS packages out there, very few do all the things that are needed to monitor an HA network (I’m talking about network monitoring, not application or server monitoring, which is different in many aspects).
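
The core idea, alerting on lost redundancy rather than only on outages, can be sketched as follows.  This is a simplified illustration and not how any particular NMS works; the group names, device names, and the three alert labels are hypothetical, and a real system would get the status from polling or traps.

```python
def redundancy_alerts(groups, status):
    """Compare each redundancy group against current device/link status.

    `groups` maps a group name to the list of its redundant members.
    `status` maps a member name to "up" or "down" (missing means not up).
    Returns a list of (group, alert) tuples; healthy groups are omitted.
    """
    alerts = []
    for group, members in groups.items():
        up = [m for m in members if status.get(m) == "up"]
        if not up:
            alerts.append((group, "OUTAGE"))
        elif len(up) == 1 and len(members) > 1:
            # Still working, but the next failure takes the service down.
            alerts.append((group, "NO_REDUNDANCY"))
        elif len(up) < len(members):
            alerts.append((group, "DEGRADED"))
    return alerts


# Hypothetical redundancy groups and a snapshot of their status.
groups = {
    "core": ["core-a", "core-b"],
    "site1-wan": ["s1-link1", "s1-link2", "s1-link3"],
    "site2-wan": ["s2-link1", "s2-link2"],
}
status = {
    "core-a": "up", "core-b": "down",
    "s1-link1": "up", "s1-link2": "up", "s1-link3": "down",
    "s2-link1": "up", "s2-link2": "up",
}

for group, alert in redundancy_alerts(groups, status):
    print(f"{group}: {alert}")
```

The "NO_REDUNDANCY" case is the one that matters most in an HA network: users see nothing wrong, so without an alert nobody knows the network is one failure away from an outage.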

I use a combination of tools, with NetMRI providing the configuration and change management functionality.  It works well for this because it has the two fundamental capabilities that are required:

  1. It can analyze configurations to detect exceptions to configuration policies.  For example, are the ACLs for SSH and SNMP access consistent?
  2. Scripts can be run on network devices to execute commands.  When the ACLs for SSH and SNMP need to be updated on hundreds of devices, you want to have a system that can do it for you and create a log of the successful and unsuccessful updates.
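
To make the first capability concrete, here is a minimal sketch of an ACL consistency check.  It is not NetMRI's API or implementation; the function names, device names, ACL number, and addresses are invented for the example, and it assumes the device configurations have already been collected as text containing IOS-style numbered `access-list` lines.

```python
def acl_lines(config_text, acl_id):
    """Extract the lines of a numbered ACL from a device configuration."""
    prefix = f"access-list {acl_id} "
    return [
        line.strip()
        for line in config_text.splitlines()
        if line.strip().startswith(prefix)
    ]


def find_acl_exceptions(configs, acl_id, reference):
    """Return {device: actual ACL lines} for every device whose ACL
    differs from the `reference` policy (order matters in an ACL)."""
    exceptions = {}
    for device, text in configs.items():
        actual = acl_lines(text, acl_id)
        if actual != reference:
            exceptions[device] = actual
    return exceptions


# Hypothetical policy: management access only from the NOC subnet.
policy = [
    "access-list 10 permit 10.1.1.0 0.0.0.255",
    "access-list 10 deny any",
]

# Hypothetical collected configurations (trimmed to the relevant lines).
configs = {
    "rtr-1": "hostname rtr-1\naccess-list 10 permit 10.1.1.0 0.0.0.255\naccess-list 10 deny any\n",
    "rtr-2": "hostname rtr-2\naccess-list 10 permit 10.1.1.0 0.0.0.255\naccess-list 10 permit 192.168.5.0 0.0.0.255\naccess-list 10 deny any\n",
    "rtr-3": "hostname rtr-3\n",
}

for device, actual in find_acl_exceptions(configs, 10, policy).items():
    print(device, "differs from policy:", actual or "ACL missing")
```

The output of a check like this is also the natural input to the second capability: the list of exception devices is the list a script needs to update, and the same check run afterward confirms which updates succeeded.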

These capabilities are critical to a smoothly operating network because the majority of network problems are due to configuration mistakes.  How do you make sure that your network is properly configured and that new changes that need to be rolled out are properly implemented?

-Terry

_____________________________________________________________________________________________

Re-posted with Permission 

NetCraftsmen would like to acknowledge Infoblox for their permission to re-post this article which originally appeared in the Applied Infrastructure blog under http://www.infoblox.com/en/communities/blogs.html
