Real-Time Network Failure Detection

One of our customers had an interesting problem recently that caused a network outage in a critical part of the network. An interface blade was inserted into a Cisco Cat6500, but the blade didn’t pass diagnostics. The blade had been checked out in a lab switch, but for some reason, perhaps a bent pin, it was failing diagnostics in the operational switch chassis. The internal diagnostics detected the failure and shutdown the blade. Unfortunately, there was a problem in the IOS that didn’t properly reset the internal hardware and the forwarding engine stopped working. Strangely, though, the OSPF adjacencies didn’t die. That means that the 6500 could still send and receive packets; only forwarded packets were affected. The result was that the 6500 became a network black hole. Packets that should have transited the 6500 were silently discarded.

Because the interfaces were still up/up and routing adjacencies were maintained, the routing protocols in adjacent routers continued to include the 6500 as a valid next hop. When the outage was reported, it didn’t take long to determine what had changed, once the path through the 6500 was determined. A quick extraction of the card and the 6500’s forwarding engine came back to life.

Now I’m looking at how the network management system could have more quickly identified the problem. Note: I don’t think there would have been a way to prevent the problem, which seemed to originate with the hardware on the inserted interface card.

OSPF retained its adjacencies, and the interfaces were still up/up, so there were no syslog or SNMP traps generated. Since packet origination and reception still worked for OSPF, utilities like BFD (Bi-directional Forwarding Detection) and UDLD (Uni-Directional Link Detection) would not have detected a problem either. That leaves some type of reachability test.

Many network management systems include a ping-based reachability test, sourced from the network management system itself. The problem here is that the failure was on a path between two points in the network that could not have been tested with packets from the network management system. What tests run between routers and switches and can be controlled by a network management system?

Cisco’s built-in ping can be controlled via SNMP, and some network management systems contain functionality to setup a test and verify the results. However, there is no way for such a test to run and send a syslog or SNMP trap when a failure occurs.

The other mechanism is Cisco’s IP SLA. Numerous network management systems can automate the management and data collection of IP SLA tests. Tests can be setup to run from any Cisco device whose IOS supports IP SLA to other Cisco devices or to user endpoints, using a variety of protocols, including ping, DNS, HTTP, and UDP. The network management system needs to be able to monitor the IP SLA results, either through SNMP, or through the receipt of SNMP Traps, if the IP SLA test is configured to send traps when defined operational thresholds are exceeded.

It isn’t necessary to configure a full mesh of tests. It is frequently possible to configure a small subset of tests across the network that will provide visibility into connectivity problems. Setting thresholds for packet loss and jitter can provide useful information about the health of each network path. In most medium to large size enterprises that have reasonable network topologies, a few dozen tests should be sufficient for full testing. That’s a small enough number that it is reasonable to either use a network management tool to configure them, or to build a template and manually configure it on a few central routers. All that’s needed is the ability to generate alerts from the receipt of the SNMP Traps that are generated when a threshold is crossed. Now you have a real-time network connectivity alerting tool.

-Terry

_____________________________________________________________________________________________

Re-posted with Permission

NetCraftsmen would like to acknowledge Infoblox for their permission to re-post this article which originally appeared in the Applied Infrastructure blog under http://www.infoblox.com/en/communities/blogs.html

Re-posted with Permission

Leave a Reply

Related Topics