Replicating at Speed
Working as a consultant at Chesapeake NetCraftsmen is great, because I get to see a lot of interesting network problems. Sharing information about these problems is valuable, because it is better to learn from someone else’s mistakes than to make them ourselves. In this case, it was a self-inflicted denial-of-service attack.
This customer has a few network management products, each of which serves one part of their operation. One NMS, hosted in the main data center, is configured to perform ping reachability testing of devices (many products include this type of functionality). It sends an alert to pagers or mobile phones when a monitored device is no longer reachable. There are 28 people who receive these alerts. The NMS is normally quite well behaved, and the network operations staff is happy with how it works: it provides early indication of network problems when reachability is affected.
One day, the NMS caused a major problem. A key distribution switch had a blade failure that cut off a big chunk of the network, and about 650 devices of various types suddenly became unreachable. The NMS quickly discovered this and started sending alerts. The problem was the number of alerts that were sent: every one of the 28 people on the alert list received an alert for every one of the 650 unreachable devices.
28 people * 650 unreachable devices = 18,200 alerts generated
The paging and SMS systems were suddenly overwhelmed. It turned into an unintentional, self-inflicted DoS attack on the alerting system. It took a long time for the alerting system to process the alerts. New alerts couldn’t make it through, and even if they did, they were lost in the volume of alerts from the failed devices. Fortunately, there were no other critical outages at the same time, but the impact made it clear that the existing mechanism didn’t work well when a key switch failed.
The customer has since taken several steps to mitigate the impact of a similar failure in the future. The first was to suppress alerts about downstream devices. Some NMS products can suppress downstream events, a capability often called “root cause analysis”. Doing so requires knowledge of the topology: the NMS has to know which devices sit behind the failed one. Some NMS products suppress alerts automatically based on topology, but many do not. This customer's NMS allows an alerting hierarchy to be constructed, but only as a manual process. They have now spent the time to build that hierarchy and configure suppression for many of the downstream devices.
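The idea behind topology-based suppression can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular NMS implements it: the `parent_of` map, the device names, and the `root_cause_alerts` function are all hypothetical, and the sketch assumes each device has a single upstream parent.

```python
def root_cause_alerts(parent_of, unreachable):
    """Return the unreachable devices whose upstream parent is still reachable.

    parent_of   -- hypothetical map of device name -> upstream device name
    unreachable -- set of device names that failed the reachability test
    """
    roots = set()
    for device in unreachable:
        parent = parent_of.get(device)
        # Alert only if there is no known parent, or the parent still answers.
        # Otherwise the device is downstream of another failure and is suppressed.
        if parent is None or parent not in unreachable:
            roots.add(device)
    return roots


# Hypothetical topology: a distribution switch feeding two access switches.
parent_of = {
    "dist-sw1": "core-sw1",
    "access-sw1": "dist-sw1",
    "access-sw2": "dist-sw1",
    "ap-17": "access-sw1",
}
unreachable = {"dist-sw1", "access-sw1", "access-sw2", "ap-17"}

print(sorted(root_cause_alerts(parent_of, unreachable)))  # ['dist-sw1']
```

With suppression in place, only the failed distribution switch generates an alert; the devices behind it stay quiet. The cost is that the `parent_of` map has to be kept accurate, which is exactly the maintenance burden of a manually built hierarchy.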
The second remediation step was to reduce the number of people who receive the alerts. Cutting the recipient list in half cut the number of alerts in half. The combination of topology-based suppression and fewer recipients made a big difference in the number of generated alerts. The only remaining problem is maintaining the suppression list. That’s where the NMS products that do automatic suppression are very useful.
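The arithmetic behind both remediation steps is simple enough to check in Python. The figures of 28 recipients and 650 devices come from the incident; the helper name `alert_count` and the "one alerting device after suppression" case are assumptions for illustration, since the actual post-suppression numbers were not recorded.

```python
def alert_count(recipients, alerting_devices):
    """Each alerting device produces one alert per recipient."""
    return recipients * alerting_devices


original = alert_count(28, 650)         # the incident as it happened
fewer_recipients = alert_count(14, 650)  # recipient list cut in half
# Hypothetical best case: suppression leaves only the failed switch alerting.
with_suppression = alert_count(14, 1)

print(original, fewer_recipients, with_suppression)  # 18200 9100 14
```

Halving the recipients helps, but the alert count is still proportional to the number of alerting devices, so suppression is what brings the volume down to a level the paging and SMS systems can handle.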
What happens in your network management alerting system when a key network device fails?
NetCraftsmen would like to acknowledge Infoblox for their permission to re-post this article, which originally appeared in the Applied Infrastructure blog at http://www.infoblox.com/en/communities/blogs.html