Replicating at Speed
Here we are, it’s summer, so it must be time for another rant (ahem, “carefully reasoned polemic”) about Network Management (NM). What should NM tools do for us? How might they help? Let’s explore that a little…
Have you ever had to sit through a boring network management product training? If so, that may be a sign of poor training development, or it might indicate something deeper.
When you’re sitting in training, what do you care about? You’re probably thinking, “How do I use this tool and how does it help me?” A lot of NM training is more oriented around the tool, installation, admin, and driving the individual components of the tool.
Symptom: “Show me the value! Need more coffee so I can stay awake.”
That’s a problem. The vendor hasn’t actually communicated with users of the product, watched their workflows, etc., especially with picky / critical / insightful users. (Those are the adjectives I’d like vendors applying to me, not the four-letter words they may actually be using.)
I had the bright idea of doing use-case oriented Ops training for one site, in one hour “lunch and learn” nuggets. Then, I found it hard to come up with ways the NM products at that company actually enhanced troubleshooting.
That’s not a good sign! And it ties back to the thought that vendors maybe need to do a better job of talking to their customers and their use cases. For that matter, talking to non-customers might be VERY educational (hearing why they think your product is pathetic will certainly indicate what needs fixing — assuming the non-customer is clueful).
All this might also explain why most networking people that I’ve watched go straight into SSH / show command work when troubleshooting.
Detecting outages should be easy, given competent / complete network discovery (and preferably, mapping). Tools have been doing red icons or lines in a log for decades now.
Brown-outs (service slow-downs) are more the problem these days.
Map or path colorization is useful but can be one-dimensional. You can use colors for up / down status or for various performance measures (utilization, error, or discard percentages). But only one of those at a time? So maybe our map needs to let us shift between colorization schemes? Or display multiple link / router metrics at once somehow?
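One way to display multiple link metrics at once is to collapse them into a single worst-case color per link. A minimal sketch, assuming made-up threshold values (real tools would make these configurable):

```python
# Hypothetical sketch: derive one map color per link by taking the worst
# severity across several metrics. Threshold values are illustrative only.
THRESHOLDS = {
    "utilization_pct": [(90, "red"), (70, "yellow")],   # ordered worst-first
    "error_pct":       [(1.0, "red"), (0.1, "yellow")],
    "discard_pct":     [(1.0, "red"), (0.1, "yellow")],
}
SEVERITY = {"green": 0, "yellow": 1, "red": 2}

def link_color(metrics: dict) -> str:
    """Return the worst color any metric earns; default green."""
    worst = "green"
    for name, value in metrics.items():
        for limit, color in THRESHOLDS.get(name, []):
            if value >= limit:
                if SEVERITY[color] > SEVERITY[worst]:
                    worst = color
                break  # thresholds are ordered worst-first, stop at first match
    return worst
```

The design choice here is deliberate: rather than forcing the operator to flip between per-metric colorization schemes, one color answers “is anything wrong on this link?”, with drill-down revealing which metric tripped.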
This is still coming at it from the “got info” rather than the “help me solve this problem” aspect.
Suppose you have almost nothing to work with: the trouble ticket says “App X is slow,” and you know nothing about the application. That’s often the case. Then you’re in the awkward position of having to establish that most of the network is innocent, as in, not the cause of the problem. We’ve all been there: MTTI (Mean Time to Innocence), after which the other teams finally start looking at their piece of the puzzle. It’s far better if everyone checks that their piece is innocent instead of assuming it’s the network.
So “scanning for possible problems” is where maps potentially help us absorb a lot of information quickly (threshold / fault logs less so).
As a result, however, we may end up checking out every glitch that shows up in our network management tool. Not very efficient!
If we have application endpoints (CEO to Internet, or user to application front end), we can then do the path trace / path display thing. If we get colors or other indicators of potential problems with easy drill-down, now maybe we’re troubleshooting faster. We’re certainly not writing down the hops in a traceroute in each direction and poking around one device at a time. I’ve seen folks doing both of those, all too often. Slow!
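What the tool is doing for us in that case can be sketched in a few lines: merge the forward and reverse hop lists and flag anything worth a drill-down, instead of writing hops down by hand. This is a hypothetical sketch; the per-device loss figures would come from whatever polling the tool already does:

```python
# Hypothetical sketch: given traceroute hop lists in each direction (paths
# may be asymmetric) and a per-device loss percentage from polling, flag
# the hops that deserve a drill-down. Threshold is an illustrative default.
def flag_hops(forward, reverse, loss_pct, threshold=1.0):
    """Return (hop, direction) pairs whose loss meets the threshold."""
    suspects = []
    for direction, hops in (("forward", forward), ("reverse", reverse)):
        for hop in hops:
            if loss_pct.get(hop, 0.0) >= threshold:
                suspects.append((hop, direction))
    return suspects
```

Note the asymmetric-path handling: a device can be clean on the forward path and a suspect on the return path, which is exactly what one-device-at-a-time poking tends to miss.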
In the last couple of years, some vendors (NetBrain, Riverbed, then SolarWinds) came up with tools that show paths (traceroute-based, to / from when asymmetric, other wrinkles). This approach is what I call “Google Maps for network management”: colorizing or showing info along the path. Paths are good; they’re easier to code and display flexibly than whole chunks of network.
Maybe the tool could even do that for different metrics, as suggested above.
The Network Field Day presentations by SolarWinds around their path capabilities were interesting, and to some extent made me say, “whoa, it’s not that simple.” Ok, but still, vendors could do something with paths, then do it better.
Licensing and server capacity should be non-issues at this point, for well-written products on up to fairly large networks. StatSeeker and some other Australian products demonstrate great polling and storage speed.
Taking that to the next step, perhaps network neighborhood maps would be good. As in, I think the problem is around this part of the datacenter, or around the CEO… Colorization of maps has done that for a while, but primarily for outages.
The problem with all this is, it lacks specific enough information. Even if you’ve got fairly good trouble ticket information that “Application X is slow”, the network path approach needs some network endpoints to work with. This is a case of network people or tools taking a network focus… sort of like the joke about “I lost my watch over there, but the light is better over here.”
Seriously, why are we stuck with tools that talk only to the network devices, and not anything else? One reason might be having control and ease of getting access. But if you step back and think about it, that assumption is rather limiting.
When I started writing this blog, I thought I wanted application flow tool (AppDynamics, etc.) integration with classic SNMP tools, all on a path diagram or map. However, that may not be the right answer!
For Cloud, we’re not going to have SNMP data about interfaces and devices, unless perhaps we’re using virtual devices. The best we can hope for is actual application or agent data telling us about service delays, one-way or round-trip time, latency, jitter, and packet loss. Which, incidentally, are the things good SD-WAN products should be tracking based on the application traffic flowing through them.
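For concreteness, here is a minimal sketch of computing those metrics from agent probe data, assuming the agent hands us per-probe round-trip times with `None` marking a lost probe. The jitter smoothing follows the RFC 3550 style (divide the change by 16):

```python
# Hypothetical sketch: summarize per-probe RTT samples (ms) into the
# metrics the text names: average RTT, smoothed jitter, and packet loss.
# None in the sample list represents a lost probe.
def summarize_probes(samples):
    received = [s for s in samples if s is not None]
    loss_pct = 100.0 * (len(samples) - len(received)) / len(samples)
    avg_rtt = sum(received) / len(received) if received else None
    jitter = 0.0
    for prev, cur in zip(received, received[1:]):
        jitter += (abs(cur - prev) - jitter) / 16.0  # RFC 3550-style smoothing
    return {"avg_rtt_ms": avg_rtt, "jitter_ms": jitter, "loss_pct": loss_pct}
```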
We probably need a third-party tool to do that, because the app developers may not consistently instrument their application, either internally or using the cloud provider’s tools for tracking application / service health. And you don’t want to have to code your own app to pull info from 100 different app APIs, something commercial vendors are unlikely to support either. It all comes down to whom you want to pay for information about what’s going on.
What I’ve seen so far in the app flow information biz seems very service oriented, but not strongly knowledgeable about where the services are running or what platform they are running on. As network people, we need to know where the various services are located. Are the tools going to evolve to help us with location info? Or does our naming convention or documentation (which is either missing or out of date) somehow have to do that? In other words, more manually tracking things down instead of automation?
Even if your organization isn’t that heavily in the cloud yet, we network people need service location information anyway. I’ve helped a few organizations collect that information by talking to people with institutional knowledge (for lack of consistently useful tooling).
Doing that takes quite a bit of time. I know of one instance where a major travel website was having issues for three weeks while they tracked down a former employee through several subsequent employers.
Conclusion: Document now or be prepared for whoever has to troubleshoot it later to experience pain and possibly major delays just gathering app flow / server information.
So yes, the tools are costly, but so are application discovery by consultant and downtime. The right tool may have the answers you need when something goes awry, whereas hunting things down manually (NetFlow, etc.) takes time.
Imagine that someone moves a service from Cloud A to Cloud B or back to private cloud, increasing latency. That causes the response time of the service, and of services that depend on it, to slow, so cloud auto-scaling kicks in to load-balance across more micro-service instances. The invoice at the end of the month shocks management and breaks the bank (or maxes the credit card, causing the cloud provider to halt all server instances).
How long does it take for someone to notice that latency went up, the load on the relocated service went up as well, and then figure out it’s a service location problem, not an underperforming instance?
How do we troubleshoot something like that, let alone automatically, before high cost ensues?
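A first step toward catching this automatically could be as simple as comparing recent service latency against a known baseline and alerting on a sustained jump. A minimal sketch, with illustrative (untuned) thresholds:

```python
# Hypothetical sketch: flag a sustained latency jump against a baseline,
# the kind of check that could catch a relocated service before the
# auto-scaling invoice arrives. factor and min_points are illustrative.
def latency_jumped(baseline_ms, recent_ms, factor=2.0, min_points=3):
    """True if at least min_points recent samples exceed factor * baseline."""
    high = [s for s in recent_ms if s > factor * baseline_ms]
    return len(high) >= min_points
```

Requiring several high samples (rather than one) is what separates “the service moved” from a transient spike; correlating the alert with rising instance counts would then point at location, not an underperforming instance.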
If you accept the above, then I have some tentative conclusions:
Comments are welcome, both in agreement or constructive disagreement about the above. I enjoy hearing from readers and carrying on deeper discussion via comments. Thanks in advance!
Hashtags: #CiscoChampion #NetworkManagement
Did you know that NetCraftsmen does network / datacenter / security / collaboration design / design review? Or that we have deep UC&C experts on staff, including @ucguerilla? For more information, contact us at firstname.lastname@example.org.
Virgilio “Bong” has sixteen years of professional experience in the IT industry, spanning academe, technical and customer support, pre-sales, post-sales, project management, training, and enablement. He worked in the Cisco Technical Assistance Center (TAC) as a member of the WAN and LAN Switching team. Bong now works for Tech Data as a Field Solutions Architect with a focus on Cisco Security and holds a few Cisco certifications, including Fire Jumper Elite.
John is our CTO and the practice lead for a talented team of consultants focused on designing and delivering scalable and secure infrastructure solutions to customers across multiple industry verticals and technologies. Previously he has held several positions including Executive Director/Chief Architect for Global Network Services at JPMorgan Chase. In that capacity, he led a team managing network architecture and services. Prior to his role at JPMorgan Chase, John was a Distinguished Engineer at Cisco working across a number of verticals including Higher Education, Finance, Retail, Government, and Health Care.
He is an expert in working with groups to identify business needs and align technology strategies to enable business strategies, building in agility and scalability to allow for future changes. John is experienced in the architecture and design of highly available, secure network infrastructure and data centers, and has worked on projects worldwide. He has worked in both the business and regulatory environments for the design and deployment of complex IT infrastructures.