Veriflow at NFD16: Continuous Network Verification
After writing the Seven Deadly Sins of Network Management blog post, a post with a more positive focus seems appropriate. So let’s look at how network management vendors might improve the state of the art.
I’ve been thinking a lot about this subject lately in light of a large proposal we submitted recently to help a company troubleshoot poor application performance. The application in question is in the early to middle stages of a large multi-year web-ification project, to better deliver services to in-house staff and customers. As part of planning, our team discussed tools and how we could get traction on the problem, based on similar prior work. Such troubleshooting is never quick, and the difficulty of getting quick, good answers out of most tools is part of the problem. That led me to start thinking about what the dream tool(s) might do.
Activating my crystal ball…
Here are my thoughts about key aspects of a future network management product:
First, we need better instrumentation. Equipment vendors need to provide SNMP variables indicating when their device is stressed. This is key. We need visibility into “opaque” devices such as firewalls (partial offenders), IDS/IPS and web proxy devices (major offenders), server load balancers, and other devices.
Part of the problem here is firewall and other vendors getting cagey about actual field performance, especially with all the features enabled. Vendors: You need to get with it, too! Allow instrumentation of KPIs (Key Performance Indicators), and be candid about max levels.
I’ve come to really dislike troubleshooting application problems with such “mystery boxes” inline, where the only real measure is comparing throughput to the vendor’s stated specs. (These are always found in a sales slick, so I only trust them to within a factor of 2 to 4, and that with an asterisk – namely, that there may be some ways of using the box that fall outside that performance regime.)
What I’ve ended up doing when troubleshooting is removing the offending box from the data path, to see if that clears up the performance problem. It often does.
Are good performance and stress reporting things you look for before purchase? I sure don’t think they’re part of the vendor marketing! Until they are, good luck getting good reporting!
Even when we get such reporting, we’ll also have to have the network management tools pick up on it. That takes time – and customer requests.
On the network management tool side, any element monitoring has to be automatic, canned, no-brainer. Preferably with warning, alert, and critical threshold levels pre-defined. And preferably ubiquitous, i.e. cognizant of the SNMP MIBs supported by the device vendors.
Ditto for thresholding and alerting: automatic, sensible, and comprehensive rather than limited to a small set of chosen variables. User-adjustable, sure; we need squelch for noisy alarms.
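To make "canned, with squelch" concrete, here is a minimal Python sketch of what I have in mind. The metric names and threshold values are made up for illustration; they are not vendor guidance, and a real product would ship defaults per device type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    warning: float
    alert: float
    critical: float


# Hypothetical canned defaults (illustrative values only).
DEFAULTS = {
    "cpu_util_pct": Thresholds(70.0, 85.0, 95.0),
    "if_util_pct": Thresholds(60.0, 80.0, 95.0),
}


def classify(metric, value, overrides=None):
    """Return 'ok', 'warning', 'alert', or 'critical' for one polled value.

    overrides lets a user raise thresholds for a noisy metric (the squelch)
    without touching the shipped defaults.
    """
    t = (overrides or {}).get(metric) or DEFAULTS[metric]
    if value >= t.critical:
        return "critical"
    if value >= t.alert:
        return "alert"
    if value >= t.warning:
        return "warning"
    return "ok"
```

The point of the design is that the defaults exist before the user does anything, and the override is the exception rather than the starting point.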
We’ve been doing network management with SNMP for 25+ years. Have management software and hardware vendors not learned what settings are appropriate by now?
There will be more data. Lots of it! Causes:
Regarding that last point, if you haven’t noticed, this year I’m keen on User Experience (UX) probes:
As I’ve written previously, UX data trumps element monitoring. It may not provide as much detail… but if it can provide performance data for a few key applications (or the one with performance problems), maybe UX data can help narrow down where the problem is — without the major ordeal (in present products) of fiddling with loading MIBs, setting up polling, thresholds, etc., etc.
To elaborate on that: With many probes, we can look at where UX is good and where it is bad. That might help localize where the performance problem lies. Once we know the rough problem location(s), we can then look at the larger amounts of element data to attempt further localization of the problem.
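A rough sketch of that first localization pass, in Python. It assumes each probe result arrives as a (site, status) pair with status one of "good", "poor", or "failed"; those labels and the site names are my own invention, not any product's data model.

```python
from collections import defaultdict


def bad_fraction_by_site(results):
    """Rank sites by the fraction of probe results that were not 'good'.

    results: iterable of (site, status) pairs.
    Returns a list of (fraction_bad, site), worst site first.
    """
    totals = defaultdict(int)
    bad = defaultdict(int)
    for site, status in results:
        totals[site] += 1
        if status != "good":
            bad[site] += 1
    return sorted(((bad[s] / totals[s], s) for s in totals), reverse=True)
```

The sites at the top of that list are where I would start pulling the more detailed element data.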
Yes, Cisco ACI packet loss reporting may do something similar – in the datacenter, but not elsewhere. IP SLA in Cisco routers does something similar – but some organizations require change windows, or run it on add-on older routers, for fear (based on experience) of clobbering production routers by turning on too much IP SLA. Also, it’s unclear how to cost-effectively use IP SLA to validate UX via a specific WLAN AP.
UX probes can also help us objectively report improvement or non-improvement, when we tweak application, server, storage, or network. That beats having an FNG (“new hire”) testing with a laptop at several times and/or locations.
Who thinks current correlation tools are costly, complex, and maybe don’t do all that much? Did anyone not raise his or her hand?
I’m hung up on the idea that data as to Good/Poor/Failed performance, plus correlation, could be incredibly useful. Do current (costly) tools do that well? Automatically, without a lot of work?
I see this working well if tied to a map or network diagram of some kind. And I don’t mean HP OpenView or SolarWinds maps turning colors. Or maybe I do, but a modern version, which tracks performance as well as up/down state. The one thing I value about colorized maps is that they provide scoping information at a glance. They’re terrible for letting you know there’s a problem, but scope is useful when troubleshooting begins.
What struck me over the last year (thanks, NetBeez) is how to easily solve the issue of tying a physical or virtual probe to the nearby network elements. I’ve got a large customer now doing a massive Gigamon deployment, and one “lucky” person must diagram which ports connect to which locations in the network. That’s a lot of work. If you deploy UX probes (virtual or physical), something similar might seem to be needed.
Here’s how to automate tracking to find out where a given probe lies. If a probe does traceroute, the management platform can tie that to the first hop router. That more or less provides position on the network. If the MAC is correlated with switch MAC table data, the location can be more precisely determined. Done!
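As a sketch of that correlation, assuming the platform has already collected the probe's traceroute hops and each switch's MAC table (the data shapes and names below are hypothetical). One wrinkle worth handling: a MAC also shows up on the uplink ports of upstream switches, so known uplinks need to be excluded to find the real access port.

```python
def locate_probe(probe_mac, traceroute_hops, mac_tables, uplink_ports=frozenset()):
    """Estimate where a probe sits in the network.

    traceroute_hops: hop IPs in order, as reported by the probe.
    mac_tables: {switch_name: {lowercase_mac: port_name}} from discovery.
    uplink_ports: set of (switch_name, port_name) known to be inter-switch links.
    """
    first_hop = traceroute_hops[0] if traceroute_hops else None
    access = None
    for switch, table in mac_tables.items():
        port = table.get(probe_mac.lower())
        # Skip sightings on uplinks; those only mean the MAC is somewhere downstream.
        if port is not None and (switch, port) not in uplink_ports:
            access = (switch, port)
            break
    return {"first_hop_router": first_hop, "access_port": access}
```

If no non-uplink port is found, the first-hop router alone still gives the rough position.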
If the management platform does a good job of discovering the network, it can then colorize a map or diagram with the probe data. Use red for down, orange for poor performance, green for no known problems. I suspect that is a far simpler programming task than coding automatic polling of a bazillion MIB variables, thresholds, etc. (Well, a few tens of variables; the rest may well not be all that useful.)
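To back up the "far simpler" claim, here is roughly how little code the coloring step needs, as a Python sketch. It assumes each probe reports "good", "poor", or "failed" and has already been tied to a map node; the status labels and data shapes are my assumptions.

```python
COLOR = {"failed": "red", "poor": "orange", "good": "green"}
SEVERITY = {"good": 0, "poor": 1, "failed": 2}


def colorize(probe_status, probe_location):
    """Map each node to a color based on the worst status of its probes.

    probe_status: {probe_id: 'good' | 'poor' | 'failed'}
    probe_location: {probe_id: node_name}; probes with no known location are skipped.
    """
    worst = {}
    for probe, status in probe_status.items():
        node = probe_location.get(probe)
        if node is None:
            continue
        if node not in worst or SEVERITY[status] > SEVERITY[worst[node]]:
            worst[node] = status
    return {node: COLOR[status] for node, status in worst.items()}
```

Taking the worst status per node is a judgment call; a product might prefer a majority vote so that one flaky probe does not paint a whole site orange.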
The sorts of path diagrams that are used in the AppNeta PathView and ThousandEyes cloud reporting platforms might be very useful with this. For some reason, they look like “railroad track diagrams” to me. Is that a good name for them?
Dare I say that I expect such capabilities in a good future SDN controller platform?
Documented Capacity Planning
Thought of the day: Isn’t it odd that to snapshot and trend capacity in a documented way, we have to manually capture data points from a graph into Excel? The problem: long-term data rollup in most products averages traffic peaks out of existence, unless something like 95th-percentile data is saved as well. So quarterly or annual projections are not reproducible for accountability unless the data is manually re-entered.
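A small Python illustration of why the average alone is not enough. The numbers are invented: a day of 288 five-minute samples on a link that idles at 10 Mbps but runs at 90 Mbps for a bit over two hours. The percentile uses the simple nearest-rank method; products may compute it differently.

```python
def percentile_95(samples):
    """Nearest-rank 95th percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    # Integer form of ceil(0.95 * n), avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def rollup(samples):
    """Summarize one rollup interval, keeping the peak information."""
    return {
        "avg": sum(samples) / len(samples),
        "p95": percentile_95(samples),
        "max": max(samples),
    }


# Hypothetical day: 260 quiet samples at 10 Mbps, 28 busy samples at 90 Mbps.
day = [10.0] * 260 + [90.0] * 28
summary = rollup(day)
```

Here the daily average comes out under 18 Mbps, while the 95th percentile is 90 Mbps. Storing the percentile alongside the average in the rollup is what would make a capacity projection reproducible later.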
I’d like to especially invite comments to this blog.
What do you think of the above ideas? Are there products you like, that do some of the things above? Are there any that do a great job of out-of-the-box polling and thresholding?
I’ve seen a lot of network management tools over the years, but tools evolve, and even if I were working with them full-time, I doubt I could experience the full capabilities (and drawbacks) of more than a few. What devices belong on the “list of shame” for lack of useful SNMP information? What devices allow good easy data export?
Hashtags: #NetworkManagement, #FutureNetworkManagement, #UserExperienceManagement