Veriflow at NFD16: Continuous Network Verification
I’ve been enjoying working with a fairly large company with some excellent staff, reviewing network and UC management tools, processes, and gaps. All that network management tool immersion has triggered some high-level questions in my brain, and I’d like to share some of those with you.
The immersion means I’ll likely have more to say on network management tools in future posts.
Those of you who have been following my blog posts know that I find network management tools intensely frustrating, basically because in 25+ years they seem to have evolved so little. I understand the economics behind network management products: it is a small market compared to servers and now security, so budgets cap R&D. Even so, I’m left feeling that surely we could do better, a lot better!
I’d like to go beyond the basic topics covered in Succeeding with Network Management Tools, which is about getting the most out of your tools. Today I’m asking: Are the tools even doing the right things?
Most organizations I’ve worked with lately, large or small, pretty much match a pattern: little spending on network management tools, existing tools under-maintained, and few or no people dedicated to network management.
The organization I’m working with now has several sharp people tuning alerts and getting them mastered, surmounting discovery challenges, putting revamped processes in place, considering adding event correlation rules, etc. All of that is refreshing to see. A lot of effort has gone into the tool tuning, and it is all paying off for the customer.
That doesn’t seem to be something that many other organizations can achieve, due to a lack of willpower, budget, staffing, and/or time. Perhaps ROI or value provided plays a part as well. That’s a theme for a follow-on blog post: what are network management tools really good for? Where do they provide value, and where is the value less apparent, or not there?
I have become very leery of the word “correlation” with regard to network management tools. It is certainly possible and useful, but what comes built-in tends to be rather minimal. A toolkit is not a solution. Watch out for vendors who say their tool “can” do something, when what they mean is, “It wasn’t built to do that, but with a lot of work you might be able to get it to do that.” You have to ask sharp questions sometimes to find out what the product includes out of the box, compared to what you can build with it.
Let’s now turn to the big philosophical question I want to pose:
Should we be concerned about:
Bottom line: Is all this labor sustainable?
Thinking about the manual items above, the whole approach is rather ironic. We turn on more and more device traps, process syslog and send some of it as alerts, and do polling and thresholding to get more alerts, all so we can be aware of problems. Except that we’re drowning in alerts, which hides the problems. So then we end up going down the alert tuning path (Do I need this? Is it actionable?) and/or the correlation path, making more and more work for ourselves. At least, that’s how it looks when I step back from the details to gain some perspective.
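To make the tuning burden concrete, here is a minimal Python sketch of one of the simplest noise-reduction steps: suppressing repeats of the same alert while it keeps firing. Everything in it is an assumption for illustration (the tuple layout, the device and alert names, the 300-second quiet period); it is not how any particular product works.

```python
def suppress_repeats(alerts, window=300):
    """Keep an alert only if the same (device, kind) pair has been
    quiet for at least `window` seconds; otherwise treat it as a repeat.

    `alerts` is a list of (timestamp_seconds, device, kind) tuples.
    """
    last_seen = {}  # (device, kind) -> timestamp of the most recent occurrence
    kept = []
    for ts, device, kind in sorted(alerts):
        key = (device, kind)
        if key not in last_seen or ts - last_seen[key] >= window:
            kept.append((ts, device, kind))
        # Track every occurrence, so a continuously flapping alert stays suppressed.
        last_seen[key] = ts
    return kept


# Hypothetical alert feed: one flapping link plus an unrelated CPU alert.
SAMPLE_ALERTS = [
    (0, "dist1", "link-flap"),
    (60, "dist1", "link-flap"),
    (120, "dist1", "link-flap"),
    (30, "core1", "high-cpu"),
    (1000, "dist1", "link-flap"),
]

if __name__ == "__main__":
    for alert in suppress_repeats(SAMPLE_ALERTS):
        print(alert)
```

Even this toy version forces decisions (how long is the quiet period? per device or per interface?), which is exactly the kind of tuning labor in question.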
For the most part, I think network management customers have “voted with their feet”, i.e. abandoned labor-intensive tools in a loud “NO WAY” response. Surely there has to be a better way to do things?
My answer in a previous blog post was that the vendor must do it for you: automate everything. Yet a lot of the above items could be hard for them to program. They are also hard to shop for: in comparing tools, no vendor is going to give you the details of what SNMP variables they poll, their thresholds and alert criticalities, what their correlation secret sauce does and doesn’t do, etc. So how do you detect shallow versus deep correlation when looking at tools, particularly when trying to review tools without an inordinate expenditure of time and effort? In short, even if you’re trying to buy a tool that automates things well, how would you recognize it?
There are at least two answers appearing on the horizon.
One is machine learning. For example, the co-founder of NetCool is behind a company called Moogsoft, which claims to do rule-less correlation and alerting via machine intelligence. Automatic – good! I’d like to see an objective trial to see how well it does. I can believe it might find unique combinations of things to alarm about; that it might spot failure trends early; and/or might cut a lot of the noise down. Their estimates of reduction in alert counts seem to vary from 50% to 90%. Does it also alarm about important events as well as the current tools do? Of course, if you’re having trouble maintaining and tuning to keep your current tools working well, that may not be a high bar to surmount.
The other answer is for applications to display events on a network diagram (or part of one) with changes and alert counts (or auto-selected alert types and counts) superimposed on the diagram.
Doing this might look like a Google road map that uses colors and icons to indicate real-time traffic conditions, accident and police activity, etc. For the network, I’d actually omit the traffic, and use colors to show alert counts and detected performance problems. If you have packet capture probes, you could even detect things like retransmissions, high latency, slow server responses, and more.
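As a rough illustration of the idea, here is a minimal Python sketch that rolls per-device alert counts up into map colors. The device names, links, alert records, and thresholds are all made up for the example; a real tool would pull them from discovery and its event store, and would render an actual diagram rather than print text.

```python
from collections import Counter

# Hypothetical topology: links between devices (names are illustrative only).
LINKS = [
    ("core1", "dist1"),
    ("core1", "dist2"),
    ("dist1", "access1"),
    ("dist2", "access2"),
]

# Hypothetical alert feed: (device, alert type).
ALERTS = [
    ("dist1", "link-flap"),
    ("dist1", "high-cpu"),
    ("dist1", "link-flap"),
    ("access1", "retransmissions"),
]

SEVERITY = {"green": 0, "yellow": 1, "red": 2}


def color_for(count, warn=1, crit=3):
    """Map an alert count to a color; the thresholds are arbitrary assumptions."""
    if count >= crit:
        return "red"
    if count >= warn:
        return "yellow"
    return "green"


def colorize(links, alerts):
    """Return {device: (alert_count, color)} for every device on the map."""
    counts = Counter(device for device, _ in alerts)
    devices = sorted({d for link in links for d in link})
    return {d: (counts[d], color_for(counts[d])) for d in devices}


def link_color(link, nodes):
    """Color a link by the worse of its two endpoints."""
    a, b = link
    return max(nodes[a][1], nodes[b][1], key=SEVERITY.get)


if __name__ == "__main__":
    nodes = colorize(LINKS, ALERTS)
    for device, (count, color) in nodes.items():
        print(f"{device:8} alerts={count} color={color}")
    for link in LINKS:
        print(f"{link[0]}--{link[1]}: {link_color(link, nodes)}")
```

The point of the sketch is that the roll-up itself is trivial; the value comes from putting the result on a map, where the viewer can see at a glance which part of the network the alerts cluster around.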
Which of these is a better way to convey lots of information and let the viewer correlate it easily: a Google traffic map, or a radio listing of traffic problems? With radio traffic reports, you have to think about where each one is and do more mental processing. I do have to note, radio is probably safer when driving, unless you look away from the road trying to mentally visualize the road map.
Multiple sources of information integrated via a map: What’s not to like?
Some vendors seem to be going down this path:
There is another side to this. We really need to tie applications to the network, if for no other reason than eliminating the network as a possible cause. Having maps and displaying network issues and changes, and server problems – to me, that’s the Holy Grail right now. Hopefully not quite as unreachable a goal! Seeing a given application’s server endpoints and flows on top of a colorized/iconized network map – bonus!
There are a lot of server/application centric tools, which I’ve been exploring but have less experience with. And there are network-centric tools. They all need to be tied together, and made easier to use.
The other thing I’ve been noticing is how non-graphical products had gotten before the current crop of dashboards and graphical elements started appearing (it is springtime, after all). A linear listing of alerts seems to be a common, rather boring, and uninformative GUI. It is easy to code, apparently far easier than responsive smart maps in a web GUI.
Dashboards with drill-down capabilities also seem to be appearing across various products. Sometimes they’re just event-oriented. I’ll grant that heat map matrices represent another display paradigm that is potentially relatively easy to code and fairly useful to the user. So while I personally like network maps, dashboards may well be part of the answer too.
I get the impression Cisco is really rethinking their network management products, given how they’re pushing automation and Enterprise Network Virtualization. With the focus on rapid configuration at large scale via the APIC-EM controller, can rapid and increased information collection be far behind?
Comments are welcome, whether in agreement or constructive disagreement with the above. I enjoy hearing from readers and carrying on deeper discussion via comments. Thanks in advance!