Some Best Operational Process Practices
What’s eating your network? What is (quietly) killing performance? What performance items should you be watching, but probably are not?
I’m a big fan of appropriate network management data, with the right tool(s). If you don’t monitor absolutely every interface, and capture historical data, you’re flying blind. Most sites using the popular SolarWinds products do not do this, because of the cost of per-interface licenses. Many products charge per node (network device), and that includes some number of interfaces per device, making per-interface costs less of a concern.
Unfortunately, some products do not, and charge prohibitively high per-interface fees instead.
If you let your interface licensing costs drive your management strategy, you’ve got the wrong product – or you like wasting your time picking which interfaces to manage, and wasting staff time with missing data. My experience is that most Ops staff lose patience and stop using the product when they repeatedly experience data gaps. The product doesn’t do any good if all the potential consumers have abandoned it.
For me, the top items for network equipment are utilization, broadcasts, errors, and discards. These all need to be displayed in Top-N fashion as percentages of total interface traffic, separately for inbound and outbound traffic. Percentages are needed so you can tell whether a Big Scary Number actually matters, or is a tiny fraction of all traffic.
We need to separate inbound and outbound traffic because when you add or average them, necessary information is lost. For example: does 100% mean 99% in and 1% out, or 50-50? Or would the maximum value of the single number be 200%? Most vendors provide no clarity whatsoever; they just hit you with “this is the utilization.” Useless. Separate numbers: much more valuable!
Context is key here: not just the number, but is it a percentage or what? If it’s a raw number, over what period of time is the average/maximum/minimum (since all SNMP-derived stats are inherently averages over some time period)?
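As a rough sketch of the per-direction arithmetic, here is how an average utilization percentage can be derived from two samples of an octet counter. The function name, the sample values, and the 300-second polling interval are illustrative assumptions, not taken from any particular product; the point is that the result is an average over the polling interval, and that inbound and outbound are computed separately.

```python
def utilization_pct(octets_start, octets_end, interval_s, speed_bps, counter_bits=64):
    """Average utilization (%) for ONE direction over one polling interval.

    octets_start / octets_end are two samples of the same octet counter
    (for example, an SNMP inbound or outbound octet counter), taken
    interval_s seconds apart. All names here are hypothetical.
    """
    if interval_s <= 0 or speed_bps <= 0:
        raise ValueError("interval_s and speed_bps must be positive")
    # Modulo arithmetic absorbs a single counter wrap between the two polls.
    delta_octets = (octets_end - octets_start) % (2 ** counter_bits)
    return delta_octets * 8 * 100.0 / (interval_s * speed_bps)


# Hypothetical samples: two polls 300 seconds apart on a 1 Gbps interface.
in_pct = utilization_pct(0, 37_125_000_000, 300, 1_000_000_000)
out_pct = utilization_pct(0, 375_000_000, 300, 1_000_000_000)

# Reported separately, the asymmetry is obvious; summed or averaged, it is lost.
print(f"in: {in_pct:.1f}%  out: {out_pct:.1f}%")
```

With these made-up samples the interface is about 99% utilized inbound and 1% outbound, which a single combined "utilization" figure would hide. A shorter polling interval gives a less smoothed average, at the cost of more data to store.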
I also want history, as in graphs, so I can look to see what’s changed. Preferably with easy viewing of last day/week/month/quarter/year. (Disk space has been cheap for 25 years now, so why are so many products stingy about storing data? Bad database storage format hindering rapid retrieval?) I like sliders where I can pick start and stop dates and times, rather than calendar picking or typing dates and times.
If we have a slow application and I see that errors went way up around the right time, the two items might be related. If broadcasts go through the ceiling, well, that counter also counts L2 multicast, and it might just be BPDUs. For what it’s worth, I usually ignore ports with less than, say, 50 Kbps total as far as high broadcast percentages, because that’s what you’d see with an STP blocked port, where the only traffic is BPDUs.
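The low-traffic filter described above might look something like the following. This is a minimal sketch: the dictionary field names, the sample ports, and the 20% reporting threshold are all assumptions made for illustration; only the 50 Kbps floor comes from the text.

```python
def high_broadcast_ports(ports, min_bps=50_000, pct_threshold=20.0):
    """Return (name, percent) pairs for ports with a high non-unicast share.

    Ports carrying less than min_bps in total are skipped, since a blocked
    STP port that only sees BPDUs would otherwise always look like 100%.
    pct_threshold is a hypothetical value; tune it for your own network.
    """
    flagged = []
    for port in ports:
        total_pkts = port["ucast_pkts"] + port["nucast_pkts"]
        if total_pkts == 0 or port["total_bps"] < min_bps:
            continue  # idle or near-idle port: percentage is not meaningful
        pct = 100.0 * port["nucast_pkts"] / total_pkts
        if pct >= pct_threshold:
            flagged.append((port["name"], round(pct, 1)))
    # Top-N style ordering: worst offenders first.
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical per-interval samples for three ports.
sample_ports = [
    {"name": "Gi1/0/1", "total_bps": 2_000, "ucast_pkts": 0, "nucast_pkts": 150},
    {"name": "Gi1/0/2", "total_bps": 40_000_000, "ucast_pkts": 600_000, "nucast_pkts": 400_000},
    {"name": "Gi1/0/3", "total_bps": 90_000_000, "ucast_pkts": 990_000, "nucast_pkts": 10_000},
]
flagged_ports = high_broadcast_ports(sample_ports)
print(flagged_ports)
```

In this example the first port is 100% non-unicast but is ignored because it carries almost no traffic, the second is reported at 40%, and the third is well under the threshold.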
If I see errors over 0.001% (that’s not a typo), I would strongly suspect a bad cable and go fix it, immediately. Most sites only react to 1% or even 10% error rates. That’s way higher than you should be tolerating on modern cables. Exercise for the reader: start with a BER (Bit Error Rate) of, say, 10^-12, and figure out the probability of losing a packet. It is very low.
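One way to work the exercise, as a sketch under a simplifying assumption: bit errors are independent, so a packet of n bits survives with probability (1 - BER)^n. The 1500-byte packet size and the function name are illustrative choices, not from the text.

```python
import math


def packet_error_probability(ber, packet_bytes):
    """Probability that at least one bit in a packet is corrupted.

    Assumes independent bit errors, so P = 1 - (1 - ber) ** bits.
    log1p/expm1 keep the result accurate when ber is tiny.
    """
    if not 0.0 <= ber < 1.0:
        raise ValueError("ber must be in [0, 1)")
    bits = packet_bytes * 8
    return -math.expm1(bits * math.log1p(-ber))


# Hypothetical case: 1500-byte packets on a link with a BER of 1e-12.
p_loss = packet_error_probability(1e-12, 1500)
print(f"packet error probability: {p_loss:.3e} ({p_loss * 100:.7f}%)")

# Compare with the 0.001% alarm threshold, expressed as a fraction.
threshold = 0.001 / 100
print(f"threshold is about {threshold / p_loss:.0f}x the expected rate")
```

Under these assumptions the expected error rate is about 1.2e-08 per packet, or roughly 0.0000012%, which is several hundred times below the 0.001% threshold. A port that reaches 0.001% is therefore far outside what a healthy cable should produce.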
I just had a very interesting discussion with the CCAr (very rare Cisco Architect) who will soon be starting work with us. He’s had two situations where high bit error rates at gig speeds led to occasional double bit errors fooling checksums, causing corrupted data, and lawsuits. The tie-in: when you see high packet loss, you are probably also failing to notice subtly corrupted data that slipped past the CRC check and got written to a database.
Discards are a great item to watch. Inbound, they usually indicate a bad checksum or other low-level problem. Most routers can keep up with inbound traffic. Outbound, discards indicate the router or switch had to drop the packet, likely because it couldn’t keep up. On switches like Nexus with VOQs, you may have to dig a bit deeper to spot internal fabric drops. I expect discards when traffic comes in a 10-Gbps interface and is routed out a 1-Gbps interface. There’s no way the router can make that work.
But discards are also a great smoking gun that may tie back to poor application performance due to device internals. For instance, some of the ISR routers had 1-Gbps interfaces but the processor couldn’t forward a full 1 Gbps of traffic. Discards would be one indication you were running out of router horsepower there. Another place the NetCraftsmen team has seen them is with a 6148A linecard in a 6500 switch. The 6148 used a chip with 1 Gbps total throughput to drive 8 adjacent 1 Gbps ports. Add several newer and busy servers into a block of ports, and you would be seeing discards; the linecard just could not forward, say, 3 Gbps of traffic.
In general, the Cisco datasheets lately have provided IPsec crypto throughput numbers and overall packet forwarding numbers. Always bear in mind your mileage may vary, and test any proposed design changes. One potential gotcha is that (in my recollection) the IPsec max throughput may run about half the pure forwarding throughput. So make sure you check the datasheet, any Cisco Validated Design documents, and even better, test throughput in the lab with a representative traffic mix.
That leads us to a great item that can eat your network: oversubscription. The previous discussion about discards indicated a couple of ways in which oversubscription can happen. The moral there is, you need to know the performance limits and some of the performance-related internals of your hardware. If you buy a Nexus FEX with 48 x 10 Gbps edge ports and 4 x 40 Gbps uplinks, the oversubscription is 480 : 160 = 3 : 1. That is, if you connect to 16 of the edge ports you’re probably fine, but if you connect to more of the 48 x 10 Gbps ports, your aggregate traffic at any moment in time had best not exceed 160 Gbps or you will be discarding.
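The FEX arithmetic above can be sketched in a few lines. The function and variable names are hypothetical; the port counts and speeds are the ones from the example.

```python
from fractions import Fraction


def oversubscription(edge_ports, edge_gbps, uplink_ports, uplink_gbps):
    """Return (ratio, edge_capacity_gbps, uplink_capacity_gbps)."""
    edge_capacity = edge_ports * edge_gbps
    uplink_capacity = uplink_ports * uplink_gbps
    return Fraction(edge_capacity, uplink_capacity), edge_capacity, uplink_capacity


# 48 x 10 Gbps edge ports, 4 x 40 Gbps uplinks.
ratio, edge_gbps_total, uplink_gbps_total = oversubscription(48, 10, 4, 40)
print(f"{edge_gbps_total} : {uplink_gbps_total} = {ratio.numerator} : {ratio.denominator}")

# How many edge ports can run at full line rate before the uplinks saturate?
line_rate_ports = uplink_gbps_total // 10
print(f"edge ports that fit within uplink capacity: {line_rate_ports}")
```

This gives 480 : 160 = 3 : 1, and 16 edge ports at full line rate before the uplinks are the bottleneck. The same calculation applies to any linecard or fabric stage once you know its real internal throughput.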
Today’s networks are complicated and a discussion of key network-performance items to watch for is inevitably a long one. So I’ll continue with a future post about the value of monitoring your syslog data, debug, and more.
Comments are welcome, whether in agreement or constructive disagreement with the above. I enjoy hearing from readers and carrying on deeper discussion via comments. Thanks in advance!