You may have noticed that NetCraftsmen does various types of assessments (network, security, etc.). One of the things I have noticed in doing them is that customer operational practices differ considerably.
This blog goes into some of the processes you should be performing periodically, and improving each time you repeat them. It also covers a couple of practices that are simply good operational habits.
There is a lesson learned here, which applies to a number of areas:
You can put in the time up front in a controlled way, with incremental process improvement every time you do the task, or you can do roughly the same work poorly as a hasty fire drill, over and over again.
For those who say, “but I don’t have the time”, I hear you. This is a “pay me now or pay me later” thing, however. Make the time to make your life better in the future.
To keep the size of this blog under control, let’s focus on network-related tasks. Each technical area has its own. For servers and VMs, good backups or clones with a reliable process, and a good backup verification process, come to mind. You don’t want to find out that the backup you badly need has been failing and nobody noticed, or that every backup you have was corrupted due to something you failed to consider. When cryptolocker hits and the backups all seem to be corrupt, you don’t want to be That Person.
Much of what follows consists of periodic maintenance tasks. If you don’t manage them, they won’t get done, or won’t get done regularly, especially if you’re as busy as most networking people seem to be.
What works for me is (a) building a tracking spreadsheet, with tasks and when last performed, and (b) getting them into a calendar, perhaps a listing of weeks in the year, which ones are maintenance weeks and which ones are change-window weeks, etc.
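The tracking spreadsheet idea can be sketched in a few lines of Python. This is a minimal illustration, not a tool recommendation: the task names, intervals, and dates below are hypothetical, and in practice the rows would come from your own spreadsheet export.

```python
from datetime import date, timedelta

# Hypothetical tracking rows: (task name, interval in days, date last performed).
TASKS = [
    ("HA failover verification", 365, date(2017, 3, 1)),
    ("Inventory synch-up", 365, date(2018, 1, 15)),
    ("Code refresh to N-1", 182, date(2017, 11, 1)),
]


def overdue_tasks(tasks, today):
    """Return (task, days_overdue) for each task past its due date, most overdue first."""
    late = []
    for name, interval_days, last_done in tasks:
        due = last_done + timedelta(days=interval_days)
        if today > due:
            late.append((name, (today - due).days))
    return sorted(late, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, days in overdue_tasks(TASKS, date.today()):
        print(f"{name}: {days} days overdue")
```

Running something like this weekly, or simply sorting the spreadsheet by due date, keeps the overdue items visible instead of relying on memory.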
One item that I’ve seen done at three big sites, and recommended elsewhere, is periodic verification of High Availability (HA). This might be done annually or every two years, depending on how often you’ve been burned by failover failing to work.
The point of this exercise is that planning is human and error-prone, and configurations on devices drift over time. The assumption is you probably don’t want to find out the hard way (downtime) that a High Availability configuration is broken. If that matters enough, you choose to incur the human and other costs of doing testing.
Implementing this entails going through the network diagrams, identifying points where failover should occur, and checking configurations to see whether the feature appears to be configured correctly. Best Practice is to then actually trigger failover and failback to confirm that it works. Generally, this is batched up, with some testing done in each available change window. This is usually low risk, but may disrupt service while it is being tested. The key point here is to identify failover failures on your schedule, and not on Mr. Murphy’s (as in “Murphy’s Law”).
Places and HA features you might test include HSRP, static routes to HSRP or Firewall VRRP VIPs (including making sure the target is the VIP not a “real” device IP), switch stack member failure, routing failover between two WAN routers and links, etc.
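One of the checks above, confirming that static routes point at the VIP rather than a “real” device IP, lends itself to a small script. The sketch below assumes you have already extracted the static routes and the list of HSRP / VRRP virtual IPs from your configs; the function name and the sample addresses are hypothetical.

```python
import ipaddress


def routes_not_using_vip(static_routes, vips):
    """Return the static routes whose next hop is not one of the known virtual IPs.

    static_routes: iterable of (prefix, next_hop) strings.
    vips: iterable of virtual IP strings (HSRP / VRRP VIPs).
    """
    vip_set = {ipaddress.ip_address(v) for v in vips}
    return [
        (prefix, next_hop)
        for prefix, next_hop in static_routes
        if ipaddress.ip_address(next_hop) not in vip_set
    ]


if __name__ == "__main__":
    # Hypothetical sample data: 10.1.1.1 is the VIP, 10.1.1.2 is a physical firewall address.
    routes = [("10.20.0.0/16", "10.1.1.1"), ("10.30.0.0/16", "10.1.1.2")]
    for prefix, next_hop in routes_not_using_vip(routes, ["10.1.1.1"]):
        print(f"{prefix} points at {next_hop}, which is not a known VIP")
```

A flagged route is not necessarily wrong, but it is worth a look before the failover test rather than during it.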
If you’re a video-phile, some of us had a chat about HA and Resiliency with Network Collective.
To me, having automatically archived configurations is a great practice. A variety of tools do this, often triggered off the Cisco syslog message that comes when you exit config mode. Examples include SolarWinds NCM, Cisco Prime Infrastructure / APIC-EM, NetMRI, and others.
This enables config comparison when there’s an outage, for “what changed?” – quite often the first question asked in troubleshooting. It also enables rollback.
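The “what changed?” comparison can be illustrated with Python’s standard `difflib` module. This is only a sketch of the idea; the archiving tools do this for you, and the two config snippets in the demo are made-up examples.

```python
import difflib


def config_diff(archived_config, running_config):
    """Return unified-diff lines between an archived config and the current one."""
    return list(
        difflib.unified_diff(
            archived_config.splitlines(),
            running_config.splitlines(),
            fromfile="archived",
            tofile="running",
            lineterm="",
        )
    )


if __name__ == "__main__":
    # Hypothetical before/after snippets for illustration only.
    before = "interface Gi0/1\n description uplink\n speed 1000\n"
    after = "interface Gi0/1\n description uplink\n speed 100\n"
    print("\n".join(config_diff(before, after)))
```

Lines starting with `-` were in the archived copy only, and lines starting with `+` are new in the current config.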
I also like an audit trail (who made the change) for educational / process improvement purposes.
I personally prefer to have an encrypted ZIP of current configs on my laptop as well, to cover for the case where I can’t get to the archive. That’s handy when remote access or the path to the file share isn’t working.
I really like having a robust network device inventory, including at least device name, IP addresses, hardware modules, serial number, current IOS / OS version, and SmartNet or other support contract info. Everything you might want to know.
One reason this is key: you can synch the device inventory in your config management tool, and the inventories in your other network management tools, to the “master” inventory. If you have auto-discovery turned on, which you should, then the tools may catch devices you forgot to add to your inventory. The tools named above can provide the information for the inventory.
By the way, you do use network auto-discovery, don’t you? We’re well past the dark ages when we had to worry about SNMP causing devices to reboot, or ‘massive’ network traffic, aren’t we? Yes, the licensing of SolarWinds or other products forces manual management of devices. That’s inefficient.
I encounter a lot of sites with tools with different lists of devices in them. That’s why I think a periodic (annual?) synch-up of inventories is needed — so that you don’t discover a gap in the middle of troubleshooting.
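A periodic synch-up boils down to set comparisons between the master inventory and each tool’s device list. The sketch below assumes you can export device names from each tool as plain lists; the tool names and device names shown are placeholders.

```python
def inventory_gaps(master, tools):
    """Compare each tool's device list to the master inventory.

    master: iterable of device names.
    tools: dict mapping tool name -> iterable of device names.
    Device names are compared case-insensitively.
    """
    master_set = {name.lower() for name in master}
    report = {}
    for tool, devices in tools.items():
        tool_set = {name.lower() for name in devices}
        report[tool] = {
            "missing_from_tool": sorted(master_set - tool_set),
            "not_in_master": sorted(tool_set - master_set),
        }
    return report


if __name__ == "__main__":
    # Hypothetical exports for illustration.
    master = ["nyc-core-1", "nyc-core-2", "bos-wan-1"]
    tools = {"config-archive": ["NYC-CORE-1", "bos-wan-1", "lab-sw-9"]}
    print(inventory_gaps(master, tools))
```

Devices in `not_in_master` are often the ones auto-discovery found and nobody recorded; devices in `missing_from_tool` are your blind spots.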
For those who know me, I STRONGLY believe you should be managing every device and every interface. Blind spots are time-wasters. If it costs too much to license this, then you have the wrong tool.
I also like tools that automatically threshold (errors, discards, utilization percentages, in and out) and alert, so you’re aware of problems. Error and discard percentages above 0.001% (or ideally even lower levels) should not be tolerated. Fix the cable: it’s usually the cabling, but not always, since dirty optics can also be the cause.
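To make the 0.001% threshold concrete, here is a small sketch of the arithmetic. It assumes you already have per-interface error and packet counter deltas for some polling interval (how you collect them is up to your tooling); the interface names and counts are invented.

```python
THRESHOLD_PCT = 0.001  # the 0.001% level mentioned above


def error_percent(error_delta, packet_delta):
    """Errors as a percentage of packets over one polling interval."""
    if packet_delta <= 0:
        return 0.0
    return 100.0 * error_delta / packet_delta


def interfaces_over_threshold(samples, threshold_pct=THRESHOLD_PCT):
    """samples: dict of interface name -> (error_delta, packet_delta).

    Returns a sorted list of interface names whose error percentage exceeds the threshold.
    """
    return sorted(
        name
        for name, (errors, packets) in samples.items()
        if error_percent(errors, packets) > threshold_pct
    )


if __name__ == "__main__":
    # Hypothetical counter deltas: 5 and 50 errors per million packets.
    samples = {"Gi1/0/1": (5, 1_000_000), "Gi1/0/2": (50, 1_000_000)}
    print(interfaces_over_threshold(samples))
```

Note how small the numbers are: 0.001% is only 10 errored packets per million, which is why this catches marginal cables long before users complain.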
Yes, you really do need to manage user and server ports. You may have users who think the network is slow because they’ve had a duplex mismatch or faulty cable for years, and you didn’t know it.
I’m a big fan of cached information. Here’s why: when there’s a network crisis, I often see people spending hours digging out information. Manual traceroute from A to B and B to A, writing down the hops, sketching out a diagram. Then digging out which interfaces are involved and looking at their configs. Etc. It’s time-consuming and error-prone. Don’t go there.
This is where good network management tools can and do integrate all that information to save you time. NetBrain and SolarWinds have path capabilities that do that to a degree. Too many tools provide “visibility” in the sense of having info buried somewhere within them, but you still end up having to dig around far too much in far too many different places to pull together what you need to know.
Good means it’s all right there when you need it. Bad is when it’s all somewhere, but it takes a two-hour scavenger hunt to pull it all out and put it into a paper-based table.
Cached information includes (a) good diagrams, and (b) having your router names in DNS. And please use short device names following a structured naming convention. Don’t include device type in the name: it makes the name long and hard to remember, and it will bite you later (tracking device type is what a good inventory does for you).
Diagrams need to be sustainable (and structured, modular) or they waste time. Cookie-cutter site and campus designs may well mean you can replace diagrams with a generic diagram and an XLS of per-site information. Use common sense. Diagrams have gotten a bad reputation as time-wasters because people overdo them, include too much info, or do them in a way that makes change hard (like poster-sized diagrams).
For those who say they don’t have the time to produce good functional diagrams, I say, “Hey, you waste an hour every time you do the traceroute / sketch thing, plus risking error. You end up doing that repeatedly. Do it right, up front, and save time when it matters!”
Entropy happens — it’s the law (thermodynamics). You need to apply energy to reverse entropy.
Applied to configuration compliance: configurations drift over time — people can be inconsistent or mess up.
Compliance checking tools can help with this, but can be costly (licensing plus adding rules). Homegrown tools have to deal with the complexities caused by different syntax and defaults across various Cisco platforms (and “show run all” does not consistently show defaults the way it is supposed to).
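For a sense of what a homegrown check involves, here is a deliberately naive sketch: it looks for exact required and forbidden lines in a saved config. The rule lists are examples only, not a recommended baseline, and exact-line matching runs straight into the syntax and hidden-default problems described above.

```python
def check_compliance(config_text, required, forbidden):
    """Report required lines that are absent and forbidden lines that are present.

    Matching is on whole lines after stripping whitespace, so a setting that is
    on by default and therefore not displayed will show up as "missing".
    """
    lines = {line.strip() for line in config_text.splitlines() if line.strip()}
    return {
        "missing": [rule for rule in required if rule not in lines],
        "forbidden_present": [rule for rule in forbidden if rule in lines],
    }


if __name__ == "__main__":
    # Example rules and config for illustration only.
    required = ["service password-encryption", "no ip http server"]
    forbidden = ["ip http server"]
    config = "hostname rtr1\nservice password-encryption\nip http server\n"
    print(check_compliance(config, required, forbidden))
```

Even this toy version shows where the effort goes: the code is trivial, while maintaining correct rules per platform and OS version is the real cost.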
New IOS code is risky, but when I see a device that hasn’t been rebooted in 7 years, my reaction is “that’s pretty robust, well-done Cisco (or other vendor)” followed by “oh, but security patches haven’t been applied”.
NetCraftsmen and Cisco generally recommend the “N-1” approach, as in the code version prior to the latest: one that other sites have tested for you, in which the severe / common bugs have been found, and which has had several patch updates.
We also recommend periodically refreshing code to N-1, perhaps once or twice a year. Many sites don’t remember to do this.
Most net management tools roll up historical measurement data, which flattens out spikes in traffic.
For capacity planning, you might pick some number such as 95th percentile, or 80th percentile, and capture that traffic measure (inbound, outbound) to Excel for key interfaces. Say you do so monthly. Then you can graph the data points, apply a trend line, and plug in your annual or quarterly capacity targets. By doing this you get a handle on your perceptions versus actual data, which permits learning and improvement.
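The trend-line step is what Excel does when you add a linear trend to a chart. As a sketch of the same arithmetic, the code below fits an ordinary least-squares line to monthly readings and estimates how many months remain until a capacity target is crossed. The sample readings and the 200 Mbps target are invented, and a straight line is only one possible growth model.

```python
def fit_trend(values):
    """Least-squares line through (0, values[0]), (1, values[1]), ...

    Returns (slope, intercept). Needs at least two data points.
    """
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def months_until(values, capacity):
    """Months after the last reading until the trend line reaches capacity.

    Returns None if the trend is flat or declining.
    """
    slope, intercept = fit_trend(values)
    if slope <= 0:
        return None
    crossing = (capacity - intercept) / slope
    return max(0.0, crossing - (len(values) - 1))


if __name__ == "__main__":
    # Hypothetical monthly 95th percentile readings in Mbps.
    readings = [100, 110, 120, 130]
    print(months_until(readings, capacity=200))
```

Comparing the estimate against what you expected is the “perceptions versus actual data” learning loop in action.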
Thanks to our Terry Slattery, I like his key point about percentile data: 95th %-ile means that 5% of your measurements were as bad or worse. So, with per-minute data, 72 of the day’s 1,440 one-minute averages were as bad as or worse than the 95th %-ile (5% of 1440 minutes = 72 minutes). Elaborating on this is a separate blog topic.
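The percentile arithmetic can be checked with a short sketch. It uses the nearest-rank definition of a percentile, which is one common convention (tools differ in how they interpolate), and synthetic data standing in for one day of per-minute samples.

```python
def percentile_nearest_rank(samples, percentile):
    """Nearest-rank percentile: the smallest sample with at least
    `percentile` percent of all samples at or below it.

    `percentile` is an integer from 1 to 100.
    """
    ordered = sorted(samples)
    rank = (percentile * len(ordered) + 99) // 100  # integer ceiling of p*n/100
    return ordered[rank - 1]


def count_above(samples, level):
    """Number of samples strictly above the given level."""
    return sum(1 for value in samples if value > level)


if __name__ == "__main__":
    # Synthetic day of per-minute utilization samples: 1440 distinct values.
    day = list(range(1, 1441))
    p95 = percentile_nearest_rank(day, 95)
    print(p95, count_above(day, p95))
```

With 1,440 distinct samples, 72 minutes sit above the 95th percentile value, which is more than an hour a day of traffic worse than the number on your capacity report.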
Time during Change Windows goes fast. Prep thoroughly in advance; having configlets, rollback configlets, phone numbers / contact info, and all necessary info at hand is key to efficiency. A couple of large sites use tabs in one XLS to bundle it all up in one place.
Having a robust testing plan is also key. Don’t go off the deep end there (depending on criticality). This is more of a process item than a periodic task: it is about improving how you go about changes.
Experience says that hasty prep frequently correlates with cutover delays and snags. Failure to plan testing can mean gaps that can then bite you Monday morning.
Among other things, people can forget to do things, e.g. they add a VLAN to downlinks, but not to a vPC or other trunk between core switches. It can then take a while to troubleshoot that, and that is time you don’t have.
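The forgotten-VLAN example is the kind of thing a pre-change script can catch. The sketch below assumes you have the allowed VLAN lists for the downlinks and for the inter-core trunk as Python sets, for instance parsed from your planned configlets; the port names and VLAN numbers are hypothetical.

```python
def vlans_missing_on_trunk(downlink_vlans, trunk_vlans):
    """Return VLANs carried on any downlink but absent from the inter-core trunk.

    downlink_vlans: dict of downlink name -> set of allowed VLAN IDs.
    trunk_vlans: set of VLAN IDs allowed on the trunk between the core switches.
    """
    needed = set().union(*downlink_vlans.values())
    return sorted(needed - set(trunk_vlans))


if __name__ == "__main__":
    # Hypothetical planned state after the change: VLAN 30 was added to a downlink only.
    downlinks = {"Po11": {10, 20, 30}, "Po12": {10, 20}}
    core_trunk = {10, 20}
    print(vlans_missing_on_trunk(downlinks, core_trunk))
```

Running a check like this against the configlets before the window costs minutes; finding the same gap by troubleshooting during the cutover costs far more.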
VIRL modeling of changes in advance can help — although L2 is a bit of an issue there. VIRL can at least catch syntax and routing issues.
A second related practice is validating Layers 1-3 early in the change. Connectivity issues can masquerade as routing or higher-level issues, consuming valuable cutover time. This is CCIE lab advice as well: check your basics (links up, addressing, routing adjacencies all up and stable) before you spend time on complex symptoms.
Disaster Recovery (DR) is a subject large enough for a separate blog. My impression is that organizations tend to make all sorts of convenient assumptions, setting themselves up for DR failure. All the teams must get their parts right.
What I will emphasize here is having detailed network plans for DR, including configlets, especially if re-configuration on the fly will be needed. And test them. Nothing works until the DR network is up, so all eyes will be on you!
Network and App teams really need to talk about how the apps’ DR failover is intended to work. This facilitates appropriate design, automated failover, and reduced finger pointing if DR happens. Periodic testing helps.
It looks like I’ve used up my lifetime quota of words on the DR topic.
I’ve also written a lot on network management and some of the other topics above. Rather than laboring to provide a long list of links nobody will read, if you’re interested, try Google searching “network management welcher site:netcraftsmen.com” (substitute topic of interest for ‘network management’).
Thanks to our Mike Kelsen for his internal talk about doing one of the above for a customer. That triggered me to emit this blog covering the broader range of operational processes.
Comments are welcome, whether in agreement or in constructive disagreement with the above. I enjoy hearing from readers and carrying on deeper discussion via comments. Thanks in advance!