Click here to request your free 14-day trial of Cisco Umbrella through NetCraftsmen today!

The topic of Data Center Interconnect at L2, and failover between data centers, seems to be a hot topic! I’d like to briefly note some of the great interaction that’s been occurring with some of my prior blogs, and also note an interesting article by Ivan Pepelnjak, whose technical skills I highly respect.

Concerning Data Center Interconnect (DCI), I really like the new Cisco OTV technology. (OK, you probably guessed that from all the prior articles about it.) Whatever technique you use for DCI, interconnecting data centers at L2, you have the problem of managing and optimizing inbound and outbound traffic to use the shortest path to or from the active virtual machine (VM) or cluster member(s).

Comments and discussion about that can be find in various of my blogs. I appreciate the feedback and the chance to learn and discuss! See in particular the article on Understanding Layer 2 over Layer 3 (Part 2). I’ve noted below recent prior blogs and how many comments, for those inclined to go spelunking around this topic.

Other Takes on DCI and Optimal Routing

I was interested and amused to see a blog by Ivan Pepelnjak about Long Distance vMotion, at http://searchnetworking.techtarget.com/feature/Long-distance-vMotion-traffic-trombone-so-why-go-there?asrc=EM_NLN_13283529&track=NL-79&ad=813626.

Ivan makes a good case for multiple servers and Server Load Balancers in front of them, especially as a way of avoiding Data Center bridging. He also mentions “traffic tromboning”, which is the sub-optimal traffic flows I’ve blogged about.

I’m not as allergic to data center bridging as he seems to be, despite having seen my share of data center meltdowns due to spanning tree loops. I hear good things about traffic storm control, and with L2 over L3 it seems like the L3 encapsulation will fail or reach capacity before the problem spreads as widely — now that would be an interesting lab experiment. On the other hand, I like the idea of not using a risky technology unless you’ve got a darn good reason. Ivan raises one: some sort of server or application running on a single platform. And also of keeping management complexity down.

Overdoing It?

Frankly, with the DCI techniques, and especially with OTV, I worry about the “beer principle”. (One beer might be a good thing. Too many beers leads to a headache.) Applied to OTV, you start doing it on a small scale, then it grows, and then one day you discover you’ve pushed it too far, it is unstable or having some problem … and you have a headache.

I particularly worry about this in medical settings, and perhaps federal government datacenters. Hospitals tend to use real estate for medical needs, in good part because frankly that produces revenue. The management usually views IT as a necessary evil, a cost center. Funding for new hospitals and clinics takes priority over a new datacenter. Consequently, you end up with little datacenters scattered all over. A variant happens with the federal government. The current data center gets outgrown, then there isn’t funding to build a new one that can hold all the servers (and after all, the old one is still working, albeit perhaps maxed on power and cooling or space), so another smallish one gets added one. Then maybe some space shows up in some other agency’s datacenter (after the recent consolidation push), so that gets tacked on. When you view the overhead costs of operating small datacenters compared to one big one, this may be vastly suboptimal — but there’s apparently no good political and financial way out of it.

Over time, this scattered data center approach leads to design problems. One generally brings WAN links into the datacenter, but because the “main and backup” datacenters changed over time, the WAN links come in all over the place. Ditto Internet connections. With L2 between them, one might use stateful firewall pairs split across data centers, also L2 server clusters.

Split Firewall Pairs or Server Clusters

I do have a different concern about that situation, which is robustness of the cluster or stateful firewall pair. Links between datacenters might be L2 (or L2 over L3), but are generally not as reliable and error-free as links within a single datacenter. What happens to your firewall pair, or your Server Load Balancer pair, or your cluster, if the link between sites goes flaky but not down? I’ve heard horror stories about CheckPoint firewalls and Microsoft Exchange clusters where both sides thought they were primary. The issue seems to be that packet loss is not expect by the vendor, and when the link comes back up … SURPRISE! In the case of CheckPoint, they both update each other’s policy, which can lead to corrupted policy (i.e. missing ACL rules). In the case of Exchange, I’ve heard from one person who had experienced it that the servers started to re-synch their databases of email, but for 6 hours they then did not respond to email clients. I freely admit, I do not know nor have tested current versions of either.

If you have experience with situations like this (where the app continued to work correctly, or where there were problems), please add a comment — more data is needed on this phenomenon, and collectively we have a lot more experience than I by myself do!

For some prior musings about this issue, see

Prior DCI and OTV Articles

Here are just the recent ones, see also the archives.

Peter Welcher

Peter Welcher

Architect, Operations Technical Advisor

A principal consultant with broad knowledge and experience in high-end routing and network design, as well as data centers, Pete has provided design advice and done assessments of a wide variety of networks. CCIE #1773, CCDP, CCSI (#94014)

View more Posts

 

Nick Kelly

Cybersecurity Engineer, Cisco

Nick has over 20 years of experience in Security Operations and Security Sales. He is an avid student of cybersecurity and regularly engages with the Infosec community at events like BSides, RVASec, Derbycon and more. The son of an FBI forensics director, Nick holds a B.S. in Criminal Justice and is one of Cisco’s Fire Jumper Elite members. When he’s not working, he writes cyberpunk and punches aliens on his Playstation.

 

Virgilio “BONG” dela Cruz Jr.

CCDP, CCNA V, CCNP, Cisco IPS Express Security for AM/EE
Field Solutions Architect, Tech Data

Virgilio “Bong” has sixteen years of professional experience in IT industry from academe, technical and customer support, pre-sales, post sales, project management, training and enablement. He has worked in Cisco Technical Assistance Center (TAC) as a member of the WAN and LAN Switching team. Bong now works for Tech Data as the Field Solutions Architect with a focus on Cisco Security and holds a few Cisco certifications including Fire Jumper Elite.

 

John Cavanaugh

CCIE #1066, CCDE #20070002, CCAr
Chief Technology Officer, Practice Lead Security Services, NetCraftsmen

John is our CTO and the practice lead for a talented team of consultants focused on designing and delivering scalable and secure infrastructure solutions to customers across multiple industry verticals and technologies. Previously he has held several positions including Executive Director/Chief Architect for Global Network Services at JPMorgan Chase. In that capacity, he led a team managing network architecture and services.  Prior to his role at JPMorgan Chase, John was a Distinguished Engineer at Cisco working across a number of verticals including Higher Education, Finance, Retail, Government, and Health Care.

He is an expert in working with groups to identify business needs, and align technology strategies to enable business strategies, building in agility and scalability to allow for future changes. John is experienced in the architecture and design of highly available, secure, network infrastructure and data centers, and has worked on projects worldwide. He has worked in both the business and regulatory environments for the design and deployment of complex IT infrastructures.