How Total Data Center Visibility Benefits Planning, Design


Two years ago, Cisco Fellow Navindra Yadav had an idea: create a self-driving data center. He saw how much time and how many people were devoted to managing data center networks and applications, and how a lack of complete information makes management and troubleshooting more difficult. He knew how important pervasive security is for the data center and realized how hard it is to find attacks. How can you be agile if you’re always trying to keep your data center from breaking? Yadav felt that hardware and software could be united to solve these problems, and he had an idea how to do it. So he started working on what would become Cisco’s new analytics engine, Tetration.

Through the Cisco Champions program, I had the privilege of a pre-announcement briefing, and then attended the product launch in New York City on June 15th. There, Terry Slattery, Colin Lynch, and I were able to spend an hour with Navindra and Tetration Product Manager Jothi Prakash. Navindra explained that this wasn’t a spin-out/spin-in type deal, but rather a homegrown solution. It was developed and tested with Cisco IT until it proved its value to Cisco, which then funded a team to work on it.

Tetration Launch in NYC
From left to right: Lauren Friedman, Denise Donohue, Navindra Yadav, Terry Slattery, Jothi Prakash, Yogesh Kaushik, Colin Lynch

Fast-forward two years, and Cisco IT was ready to migrate an entire Hadoop environment containing eight petabytes of data over the weekend. Because of the application and traffic-flow knowledge gained through Tetration analytics, the manager was confident he was going to make it to his vacation starting that Monday.

By now, you’ve probably seen many postings on how this magic happens. (See Colin’s and Terry’s blog posts.) Real-time metadata is collected from every packet of every flow, every interaction between every device in your data center.

Or, at least that’s the final goal. Right now, “every device” includes the next-gen Nexus 9300 switches and both virtual and bare-metal servers. It can also ingest logs and configuration data from Layer 4-7 services, such as Infoblox. Data is collected through sensors that send telemetry data to Tetration. Server (host) sensors run inside Virtual Machines (VMs), so they see everything the VM sees. This also works on VMs within cloud services, such as AWS, Google, and Azure, and most versions of Linux and Windows are supported.
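To make “telemetry data” a bit more concrete, here is a minimal sketch of what a per-flow record from a host sensor might look like. This is purely illustrative; the field names below are my own assumptions, not Tetration’s actual schema.

    # Hypothetical shape of a per-flow telemetry record from a host sensor.
    # Every field name here is an assumption made for illustration.
    from dataclasses import dataclass

    @dataclass
    class FlowTelemetry:
        timestamp: float      # when the flow was observed (epoch seconds)
        src_ip: str
        dst_ip: str
        src_port: int
        dst_port: int
        protocol: str         # "tcp", "udp", ...
        bytes_sent: int
        packets_sent: int
        process_name: str     # assumed process attribution on the host

    record = FlowTelemetry(1466000000.0, "10.1.1.10", "10.1.2.20",
                           51514, 5432, "tcp", 48213, 60, "postgres")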

For the Nexus switches, network sensors are built into an ASIC, which is connected to the backplane of the switch and sees every packet. The sensors look at the first 160 bytes of the packet to extract the metadata, and then send it to the Tetration analytics engine. There is no CPU involvement on the switches. Cisco has measured the network overhead at less than 1%, and CPU usage on servers at roughly a quarter of a single CPU.
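For a rough feel of what extracting metadata from the first 160 bytes involves, here is a small Python sketch that pulls a 5-tuple out of an Ethernet/IPv4/TCP frame. It is my own illustration of header-only parsing, not Cisco’s ASIC logic.

    # Illustrative only: parse basic flow metadata from the first 160 bytes
    # of a packet, the way a header-only sensor conceptually works.
    import struct
    from typing import Optional

    HEADER_BUDGET = 160  # only the first 160 bytes of each packet are inspected

    def extract_metadata(packet: bytes) -> Optional[dict]:
        """Pull a 5-tuple (plus TCP flags) out of an Ethernet/IPv4/TCP frame."""
        frame = packet[:HEADER_BUDGET]
        if len(frame) < 34:                      # Ethernet (14) + minimal IPv4 (20)
            return None
        ethertype = struct.unpack("!H", frame[12:14])[0]
        if ethertype != 0x0800:                  # IPv4 only in this sketch
            return None
        ihl = (frame[14] & 0x0F) * 4             # IP header length in bytes
        _, _, total_len, _, _, ttl, proto, _, src, dst = struct.unpack(
            "!BBHHHBBH4s4s", frame[14:34])
        meta = {
            "src_ip": ".".join(str(b) for b in src),
            "dst_ip": ".".join(str(b) for b in dst),
            "protocol": proto,
            "ip_total_len": total_len,
        }
        if proto == 6:                           # TCP: pull ports and flags too
            tcp = frame[14 + ihl : 14 + ihl + 20]
            if len(tcp) == 20:
                sport, dport, _, _, _, flags, _, _, _ = struct.unpack("!HHLLBBHHH", tcp)
                meta.update({"src_port": sport, "dst_port": dport, "tcp_flags": flags})
        return meta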

It sounds like a really powerful information tool, which brings us to the obvious questions: Is this a useful tool, or a solution looking for a problem? And what would drive you to install an entire rack of gear that lists for $3 million to $4 million?

I have a few thoughts.

Data Center Planning and Design

Are you really confident that you know all the applications in your data center, and all the details about those applications? If so, you’d be the first person I’ve heard of who is! Even Cisco found that it could decommission more than 40% of its VMs based on information from Tetration — a savings in costs and resources. If you’re moving to centralized, software-defined type control, you need to know data flows and application dependencies. What talks to what, with what type of traffic? What access is critical to make this application work? Application Centric Infrastructure (ACI), for example, is based on a whitelisting model. If you don’t know all the connections you need to permit, you could end up causing an outage.
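As a deliberately oversimplified illustration of turning observed flows into a whitelist, here is a short Python sketch. The flow records and rule format are hypothetical; they are not Tetration’s or ACI’s actual data model.

    # Illustrative sketch: derive "what talks to what" from observed flows,
    # then turn every observed dependency into an explicit permit rule.
    from collections import defaultdict

    observed_flows = [
        {"src": "web-01", "dst": "app-01", "proto": "tcp", "dst_port": 8443},
        {"src": "app-01", "dst": "db-01",  "proto": "tcp", "dst_port": 5432},
        {"src": "app-01", "dst": "db-01",  "proto": "tcp", "dst_port": 5432},
    ]

    def build_dependency_map(flows):
        deps = defaultdict(int)
        for f in flows:
            deps[(f["src"], f["dst"], f["proto"], f["dst_port"])] += 1
        return deps

    def to_whitelist(deps):
        return [{"action": "permit", "src": s, "dst": d, "proto": p, "port": port}
                for (s, d, p, port) in deps]

    for rule in to_whitelist(build_dependency_map(observed_flows)):
        print(rule)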

When you’re expanding or refreshing your data center, or building out a new one, accurate application information leads to precise planning and design. How many servers will you need? What are the bandwidth and latency requirements for each application? For some applications, the latency between racks is too high, so dependent servers must sit in the same rack; you must be able to identify all of them. Can you imagine how valuable it would be to know that you’ve accounted for every application when building out a new data center? Or to know exactly what application traffic goes through a particular switch before you replace it? Just think of how this would help minimize downtime during a migration.
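Here is a small sketch, with made-up numbers, of how that application knowledge could feed placement and capacity decisions: aggregate bandwidth per application and flag dependent server pairs whose latency budget cannot be met across racks.

    # Hypothetical planning helper: per-application bandwidth totals plus a
    # list of server pairs that must share a rack to meet their latency budget.
    from collections import defaultdict

    INTER_RACK_LATENCY_US = 50   # assumed measured rack-to-rack latency (microseconds)

    dependencies = [
        # (application, server_a, server_b, average Mbps, latency budget in microseconds)
        ("orders",  "app-01", "cache-01", 200, 20),
        ("orders",  "app-01", "db-01",    400, 500),
        ("billing", "web-02", "app-02",   150, 2000),
    ]

    bandwidth_per_app = defaultdict(int)
    must_share_rack = []
    for app, a, b, mbps, budget_us in dependencies:
        bandwidth_per_app[app] += mbps
        if budget_us < INTER_RACK_LATENCY_US:   # too tight to cross racks
            must_share_rack.append((a, b))

    print(dict(bandwidth_per_app))   # {'orders': 600, 'billing': 150}
    print(must_share_rack)           # [('app-01', 'cache-01')]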

Tetration has a replay function that lets you do a “what-if” analysis on stored data. I can think of several uses for this, such as:

  • Predicting future growth. You can create a much more accurate picture of future needs by analyzing past usage and changes over time. This would help you size the data center appropriately, so it’s neither over- nor underbuilt. Understanding your company’s data patterns would also help you plan future growth, enabling you to predict future performance and “right-time” data center expansions.
  • Testing data center changes. When Cisco was moving its Hadoop cluster to a new data center, it was confident the new system had been architected correctly because the company used Tetration to replay data flows. This allowed it to correct design flaws and ensure that resiliency was working as planned — before moving anything to production. You can test the effects of adding or subtracting switches or servers or making policy changes. Based on historical data, you can see the effect a change would have at various times in your business cycle (e.g., a retailer’s holiday rush).
  • Testing policy changes. Once you collect data and have a baseline for a specific application, you can run a simulation of proposed policy changes. This will tell you the exact effect that policy would have had on traffic. The simulation ensures that a policy does what it should before you put it into production; a rough sketch of this kind of replay follows this list.
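To make that last bullet concrete, here is a rough sketch of a policy replay: run historical flow records against a proposed whitelist and report which flows the new policy would have blocked. The flow and rule structures are my own simplification, not Tetration’s replay API.

    # Illustrative "what-if" replay: which historical flows would a proposed
    # whitelist have blocked?
    proposed_policy = [
        {"src": "web", "dst": "app", "port": 8443},
        {"src": "app", "dst": "db",  "port": 5432},
    ]

    historical_flows = [
        {"src": "web", "dst": "app", "port": 8443},
        {"src": "app", "dst": "db",  "port": 5432},
        {"src": "web", "dst": "db",  "port": 5432},   # not covered by the policy
    ]

    def permitted(flow, policy):
        return any(flow["src"] == r["src"] and flow["dst"] == r["dst"]
                   and flow["port"] == r["port"] for r in policy)

    blocked = [f for f in historical_flows if not permitted(f, proposed_policy)]
    print(f"{len(blocked)} of {len(historical_flows)} historical flows would be blocked")
    for f in blocked:
        print("  would block:", f)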

Security and Troubleshooting

I’ve lumped security and troubleshooting together because they use the same capabilities within Tetration. Both benefit from the ability to replay data flows and interactions. On the security side, you could replay an attack to understand precisely what happened and which devices or applications were affected. You can then use that information to harden against similar attacks, or to recognize and shut them down as they happen. When troubleshooting, it would be nice to replay the traffic that had a problem, or maybe a specific data flow, to see exactly where the issue occurred. That hard data settles the “is it the network or the application?” question.
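As a simplified example of replay on the security side, assume we have stored flow records with timestamps: list every host that exchanged traffic with a suspected-compromised host during the incident window. The host names, timestamps, and window below are invented.

    # Illustrative incident replay: who talked to the suspect host during the window?
    incident_host = "db-01"
    window_start, window_end = 1466000000, 1466003600   # epoch seconds (hypothetical)

    stored_flows = [
        {"ts": 1466000120, "src": "app-01", "dst": "db-01"},
        {"ts": 1466001500, "src": "db-01",  "dst": "backup-01"},
        {"ts": 1465900000, "src": "web-01", "dst": "db-01"},   # outside the window
    ]

    touched = {
        f["src"] if f["dst"] == incident_host else f["dst"]
        for f in stored_flows
        if window_start <= f["ts"] <= window_end
        and incident_host in (f["src"], f["dst"])
    }
    print("hosts to investigate:", sorted(touched))   # ['app-01', 'backup-01']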

Both rely on visibility. As David Goeckeler, Cisco’s Senior Vice President and General Manager of Networking and Security, says, “You can’t stop what you can’t see.” Tetration gives you that visibility because information is collected on all data, not just samples. On the security side, you want to identify an attack as soon as possible, so you need visibility across as many attack vectors as possible.

The gold standard in troubleshooting is to be proactive rather than reactive. Given that Tetration sees all traffic, it can create baselines of expected flows and patterns. Because Tetration can search through billions of flows in less than a second, it can quickly recognize anomalies, at which point a human can be brought in to decide how to respond. The system will suggest remediations and learn from the decisions humans make. The eventual goal is for the system to remediate problems on its own, as soon as they appear. I think we all recognize that will require a level of trust from the human operators!
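The baseline-and-anomaly idea can be sketched in a few lines. This is only the statistical intuition, a z-score against historical flow counts with invented numbers, not Cisco’s actual detection algorithm.

    # Minimal anomaly check: compare the latest interval's flow count to a
    # learned baseline and flag large deviations.
    from statistics import mean, stdev

    baseline_counts = [980, 1010, 995, 1005, 990, 1000, 985]   # flows per minute, historical
    current_count = 2400                                        # flows in the latest minute

    mu, sigma = mean(baseline_counts), stdev(baseline_counts)
    z = (current_count - mu) / sigma
    if abs(z) > 3:
        print(f"anomaly: {current_count} flows/min is {z:.1f} sigma from a baseline of {mu:.0f}")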

Unless you’re in one of those industries that needs to capture everything, the current version is probably overkill for a data center with fewer than 5,000 endpoints. Note that the endpoints do not have to be in one single data center. So long as you have IP connectivity, you can monitor more than one data center by creating encrypted tunnels back to the Tetration cluster. Cisco’s cluster in San Jose, for example, monitors the company’s Texas and North Carolina data centers. WAN bandwidth usage is generally less than a gigabyte. Cisco foresees smaller systems for smaller data centers in the future.

Bottom line: If your data is critical – if your organization’s services can’t go down – then the information and capabilities provided by this system can help. If you’re rolling out software-defined controllers of any stripe, this will improve your outcome. In my opinion, it’s a large step toward the future of data centers and networking in general. We’d be happy to have a deeper discussion of whether Tetration might be right for you. Just reach out.
