Replicating at Speed
Cloud technologies are changing the world, sometimes faster than businesses can keep up, both as consumers of managed or SaaS applications and on the vendor side.
This blog is about some potential pitfalls and business challenges I’ve noticed recently that affect the ability to compete. Finding solutions is part of remaining competitive.
The folks I was working with were aware of them, and you might be too, but it doesn’t hurt to talk about them some.
Some of the Cloud marketing seems to imply that once you put your applications in the Cloud, any latency problems will go away. If only! See below re “Cloud Native”.
Global companies have learned that users in Australia or India running GUI applications based in the U.S. have poor user experiences. The application might be based in say Amazon East or one Cloud location — exactly where likely makes no difference.
High latency tends to do that.
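To put rough numbers on that, here is a minimal sketch of the best-case round-trip time that physics alone allows. It assumes light in fiber travels at about two-thirds of its vacuum speed, and the path lengths are illustrative round figures rather than measured cable routes. The function names are made up for this example.

```python
# Rough lower bound on round-trip time (RTT) imposed by propagation delay alone.
# Assumptions (not from the original post): light in optical fiber travels at
# roughly two-thirds of its vacuum speed, and the path lengths below are
# illustrative round numbers, not measured cable routes.

SPEED_OF_LIGHT_KM_PER_S = 299_792.458
FIBER_VELOCITY_FACTOR = 2 / 3  # assumed typical value for glass fiber


def min_rtt_ms(path_km: float) -> float:
    """Best-case RTT in milliseconds for a one-way fiber path of path_km."""
    one_way_s = path_km / (SPEED_OF_LIGHT_KM_PER_S * FIBER_VELOCITY_FACTOR)
    return 2 * one_way_s * 1000


def interaction_floor_ms(path_km: float, round_trips: int) -> float:
    """Floor on a user action that needs `round_trips` sequential exchanges."""
    return round_trips * min_rtt_ms(path_km)


if __name__ == "__main__":
    # Hypothetical one-way path lengths in km.
    paths = [
        ("same region", 500),
        ("U.S. East to India", 13_000),
        ("U.S. East to Australia", 16_000),
    ]
    for label, km in paths:
        rtt = min_rtt_ms(km)
        chatty = interaction_floor_ms(km, 20)
        print(f"{label}: RTT >= {rtt:.0f} ms, 20 round trips >= {chatty:.0f} ms")
```

Real paths add routing detours, queuing, and server time on top of this floor, so a chatty GUI that needs many sequential round trips per click will feel slow no matter which cloud hosts it.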
If a cloud vendor / SaaS company has a way to geo-locate you and speed things up, that’s great. If that gets you onto a dedicated global backbone (Microsoft, for example) for a better (or more predictable) experience, that sounds like a good plan. If content can be cached or “geo-sharded” (a term I’ve just made up), then great, IF that meets your needs.
Bonus points if the application’s caching / replication is faster than plane flights (or whatever the geo-location granularity happens to be). Travelling and then playing “what happened to what I put into the system” is not fun!
Sometimes things don’t work out well.
Recent example: I’ve been told that the trouble ticket system from one large SaaS provider can do regional or country-based “organizations”. But they’re separate entities, each geo-located (geo-deposited?) into a nearby provider-run datacenter. The problem is that the “organizations” are all handled like separate companies and don’t share data. If you’re trying to do follow-the-sun support services, that’s a non-starter.
Those who have been down this path generally understand the issues, and know they have to work with the vendors, to get the products evolved to what is needed. Keeping an eye out for alternative products might also be a good idea — if for no other reason, to spur your current vendor on!
If it’s a case of “the network is slow”, take a deep breath, and then explain. And don’t forget:
It might not be wise to point that out to an irate application manager, however. It might be good to look around to see if there’s any aspect of the situation you can change for the better. Or live with what cannot be changed.
Processing a lot of remote data will likely be slow, unless you can get the processing done near the data.
Transferring the big data to near the consumer applications will also be slow.
If you can cache it or replicate the data in advance, then you’ve traded bandwidth (money!) and disk space for speedier consumer interaction with the data.
One answer to latency combined with a lot of data to move could be multi-threaded rsync or another form of replication, along with a “fat pipe” (lots of bandwidth).
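Why would parallel streams help? A back-of-the-envelope sketch, assuming a single TCP stream tops out at roughly window size divided by RTT and ignoring loss, congestion control, and protocol overhead. The window size, RTT, link speed, and data volume below are hypothetical values picked for illustration.

```python
# Sketch: why parallel streams plus a fat pipe help on long, high-latency paths.
# Assumption: one TCP stream is limited to about window_size / RTT. Packet loss,
# congestion control, and protocol overhead are ignored, so real numbers will
# be worse than these.


def stream_throughput_bps(window_bytes: int, rtt_s: float) -> float:
    """Approximate ceiling for one stream, in bits per second."""
    return window_bytes * 8 / rtt_s


def aggregate_throughput_bps(streams: int, window_bytes: int,
                             rtt_s: float, link_bps: float) -> float:
    """Parallel streams add up until the link itself becomes the limit."""
    return min(streams * stream_throughput_bps(window_bytes, rtt_s), link_bps)


def transfer_hours(data_bytes: float, throughput_bps: float) -> float:
    """Time to move data_bytes at the given throughput, in hours."""
    return data_bytes * 8 / throughput_bps / 3600


if __name__ == "__main__":
    # Hypothetical inputs: 1 MiB window, 200 ms RTT, 10 Gbps link, 10 TB of data.
    window, rtt, link, data = 1_048_576, 0.2, 10e9, 10e12
    for streams in (1, 16, 256):
        bps = aggregate_throughput_bps(streams, window, rtt, link)
        hours = transfer_hours(data, bps)
        print(f"{streams:>3} streams: {bps / 1e6:,.0f} Mbps, {hours:,.1f} hours")
```

The point of the model is that on a long path a single stream may use only a small fraction of the pipe, so adding bandwidth alone does not help until enough streams (or larger windows) are in play.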
Conclusion #1: process large data sets where they live or move them in advance to where you (or your processing) live.
Conclusion #2: Depending on what you’re doing with it, a “hybrid distributed data lake” may not work well. There are tools that let you treat distributed data in homogeneous fashion, but that does not necessarily overcome the effects of latency.
I’m still waiting to see my first instance of an organization with teams putting large data sets into different clouds, then discovering they’d be better off if they were in the same cloud.
I must be hanging around Risk Managers too much lately. As a disclaimer, one of my sisters now does Risk Management, and none of the following discussion is her fault.
Has your organization done due diligence on any external services or managed applications it critically depends on? Resiliency of the application for DR and COOP scenarios might be an issue.
For example, certain types of firms (law, engineering) use document repositories. Check documents in, check them out, etc.
For hospitals, managed EPIC comes to mind (and how hard it might be to connect diversely in certain locations, but that’s another story).
Historically, you could spot such externally-hosted applications by the WAN connection to them. More recently, you might be accessing them over the Internet or WAN or via VPN, and instead of calling it a “managed service” it is now called “SaaS”. Same wine (more or less), new bottle — and likely special new subscription pricing.
All those variations boil down to: someone else is running the application, in their datacenter or in the cloud. Hopefully the datacenter in question, wherever it is, has a robust Tier 3 or 4 infrastructure. That may improve over what you could do in-house. If so, that’s a win.
But the other big question, one you might have trouble getting the answer to, is: what measures has the application or SaaS provider taken to ensure uptime and fast failover / recovery? What is their RPO and RTO? Do they test failover and hit those numbers? Etc.
A customer just noted that they can recover a certain application in about 5 minutes. The number from their application’s managed service was in hours. Oops! I assume the difference might be partly attributable to sheer size / volume at the managed service provider.
That conversation touched on another relevant point: it might take management hours to decide to declare a DR event and fail over, especially if resynchronizing database replicas is involved and painful. So RPO = RTO = low technically, but not in practice. The same may well apply to a SaaS offering. “Can we get it back up and blame the Internet, or do we have to fail over and get bad press?”
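One way to make that gap visible is to add the human steps to the technical one. The sketch below is only an illustration: the class name, the three-phase breakdown, and the minute values are assumptions made for this example, not figures from any provider.

```python
# Sketch: effective RTO is more than the technical failover time.
# The three phases and the sample minute values are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class FailoverTimeline:
    detect_min: float   # time to notice and confirm the outage
    decide_min: float   # time for management to declare a DR event
    execute_min: float  # time for the technical failover itself

    @property
    def effective_rto_min(self) -> float:
        """Outage duration as users experience it, in minutes."""
        return self.detect_min + self.decide_min + self.execute_min

    def meets(self, target_rto_min: float) -> bool:
        """True if the whole timeline fits inside the stated RTO target."""
        return self.effective_rto_min <= target_rto_min


if __name__ == "__main__":
    # Hypothetical case: a 5-minute technical failover behind a 3-hour decision.
    timeline = FailoverTimeline(detect_min=10, decide_min=180, execute_min=5)
    print(f"effective RTO: {timeline.effective_rto_min:.0f} minutes")
    print(f"meets a 15-minute target: {timeline.meets(15)}")
```

Asking a provider for each of these numbers separately, and for evidence from actual failover tests, tends to be more revealing than asking for a single RTO figure.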
I’ve seen enough organizations to know that nothing around failover, DR, or COOP can be taken for granted. I’ve seen some thinking that diverged strongly from mine (Of course, as a consultant, I know it all and am always right).
The percentage of real DR / COOP preparedness drops off in proportion to the reciprocal of the square of the number of years since the last major event, i.e. 1 / (years)^2. That rapidly comes close to zero!
Bottom line: a monolithic app running in the cloud is still monolithic. It had better have rapid failover / DR capabilities, processes, etc., because you’re betting your company on someone else getting that right. Do you really know the risks behind that bet?
We have the term “cloud-native” for applications written for the cloud. Depending on who you ask (or which Google result you look at), resiliency is part of that.
For companies with legacy apps, making the web front end cloud native may be fairly straightforward. In general, solving latency for the back end runs into the CAP theorem: distributed database consistency, which is a hard problem. And that’s where the application’s Achilles’ heel may be located.
Anyway, it is probably best to assume that “cloud-native” may or may not include various levels of resiliency. The phrase “cloud native” is highly likely to be subject to marketing over-use with consequent dilution of meaning. So, it might be best to ask about the resiliency, and not just assume it is there. The same applies to resolving your global issues with latency.
Cloud and SaaS are still servers and applications, running somewhere. Unless the application is running in multiple places, “somewhere” might suffer an outage. For that matter, if the app is running in multiple locations, it still might suffer a complexity-induced outage. Development and deployment using popular, well-tested tools reduce the odds of that happening. Following a well-developed process reduces them further.
Humans and networks are involved. Neither of those is 100% reliable.
If your organization is global, you need to be prepared to deal with latency and data gravity as above, as well as circuit costs and quality, and other challenges. Due diligence helps. It may be easier to obtain some level of detail about design and DR / COOP plans internally than from Cloud / SaaS providers.
Comments are welcome, whether in agreement or constructive disagreement with the above. I enjoy hearing from readers and carrying on deeper discussion via comments. Thanks in advance!
Hashtags: #CiscoChampion #TechFieldDay #TheNetCraftsmenWay #Cloud #SaaS
Did you know that NetCraftsmen does network / datacenter / security / collaboration design / design review? (And DR / COOP plan review!) Or that we have deep UC&C experts on staff, including @ucguerilla? For more information, contact us at email@example.com.