No Escape from Data+ML Lock-in


The ecosystem of tools & services needed for Data & ML prevents workload portability, unlike compute, storage and memory workloads, which can leverage the hybrid cloud.

Update: Cloudflare's recent article on egress pricing argues that egress prices haven't kept pace with technological advances and act as a lock-in mechanism. That compounds the frictions described below.


a16z's article on cloud costs, by Sarah Wang and Martin Casado, calls out how adopting the cloud comes at the expense of operating margins. For several businesses (as the Twitter thread shows), this isn't an issue; top-line growth is more critical. Inevitably, though, a company shifts into a bottom-line or margin focus. This is where Sarah and Martin call out that the cloud taketh. A hybrid cloud strategy manages this dichotomy by using standard OSS offerings to move compute, memory and storage (CMS) workloads flexibly between the cloud and corporate DCs. The problem is that data and ML workloads, critical drivers of value generation in today's tech companies, can't operate in this hybrid cloud model. They are locked to cloud vendors because of the substantial network effect their managed ecosystem provides and the unparalleled price-to-performance ratio that comes from deep vertical integration. Business centrality, combined with the considerable CAGR of Data+ML, makes this far more concerning to me than the portability of CMS workloads.

Before digging in, a point to clarify: Sarah and Martin are talking about bringing workloads from a cloud provider to a company owned & operated data centre (O&O DC) in a one-time exodus (repatriation). General hybrid cloud approaches offer flexibility in both the location of workloads (e.g. Azure, AWS, O&O DCs) and their movement (e.g. multihoming services "continually" based on price). For the assertions above about Data+ML, the distinction isn't highly relevant. It starts to matter once workloads move around much more freely, which I'll cover towards the end (hint: egress fees).

Network Effects in Data & ML Offerings

Building a single ML model requires several services: data ingestion, streaming/batch processing, warehousing and training, to name just a few. End users, such as data scientists and machine learning engineers, repeatedly traverse this entire graph of services to get their work done. As a result, weak links become significant bottlenecks; results suffer because useful tools are effectively walled off or wall times exceed users' fatigue thresholds. By building managed services that optimise this chain-link system, cloud providers unlock a tremendous amount of value. Beyond optimising a golden path, cloud providers work hard to have these services interconnect exceedingly well. This factorial interconnectivity is essential to users navigating amorphous ML and DS problems, letting them switch effortlessly between tools, services and approaches as the problems dictate. This network effect, and the corresponding value the organisation is reaping from it, needs to be overcome when repatriating. Choices here can have substantial impacts on user and organisational productivity.

Compounding Value with Vertical Integration

TPUs from GCP are the best example of this hurdle to workload portability: a custom-designed ASIC accelerator that has an outstanding price-to-performance ratio for select ML inference workloads. Unless you're willing to spin up an ASIC design division and have enough scale to amortise significant hardware design expenses, it's hard to beat this one. It isn't just about the hardware; the entire stack from software to hardware is jointly optimised. At Twitter, we noticed that low-level routines like memcpy are optimised as part of training on GKE (Google's Kubernetes offering) using NVIDIA GPUs. This deep vertical integration compounds the raw performance gains of cloud providers' strategic assets (e.g. TPUs). The rapid technological evolution in the ML space, both hardware and software, isn't helping ease portability. Replicating this degree of optimisation requires a skills-diverse talent pool and strategic partnerships with hardware vendors, further raising the bar for workload repatriation.

Other Frictions: Operations & Egress

Behind the success of cloud providers are substantial operations teams. Deep operational knowledge helps drive costs down, and building this from scratch in a vertically integrated ecosystem takes a lot of patience and commitment. The number of services required in Data+ML use cases magnifies operational needs too.

Colocating data consumers and producers is the status quo best practice, but this doubles requirements when repatriating. The problem is that there isn't an alternative: separating these two workloads gets pinched by egress costs. If GCP has the best Data+ML ecosystem and AWS has the best CMS solutions, what's to be done? Lay fibre to peering points? Expensive; however, cloud providers offer a direct connection product with near-zero per-unit costs, which makes this a viable margin improvement tactic. It does sound an awful lot like a telecoms business when you start plumbing many distinct cloud vendors across several geographies, though, bringing along a comparable bar for return-on-capital [1].
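To make the egress trade-off concrete, here is a rough break-even sketch. Every price in it is a placeholder I've picked for illustration, not a quote from any provider, and the function names are hypothetical. It assumes a flat per-GB rate and a fixed monthly port fee for the direct connection, which glosses over tiered pricing and the cost of the fibre itself.

```python
def monthly_internet_egress_cost(tb_per_month: float, price_per_gb: float) -> float:
    """Cost of moving data out over the public internet at a flat per-GB rate."""
    return tb_per_month * 1000 * price_per_gb


def monthly_direct_connect_cost(
    tb_per_month: float, port_fee_per_month: float, price_per_gb: float
) -> float:
    """Fixed monthly port fee plus a (lower) per-GB rate over the direct link."""
    return port_fee_per_month + tb_per_month * 1000 * price_per_gb


def break_even_tb(
    port_fee_per_month: float, internet_price_per_gb: float, direct_price_per_gb: float
) -> float:
    """Monthly volume in TB above which the direct connection is cheaper."""
    saving_per_gb = internet_price_per_gb - direct_price_per_gb
    if saving_per_gb <= 0:
        # The direct link never pays for itself if it isn't cheaper per GB.
        return float("inf")
    return port_fee_per_month / saving_per_gb / 1000
```

With a hypothetical port fee of 2,000 a month, 0.08 per GB over the internet and 0.02 per GB over the direct link, the break-even lands at roughly 33 TB a month. Below that volume the fixed fee dominates; above it, the direct connection starts improving margins.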

What is to be done?

Optimising for margins by repatriating cloud workloads is a reasonable solution depending on the focus of your company. Unfortunately, it isn't an option for Data+ML. OSS solutions, like Kubeflow, are single-point solutions with no end-to-end ecosystem emerging (yet). While pathways to portability on the software stack are emerging, there's little on the hardware side, let alone competitive vertically integrated offerings. The ongoing and rapid evolution of the ML domain is both a feature and a bug: on the one hand, there are opportunities to redress this barrier; on the other, technical investments become obsolete much sooner.

I'm not sure if a general product that competes with the cloud can exist. Still, for established and well-understood use cases, and given an appetite to lose on performance but save on price, a repatriation option is definitely viable. Moving ML workloads between clouds is a much more feasible proposition as competition between providers fuels a technology arms race; Kubernetes is arguably an example of this. Perhaps such an arms race would yield vendor-independent solutions. Today, a well-heeled Cloud customer with clearly defined use cases and workflows, and a deep appetite to build, maintain and advance a sticky Data+ML platform [2], can take a crack at finding a globally optimal price-performant solution. This becomes easier if they're willing to sacrifice on performance.

Underpinning all of this optimisation is fundamental visibility of your costs! Billing consoles have done a solid job exposing usage, but you often need to extract data for a deeper analysis. GCP, for example, lets you export to BigQuery for detailed analysis. Getting cost details across cloud providers and on-prem is more challenging, especially in a standardised form. I'm hoping this is where tools like Vantage [3] can provide an edge: aggregating and standardising multi-cloud billing. Taking it a step further and seeing product-line level PnL with ease, gauging how Cloud spend (or the lack of it) is affecting product outcomes, is the dream.
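As a minimal sketch of what "standardised form" could mean in practice, the snippet below maps rows from three sources onto one record shape and totals them per provider and month. The input field names (`service`, `usage_start`, `product_code`, `unblended_cost`, `amortised_cost` and so on) are illustrative assumptions rather than the exact export schemas of GCP or AWS, and it assumes every cost is already in the same currency. It isn't how Vantage or any billing console works; it's just the shape of the problem.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CostRecord:
    provider: str
    service: str
    month: str  # "YYYY-MM"
    cost: float  # assumed to be in one common currency


def from_gcp(row: dict) -> CostRecord:
    # Hypothetical field names for a row pulled from a billing export.
    return CostRecord("gcp", row["service"], row["usage_start"][:7], float(row["cost"]))


def from_aws(row: dict) -> CostRecord:
    # Hypothetical field names for a row from a cost and usage report.
    return CostRecord(
        "aws", row["product_code"], row["usage_start_date"][:7], float(row["unblended_cost"])
    )


def from_on_prem(row: dict) -> CostRecord:
    # Hypothetical internal chargeback row for an O&O DC system.
    return CostRecord("on_prem", row["system"], row["month"], float(row["amortised_cost"]))


def total_by_provider_and_month(records) -> dict:
    """Sum costs keyed by (provider, month)."""
    totals = defaultdict(float)
    for record in records:
        totals[(record.provider, record.month)] += record.cost
    return dict(totals)
```

Once everything shares one shape, grouping by a product-line tag instead of by provider is a small change, which is the step towards the product-level PnL view.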

At Twitter, we're still in the early days of operating in a hybrid cloud environment. We've identified that it makes sense to rely on the Cloud for our present needs: focus and velocity. Consequently, in the Data+ML space, we are focused on our partnership with GCP. We have a more balanced perspective for CMS workloads and are thinking about how the cloud and our O&O DCs interact with capacity, technology, and costs. We're working with AWS and GCP here to figure out what makes sense as a company. There's so much more to be done, and over the coming years, I think the industry at large will start figuring out how portability comes together for both Data+ML and CMS workloads.

Epilogue: Spot Pricing

I've also observed a key difference in pricing strategy between CMS and Data+ML products from cloud providers: the lack of spot pricing. CMS products such as EC2 offer a discounted price for surplus capacity, but Data+ML managed products generally don't have this option. Sustained usage or minimum commitment discounts certainly exist, but spot (preemptible on GCP) pricing offers substantially better price points, a pricing feature that never made it to the Data+ML product space. If you're rolling solutions atop managed Cloud infrastructure, spot pricing is still viable, but not for managed services. Unfortunately, high-value Cloud products such as data warehouses come only as a managed service.

Thanks

Spades of thanks to a bunch of folks like Alex Esber and Ben Schaechter for providing feedback on drafts.


  1. I believe Equinix operates in this space and would be a potential partner. I have to call out that I'm not very knowledgeable about optionality and cost trade-offs here.
  2. This is really hard to verify, in a counterfactual sense, given the implicit bias towards convergence when focusing on cost optimisation. The only counterpoint that I've found, and we've seen at Twitter, is to mandate choice: teams have the choice to use the best solution for their needs. This creates more diversity in the solutions pool that customer-focused Platform teams can leverage to understand gaps and evolve offerings.
  3. Founder Ben Schaechter is a good friend of a friend, which is how I discovered Vantage.
