
Netflix Simplified Batch Compute with Kueue

netflixtechblog.com

Comments

Anyone know if Netflix does anything for the k8s storage layer? I imagine they are at the scale where etcd starts to go kaboom? Or maybe they have enough cells where that isn’t a problem?

Given Amazon and Google have their own secret sauce for replacing etcd, I am wondering if Netflix does anything special?

This runs on AWS managed EKS these days. This talk goes into more detail about Netflix's special sauce around the k8s control plane: https://www.youtube.com/watch?v=vaTOiXR2KSM

Netflix actually has far fewer cells than you'd expect btw; their special sauce IMO is federation and using a small subset of k8s APIs.

I am surprised a company at that scale is running on managed EKS. Maybe I underestimate how large the clusters are.

EKS can get pretty damn big, well into the thousands of nodes without much special tuning, and beyond that with some care and control plane monitoring. Expensive, though.

Congrats, this is awesome!

It's refreshing to see a tech article that isn't about AI. It feels like 5 years ago.

I see Netflix pumping out tech articles but can't help but notice how much worse the UI experience is getting. Video erroring out, general slowness, etc.

Did they just give up?

It certainly feels like Netflix is now a k8s shop. And it is probably only a matter of time until they start repatriating workloads to optimize for costs. Then the world will sit up and notice.

I don’t get what you’re implying. What is repatriating? You think they will move their workloads to on-prem?

Has something changed in the world that shifts the cloud vs on-prem trade-off calculus now, compared with how it was over the last 15 years?

(I’m as anti-cloud-overspend as the next guy on hn btw. Just trying to make sense of your comment’s worldview.)

Yes, coding agents have reduced the skills/knowledge required to operate workloads on virtualized hardware. K8s and its ecosystem have changed so that they now provide 90% of what you need from the public cloud providers. These are big changes that make 8-15X savings possible by running your own workloads. I think it will be the big players who move first, as they have the most to save and have the resources to make it happen.