[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"article-why-routing-is-the-real-bottleneck-in-model-serving-en":3,"tags-why-routing-is-the-real-bottleneck-in-model-serving-en":35,"related-lang-why-routing-is-the-real-bottleneck-in-model-serving-en":45,"related-posts-why-routing-is-the-real-bottleneck-in-model-serving-en":49,"series-industry-7ab95b03-7468-4fef-93bc-0f6f13e61b25":86},{"id":4,"title":5,"content":6,"summary":7,"source":8,"source_url":9,"author":10,"image_url":11,"keywords":12,"language":19,"translated_content":10,"views":20,"is_premium":21,"created_at":22,"updated_at":22,"cover_image":11,"published_at":23,"rewrite_status":24,"rewrite_error":10,"rewritten_from_id":25,"slug":26,"category":27,"related_article_id":28,"status":29,"google_indexed_at":30,"x_posted_at":10,"tweet_text":10,"title_rewritten_at":10,"title_original":10,"key_takeaways":31,"topic_cluster_id":10,"embedding":10,"is_canonical_seed":21},"7ab95b03-7468-4fef-93bc-0f6f13e61b25","Why routing is the real bottleneck in model serving","\u003Cp data-speakable=\"summary\">Routing, not model execution, is the main constraint in modern model serving.\u003C\u002Fp>\u003Cp>I think the biggest mistake in model serving today is treating routing as a plumbing detail. It is not. At scale, the decision of where a request goes determines latency, utilization, cost, and even whether a system stays stable under load. Netflix’s focus on routing in model serving is the right one because the serving stack is no longer just about running a model fast; it is about choosing the right model, the right replica, and the right path for every request.\u003C\u002Fp>\u003Ch2>Routing now shapes the economics of serving\u003C\u002Fh2>\u003Cp>The first reason routing matters is simple: every wasted request is a direct tax on inference cost. If a system sends traffic to a cold replica, a saturated \u003Ca href=\"\u002Ftag\u002Fgpu\">GPU\u003C\u002Fa>, or a model version that is not the best fit, the serving layer burns money before the model even starts doing useful work. In high-volume environments, small routing inefficiencies multiply into real infrastructure spend.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778278838114-82jm.png\" alt=\"Why routing is the real bottleneck in model serving\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>Netflix is a good example of why this is no longer theoretical. A company serving personalized experiences at massive scale cannot afford to take a naive round-robin approach and call it a day. Routing has to account for model availability, traffic patterns, and operational constraints, because the wrong destination can create tail latency and uneven utilization across the fleet. That is an economics problem, not just an architecture choice.\u003C\u002Fp>\u003Ch2>Routing is now part of model quality\u003C\u002Fh2>\u003Cp>The second reason is that routing affects output quality, not just performance. In modern serving systems, the router often decides which specialized model handles a request, which version is promoted, or whether a fallback path should take over. That means routing policy becomes part of the product surface. If the router is weak, the user experiences a weaker system even when the underlying models are strong.\u003C\u002Fp>\u003Cp>We already see this pattern in systems that use ensembles, canaries, or per-segment model selection. A recommendation model can be excellent for one audience segment and wrong for another. A routing layer that understands request context can preserve relevance, while a simplistic dispatcher can flatten those differences and degrade results. In other words, routing is not separate from model intelligence. It is one of the mechanisms that lets intelligence show up in production.\u003C\u002Fp>\u003Ch2>The old serving mindset is too narrow\u003C\u002Fh2>\u003Cp>The third reason I reject the old mindset is that many teams still design serving around a single model endpoint and then bolt on scaling later. That approach fails once the stack includes multiple models, heterogeneous hardware, traffic shaping, and fast rollouts. The serving problem is no longer “how do I expose inference?” It is “how do I continuously place work across a changing fleet?”\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778278848065-3qv8.png\" alt=\"Why routing is the real bottleneck in model serving\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>That shift is exactly why a series on routing is useful. It signals that the hard part is not the inference call itself but the control plane around it. Once a company has enough traffic, the router becomes the system that keeps experiments isolated, failures contained, and expensive capacity actually used. Teams that ignore that reality end up building brittle serving stacks that look simple on paper and collapse under operational pressure.\u003C\u002Fp>\u003Ch2>The counter-argument\u003C\u002Fh2>\u003Cp>The strongest objection is that routing can become over-engineered. Many teams do not need sophisticated placement logic, dynamic policies, or multi-stage decisioning. A small product with one model, one hardware tier, and modest traffic can get by with a straightforward load balancer and a single deployment target. In that context, spending too much time on routing is wasted effort.\u003C\u002Fp>\u003Cp>That objection is valid at small scale. If your serving footprint is tiny, the router should stay boring. But that is not an argument against routing as a discipline. It is an argument against premature complexity. The moment a team introduces multiple models, rollout strategies, or hardware constraints, routing stops being optional and starts being the mechanism that protects reliability and cost. The mistake is not building routing too early. The mistake is pretending you will not need it later.\u003C\u002Fp>\u003Ch2>What to do with this\u003C\u002Fh2>\u003Cp>If you are an engineer or platform owner, treat routing as a first-class subsystem and give it the same scrutiny you give model training and deployment. Measure tail latency, replica saturation, fallback rates, and per-route cost. Design for policy changes, not just static load balancing. If you are a PM or founder, push your team to define routing requirements alongside model requirements, because the serving strategy you choose will shape both user experience and cloud spend. Build the control plane before scale forces you to.\u003C\u002Fp>",
"Routing, not model execution, is the main constraint in modern model serving.","netflixtechblog.com","https:\u002F\u002Fnetflixtechblog.com\u002Fstate-of-routing-in-model-serving-16e22fe18741?gi=a78a5e08192d",null,"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778278838114-82jm.png",[13,14,15,16,17,18],"Netflix","model serving","routing","latency","inference cost","control plane","en",3,false,"2026-05-08T22:20:22.905795+00:00","2026-05-08T22:20:22.706+00:00","done","fd993627-23d9-4ef6-9244-64aa9ae6387d","why-routing-is-the-real-bottleneck-in-model-serving-en","industry","5b27896f-ad48-4a9a-8b6e-823568d8c669","published","2026-05-09T09:00:14.457+00:00",[32,33,34],"Routing determines latency, cost, and stability in model serving.","Routing policy is part of model quality because it shapes which model serves each request.","Simple load balancing is enough only until traffic, models, and hardware diversity make it fail.",
[36,39,41,42,43],{"name":37,"slug":38},"Model Serving","model-serving",{"name":17,"slug":40},"inference-cost",{"name":15,"slug":15},{"name":16,"slug":16},{"name":13,"slug":44},"netflix",{"id":28,"slug":46,"title":47,"language":48},"wei-shen-me-lu-you-cai-shi-mo-xing-fu-wu-de-zhen-zheng-ping-zh","為什麼路由才是模型服務的真正瓶頸","zh",[50,56,62,68,74,80],{"id":51,"slug":52,"title":53,"cover_image":54,"image_url":54,"created_at":55,"category":27},"6ff3920d-c8ea-4cf3-8543-9cf9efc3fe36","circles-agent-stack-targets-machine-speed-payments-en","Circle’s Agent Stack targets machine-speed payments","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778871659638-hur1.png","2026-05-15T19:00:44.756112+00:00",{"id":57,"slug":58,"title":59,"cover_image":60,"image_url":60,"created_at":61,"category":27},"1270e2f4-6f3b-4772-9075-87c54b07a8d1","iren-signs-nvidia-ai-infrastructure-pact-en","IREN signs Nvidia AI infrastructure pact","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778871059665-3vhi.png","2026-05-15T18:50:38.162691+00:00",{"id":63,"slug":64,"title":65,"cover_image":66,"image_url":66,"created_at":67,"category":27},"b308c85e-ee9c-4de6-b702-dfad6d8da36f","circle-agent-stack-ai-payments-en","Circle launches Agent Stack for AI payments","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778870450891-zv1j.png","2026-05-15T18:40:31.462625+00:00",{"id":69,"slug":70,"title":71,"cover_image":72,"image_url":72,"created_at":73,"category":27},"f7028083-46ba-493b-a3db-dd6616a8c21f","why-nebius-ai-pivot-is-more-real-than-hype-en","Why Nebius’s AI Pivot Is More Real Than Hype",
"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778823055711-tbfv.png","2026-05-15T05:30:26.829489+00:00",{"id":75,"slug":76,"title":77,"cover_image":78,"image_url":78,"created_at":79,"category":27},"b63692ed-db6a-4dbd-b771-e1babdc94af7","nvidia-backs-corning-factories-with-billions-en","Nvidia backs Corning factories with billions","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778822444685-tvx6.png","2026-05-15T05:20:28.914908+00:00",{"id":81,"slug":82,"title":83,"cover_image":84,"image_url":84,"created_at":85,"category":27},"26ab4480-2476-4ec7-b43a-5d46def6487e","why-anthropic-gates-foundation-ai-public-goods-en","Why Anthropic and the Gates Foundation should fund AI public goods","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778796645685-wbw0.png","2026-05-14T22:10:22.60302+00:00",[87,92,97,102,107,112,117,122,127,132],{"id":88,"slug":89,"title":90,"created_at":91},"d35a1bd9-e709-412e-a2df-392df1dc572a","ai-impact-2026-developments-market-en","AI's Impact in 2026: Key Developments and Market Shifts","2026-03-25T16:20:33.205823+00:00",{"id":93,"slug":94,"title":95,"created_at":96},"5ed27921-5fd6-492e-8c59-78393bf37710","trumps-ai-legislative-framework-en","Trump's AI Legislative Framework: What's Inside?","2026-03-25T16:22:20.005325+00:00",{"id":98,"slug":99,"title":100,"created_at":101},"e454a642-f03c-4794-b185-5f651aebbaca","nvidia-gtc-2026-key-highlights-innovations-en","NVIDIA GTC 2026: Key Highlights and Innovations","2026-03-25T16:22:47.882615+00:00",{"id":103,"slug":104,"title":105,"created_at":106},"0ebb5b16-774a-4922-945d-5f2ce1df5a6d","claude-usage-diversifies-learning-curves-en","Claude Usage Diversifies, Learning Curves Emerge",
"2026-03-25T16:25:50.770376+00:00",{"id":108,"slug":109,"title":110,"created_at":111},"69934e86-2fc5-4280-8223-7b917a48ace8","openclaw-ai-commoditization-concerns-en","OpenClaw's Rise Raises Concerns of AI Model Commoditization","2026-03-25T16:26:30.582047+00:00",{"id":113,"slug":114,"title":115,"created_at":116},"b4b2575b-2ac8-46b2-b90e-ab1d7c060797","google-gemini-ai-rollout-2026-en","Google's Gemini AI Rollout Extended to 2026","2026-03-25T16:28:14.808842+00:00",{"id":118,"slug":119,"title":120,"created_at":121},"6e18bc65-42ae-4ad0-b564-67d7f66b979e","meta-llama4-fabricated-results-scandal-en","Meta's Llama 4 Scandal: Fabricated AI Test Results Unveiled","2026-03-25T16:29:15.482836+00:00",{"id":123,"slug":124,"title":125,"created_at":126},"bf888e9d-08be-4f47-996c-7b24b5ab3500","accenture-mistral-ai-deployment-en","Accenture and Mistral AI Team Up for AI Deployment","2026-03-25T16:31:01.894655+00:00",{"id":128,"slug":129,"title":130,"created_at":131},"5382b536-fad2-49c6-ac85-9eb2bae49f35","mistral-ai-high-stakes-2026-en","Mistral AI: Facing High Stakes in 2026","2026-03-25T16:31:39.941974+00:00",{"id":133,"slug":134,"title":135,"created_at":136},"9da3d2d6-b669-4971-ba1d-17fdb3548ed5","cursors-meteoric-rise-pressures-en","Cursor's Meteoric Rise Faces Industry Pressures","2026-03-25T16:32:21.899217+00:00"]