Tensormesh raises $20M to cut LLM memory waste
Tensormesh raised $20 million from Nvidia, AMD and CoreWeave to reduce LLM reprocessing with KV caching.

Tensormesh raised $20 million to reduce LLM reprocessing with KV caching.
Tensormesh has raised $20 million to attack a problem every AI team feels in production: large language models keep recomputing the same context over and over. The round brings its total funding to $24.5 million and arrives with the launch of Tensormesh Inference.
| Metric | Value | Why it matters |
|---|---|---|
| New funding | $20 million | Fresh capital from major AI infrastructure players |
| Total raised | $24.5 million | Shows the company had already attracted serious backing |
| Reported cache hit rate | 70%+ | More than two-thirds of prompts can skip recomputation |
| Latency and GPU spend reduction | Up to 10x | Claims the biggest payoff for agentic workloads |
Why LLMs waste so much compute
Get the latest AI news in your inbox
Weekly picks of model releases, tools, and deep dives — no spam, unsubscribe anytime.
No spam. Unsubscribe at any time.
The basic problem is easy to explain and expensive to fix. In a typical deployment, each prompt is treated like a fresh request, even when the model has already seen most of the same context in the same conversation or document.

That means the GPU keeps reprocessing tokens it has already handled. For chatbots, retrieval-heavy apps, and agentic systems that chain many steps together, the wasted work adds up fast.
Tensormesh says its answer is KV caching, a technique that stores intermediate data generated while the model processes a prompt. Instead of rebuilding that internal state every time, the system can reuse it when the next request arrives.
- Cache hit rates above 70% mean most prompts avoid full recomputation.
- The company says some workloads can see 10x lower latency and GPU spending.
- The software is built on the open-source LMCache project.
Why Nvidia, AMD and CoreWeave wrote checks
The investor list says a lot about where this company fits. Nvidia, AMD, and CoreWeave all sell or operate infrastructure that gets more valuable when customers squeeze more useful work out of every GPU cycle.
That makes Tensormesh interesting for a simple reason: it does not try to replace inference hardware. It tries to make the hardware people already buy run hotter, longer, and with less waste.
Founder and CEO Junchen Jiang framed the company’s pitch around a bigger idea than caching alone. As he put it, “Tensormesh offers a new vision on the significance of the intermediate data that LLMs generate when processing a prompt.”
“Tensormesh offers a new vision on the significance of the intermediate data that LLMs generate when processing a prompt.” — Junchen Jiang, founder and CEO of Tensormesh
That quote matters because it points to the company’s real ambition. Tensormesh is trying to turn intermediate AI state into something teams can measure, price, and optimize like any other infrastructure asset.
What the product actually gives developers
The new Tensormesh Inference service is not just a caching layer with a nice name. The company says it includes a dashboard that turns cache hit rates into dollar savings, plus controls for how much storage gets allocated to the cache.

That matters for teams running different kinds of workloads. A small app with modest traffic does not need the same storage profile as an enterprise agent platform with long context windows and repeated document lookups.
Tensormesh says it offers three deployment paths:
- A serverless API that is compatible with OpenAI standards
- On-demand deployment on dedicated GPU resources
- Reserved enterprise deployments with custom service-level agreements
That mix is smart. It gives startups a low-friction way to test the product, while larger customers can buy into a more controlled setup once they see the savings.
How this compares with the usual inference stack
Most inference optimization stories focus on quantization, batching, or better serving frameworks. Tensormesh is attacking a different layer: repeated prompt state. If its numbers hold up in production, the payoff can be immediate because the model is simply doing less work.
Here is the comparison that matters most for buyers:
- Traditional inference: reprocesses the full context window on each new request
- Tensormesh approach: reuses cached intermediate state when prompts overlap
- Reported result: more than 70% cache hit rates in some customer setups
- Business result: lower GPU spend and faster responses for multi-step agents
That is especially relevant for agentic AI, where systems can make several calls in a row to complete one task. The longer the workflow, the more likely cached state pays off.
The money from this round will go into hardware integrations with Nvidia, AMD, and CoreWeave, plus product development. Tensormesh also says it will keep feeding improvements back into LMCache, which matters if the company wants developer goodwill instead of a closed ecosystem.
What to watch next
Tensormesh is betting that AI infrastructure buyers will care less about model novelty and more about the cost of repeating themselves. That is a sensible bet in 2026, when inference bills are becoming a bigger line item than training for many teams.
The key question is whether cache hit rates stay high once the product leaves carefully tuned early deployments and lands in messy real-world traffic. If they do, Tensormesh could become a standard add-on for agent platforms, long-context assistants, and document-heavy enterprise apps.
For teams already paying too much for repeated inference, the practical move is simple: measure how much of your prompt traffic is actually repeated state. If the answer is high, caching may be worth more than another round of GPU optimization.
// Related Articles
- [IND]
Five AI coding IDEs that fit real workflows
- [IND]
Devin Desktop turns Windsurf into an agent hub
- [IND]
Korea’s Nvidia talks point to an AI factory push
- [IND]
OpenAI should not rush its IPO just to win the AI race
- [IND]
OpenAI updates its Europe privacy policy
- [IND]
OpenAI is right to keep ads out of sensitive chats