Tensormesh raises $20M to cut LLM memory waste

OraCore Editors

Back to home

[IND] May 29, 20266 min readOraCore Editors

Tensormesh raises $20M to cut LLM memory waste

Tensormesh raised $20 million from Nvidia, AMD and CoreWeave to reduce LLM reprocessing with KV caching.

Tensormesh AMD Nvidia KV caching LLM inference

Share LinkedIn

Tensormesh raises $20M to cut LLM memory waste

Tensormesh raised $20 million to reduce LLM reprocessing with KV caching.

Tensormesh has raised $20 million to attack a problem every AI team feels in production: large language models keep recomputing the same context over and over. The round brings its total funding to $24.5 million and arrives with the launch of Tensormesh Inference.

Metric	Value	Why it matters
New funding	$20 million	Fresh capital from major AI infrastructure players
Total raised	$24.5 million	Shows the company had already attracted serious backing
Reported cache hit rate	70%+	More than two-thirds of prompts can skip recomputation
Latency and GPU spend reduction	Up to 10x	Claims the biggest payoff for agentic workloads

Why LLMs waste so much compute

Get the latest AI news in your inbox

Weekly picks of model releases, tools, and deep dives — no spam, unsubscribe anytime.

No spam. Unsubscribe at any time.

The basic problem is easy to explain and expensive to fix. In a typical deployment, each prompt is treated like a fresh request, even when the model has already seen most of the same context in the same conversation or document.

That means the GPU keeps reprocessing tokens it has already handled. For chatbots, retrieval-heavy apps, and agentic systems that chain many steps together, the wasted work adds up fast.

Tensormesh says its answer is KV caching, a technique that stores intermediate data generated while the model processes a prompt. Instead of rebuilding that internal state every time, the system can reuse it when the next request arrives.

Cache hit rates above 70% mean most prompts avoid full recomputation.
The company says some workloads can see 10x lower latency and GPU spending.
The software is built on the open-source LMCache project.

Why Nvidia, AMD and CoreWeave wrote checks

The investor list says a lot about where this company fits. Nvidia, AMD, and CoreWeave all sell or operate infrastructure that gets more valuable when customers squeeze more useful work out of every GPU cycle.

That makes Tensormesh interesting for a simple reason: it does not try to replace inference hardware. It tries to make the hardware people already buy run hotter, longer, and with less waste.

Founder and CEO Junchen Jiang framed the company’s pitch around a bigger idea than caching alone. As he put it, “Tensormesh offers a new vision on the significance of the intermediate data that LLMs generate when processing a prompt.”

“Tensormesh offers a new vision on the significance of the intermediate data that LLMs generate when processing a prompt.” — Junchen Jiang, founder and CEO of Tensormesh

That quote matters because it points to the company’s real ambition. Tensormesh is trying to turn intermediate AI state into something teams can measure, price, and optimize like any other infrastructure asset.

What the product actually gives developers

The new Tensormesh Inference service is not just a caching layer with a nice name. The company says it includes a dashboard that turns cache hit rates into dollar savings, plus controls for how much storage gets allocated to the cache.

That matters for teams running different kinds of workloads. A small app with modest traffic does not need the same storage profile as an enterprise agent platform with long context windows and repeated document lookups.

Tensormesh says it offers three deployment paths:

A serverless API that is compatible with OpenAI standards
On-demand deployment on dedicated GPU resources
Reserved enterprise deployments with custom service-level agreements

That mix is smart. It gives startups a low-friction way to test the product, while larger customers can buy into a more controlled setup once they see the savings.

How this compares with the usual inference stack

Most inference optimization stories focus on quantization, batching, or better serving frameworks. Tensormesh is attacking a different layer: repeated prompt state. If its numbers hold up in production, the payoff can be immediate because the model is simply doing less work.

Here is the comparison that matters most for buyers:

Traditional inference: reprocesses the full context window on each new request
Tensormesh approach: reuses cached intermediate state when prompts overlap
Reported result: more than 70% cache hit rates in some customer setups
Business result: lower GPU spend and faster responses for multi-step agents

That is especially relevant for agentic AI, where systems can make several calls in a row to complete one task. The longer the workflow, the more likely cached state pays off.

The money from this round will go into hardware integrations with Nvidia, AMD, and CoreWeave, plus product development. Tensormesh also says it will keep feeding improvements back into LMCache, which matters if the company wants developer goodwill instead of a closed ecosystem.

What to watch next

Tensormesh is betting that AI infrastructure buyers will care less about model novelty and more about the cost of repeating themselves. That is a sensible bet in 2026, when inference bills are becoming a bigger line item than training for many teams.

The key question is whether cache hit rates stay high once the product leaves carefully tuned early deployments and lands in messy real-world traffic. If they do, Tensormesh could become a standard add-on for agent platforms, long-context assistants, and document-heavy enterprise apps.

For teams already paying too much for repeated inference, the practical move is simple: measure how much of your prompt traffic is actually repeated state. If the answer is high, caching may be worth more than another round of GPU optimization.

// Related Articles

Tensormesh raises $20M to cut LLM memory waste

Why LLMs waste so much compute

Get the latest AI news in your inbox

Why Nvidia, AMD and CoreWeave wrote checks

What the product actually gives developers

How this compares with the usual inference stack

What to watch next

Five AI coding IDEs that fit real workflows

Devin Desktop turns Windsurf into an agent hub

Korea’s Nvidia talks point to an AI factory push

OpenAI should not rush its IPO just to win the AI race

OpenAI updates its Europe privacy policy

OpenAI is right to keep ads out of sensitive chats