
EntityBench Tackles Long-Range Video Consistency

EntityBench measures whether video models keep characters, objects, and locations consistent across long, multi-shot sequences.


The paper "EntityBench: Towards Entity-Consistent Long-Range Multi-Shot Video Generation" targets a very practical problem: video generators can look good in a single shot but fall apart when the same character, object, or location has to reappear later in a longer sequence. For developers building narrative video systems, that gap matters because visual continuity is part of the product, not a nice-to-have.

The paper argues that current evaluations do not stress this problem enough. Many existing tests use independently generated prompts, cover only a limited set of entities, and rely on simple consistency metrics. That makes it hard to compare models in a standardized way, especially when the real challenge is keeping identities stable over long distances in a story.

What problem this paper is trying to fix


Long-range multi-shot video generation is more than stitching together attractive clips. A system has to preserve who is who, what belongs where, and whether the same visual details survive across shots that may be separated by many intervening scenes. In practice, that means a character should still look like the same character after a long recurrence gap, an object should remain recognizable, and a location should stay visually coherent when it comes back into the story.


The authors say existing evaluation setups do not really capture that. If prompts are generated independently, they may not force a model to remember an entity across multiple shots. If the metric is too simple, it can miss whether the model actually rendered the entity correctly before scoring consistency. EntityBench is designed to make that failure mode visible.

For engineers, this is the difference between testing a demo and testing a system that can survive a real narrative workload. A model that succeeds on isolated prompt clips may still be unusable for episodic content, storyboarding, or any workflow where continuity is part of the spec.

How EntityBench is built

EntityBench is a benchmark made from real narrative media. It includes 140 episodes and 2,491 shots, which gives it a much more story-like structure than a collection of unrelated prompts. The benchmark uses explicit per-shot entity schedules that track characters, objects, and locations together across the sequence.

The paper organizes the benchmark into easy, medium, and hard tiers. Those tiers go up to 50 shots, with up to 13 cross-shot characters, 8 cross-shot locations, and 22 cross-shot objects. It also includes recurrence gaps spanning up to 48 shots, which is important because consistency problems usually get worse as the distance between appearances grows.

That setup matters because it turns “remember this entity later” into a concrete evaluation task. Instead of assuming a model should somehow preserve identity, EntityBench specifies when each entity should appear again and how far apart those appearances are. That gives researchers a controlled way to test long-range memory effects in generation systems.
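To make the idea of a per-shot entity schedule concrete, here is a minimal Python sketch. The `Shot` fields, the example episode, and the gap computation are invented for illustration and are not the benchmark's actual data format.

```python
from dataclasses import dataclass, field
from collections import defaultdict

# Hypothetical per-shot entity schedule, not EntityBench's real schema.
@dataclass
class Shot:
    index: int
    prompt: str
    characters: list[str] = field(default_factory=list)
    objects: list[str] = field(default_factory=list)
    locations: list[str] = field(default_factory=list)

def recurrence_gaps(shots: list[Shot]) -> dict[str, list[int]]:
    """For each entity, return the gaps (in shots) between consecutive appearances."""
    last_seen: dict[str, int] = {}
    gaps: dict[str, list[int]] = defaultdict(list)
    for shot in shots:
        for entity in shot.characters + shot.objects + shot.locations:
            if entity in last_seen:
                gaps[entity].append(shot.index - last_seen[entity])
            last_seen[entity] = shot.index
    return dict(gaps)

episode = [
    Shot(0, "Mara enters the workshop", characters=["Mara"], locations=["workshop"]),
    Shot(7, "A drone hovers over the harbor", objects=["drone"], locations=["harbor"]),
    Shot(41, "Mara returns to the workshop", characters=["Mara"], locations=["workshop"]),
]
print(recurrence_gaps(episode))  # {'Mara': [41], 'workshop': [41]}
```

The point of making the schedule explicit is that the evaluation knows exactly which entities are supposed to recur and how far apart, so a 48-shot gap is a deliberate stress case rather than an accident of prompt sampling.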

The three-pillar evaluation suite

EntityBench is not just a dataset. It is paired with a three-pillar evaluation suite that separates different parts of the problem so they do not get mixed together in one vague score.

  • Intra-shot quality checks whether each individual shot looks good on its own.
  • Prompt-following alignment checks whether the model followed the requested content.
  • Cross-shot consistency checks whether entities stay stable across shots.

The paper also adds a fidelity gate. Only accurate entity appearances are allowed into cross-shot scoring. That is a useful design choice because it avoids giving a model credit for being “consistent” across shots if it never rendered the entity correctly in the first place.
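As a rough sketch of what fidelity-gated cross-shot scoring could look like, assuming per-appearance fidelity scores and visual embeddings are available: the threshold and the cosine-similarity metric below are placeholders, not the paper's actual protocol.

```python
import itertools
import numpy as np

def gated_consistency(appearances, fidelity_threshold=0.7):
    """appearances: list of (shot_index, fidelity_score, embedding) for one entity.

    Only appearances that pass the fidelity gate contribute to consistency.
    """
    kept = [(idx, emb) for idx, fid, emb in appearances if fid >= fidelity_threshold]
    if len(kept) < 2:
        return None  # not enough valid appearances to score cross-shot consistency
    # Consistency as mean pairwise cosine similarity between the kept appearances.
    sims = []
    for (_, a), (_, b) in itertools.combinations(kept, 2):
        sims.append(float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))))
    return sum(sims) / len(sims)

rng = np.random.default_rng(0)
appearances = [(i, fid, rng.normal(size=16)) for i, fid in [(2, 0.9), (10, 0.4), (37, 0.85)]]
print(gated_consistency(appearances))  # the shot-10 appearance fails the gate and is ignored
```

The gate is what prevents a degenerate win: a model that never renders the entity correctly cannot collect consistency credit for being "stable" across its wrong renderings.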

For practitioners, this is the kind of evaluation structure that makes debugging possible. If a model scores poorly, you can tell whether the issue is basic visual quality, prompt adherence, or long-range identity retention. Without that separation, you end up guessing where the pipeline is failing.

What the baseline system does

To show how the benchmark behaves, the authors propose EntityMem, a memory-augmented generation system. Its core idea is straightforward: before generation begins, it stores verified per-entity visual references in a persistent memory bank. In other words, the model does not have to rediscover what a character or object looks like every time it reappears.

That design is practical because entity consistency is often a memory problem as much as a generation problem. If the system can retrieve a trusted reference for a character, it has a better chance of reproducing the same visual identity later in the sequence. The paper presents EntityMem as a baseline, not as a final solution, but it is a clear example of how explicit memory can support long-form video generation.
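As a sketch of the general idea rather than the paper's implementation, an explicit per-entity memory might look roughly like this; the class and method names are invented for illustration.

```python
# Minimal sketch in the spirit of EntityMem: store verified per-entity references
# before generation, then condition each shot on retrieved references.
class EntityMemoryBank:
    def __init__(self):
        self._references = {}  # entity name -> verified reference image or embedding

    def register(self, entity: str, reference, verified: bool) -> None:
        """Store a reference only if it passed an external verification step."""
        if verified:
            self._references[entity] = reference

    def retrieve(self, entity: str):
        """Return the stored reference, or None if the entity has not been verified."""
        return self._references.get(entity)

def condition_shot(bank: EntityMemoryBank, scheduled_entities: list[str]) -> dict:
    """Collect references for the entities scheduled in this shot."""
    refs = {e: bank.retrieve(e) for e in scheduled_entities}
    return {e: r for e, r in refs.items() if r is not None}
```

The design choice being illustrated is that identity is resolved once, verified, and then reused, instead of being re-derived from the prompt every time the entity reappears.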

What the paper actually shows

The main result is that cross-shot entity consistency degrades sharply as recurrence distance increases in existing methods. That is the core failure EntityBench is built to expose. The further apart two appearances are, the harder it becomes for the model to preserve the same character, object, or location.

The paper also reports that explicit per-entity memory yields the highest character fidelity and presence among the methods evaluated. The reported effect size for character fidelity is Cohen’s d = +2.33. The abstract does not provide a full table of benchmark numbers, so there are no additional metrics to quote here beyond that effect size and the qualitative finding about degradation with distance.
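For context, Cohen's d standardizes the difference between two group means by their pooled standard deviation, so +2.33 indicates a very large separation between the compared methods. A minimal sketch of the computation, with made-up scores that do not reproduce the paper's number, looks like this:

```python
import statistics

def cohens_d(group_a: list[float], group_b: list[float]) -> float:
    """Cohen's d: difference of means divided by the pooled standard deviation."""
    na, nb = len(group_a), len(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    pooled_sd = (((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2)) ** 0.5
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd

# Illustrative numbers only: per-episode character-fidelity scores for a
# memory-augmented system vs. a baseline without explicit entity memory.
with_memory = [0.82, 0.74, 0.88, 0.78]
without_memory = [0.63, 0.55, 0.70, 0.60]
print(round(cohens_d(with_memory, without_memory), 2))  # large positive d, not the paper's +2.33
```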

That result is important even without a long leaderboard. It suggests that long-range consistency is not just a matter of scaling up generation quality; it may require explicit mechanisms for storing and reusing entity-specific visual references. For teams building multi-shot video systems, that points toward memory-aware architectures rather than purely prompt-driven sampling loops.

What developers should take away

If you are building or evaluating video generation systems, EntityBench gives you a more realistic stress test for narrative continuity. It is especially relevant for tools that need to handle recurring characters, repeated props, or stable locations across an episode-length sequence.

It also gives you a cleaner mental model for failure analysis. A model can fail because it renders a shot poorly, because it misses the prompt, or because it cannot keep an entity stable over time. EntityBench tries to separate those cases so you can tell whether you need better decoding, better conditioning, or some kind of persistent memory layer.

There are still open questions. The abstract does not spell out the full benchmark protocol, the exact scoring details beyond the fidelity gate, or how well EntityMem generalizes beyond the evaluated methods. It also does not claim that memory alone solves long-range video generation. What it does show is that consistency breaks down with distance, and that explicit entity memory is a promising direction.

For the broader ecosystem, that is a useful signal. As video models move from short clips to longer stories, the hard part stops being “can the model generate motion?” and becomes “can it remember the story world?” EntityBench is built to measure exactly that.