IPT helps VLMs reason about hidden space

OraCore Editors

[RSCH] | June 3, 2026 | 8 min read | OraCore Editors

Imaginative Perception Tokens improve multimodal models’ ability to reason about unseen spatial structure.

Tags: intermediate supervision, multimodal models, vision-language models, chain-of-thought, spatial reasoning

Research org: Unspecified in arXiv abstract
Core data: 20K examples
Breakthrough: Supervises intermediate perceptual tokens for alternative spatial configurations

Vision-language models are already good at recognition and many forms of visual QA, but this paper targets a harder gap: spatial reasoning when the answer depends on what is not directly visible. That matters for any system that has to infer hidden layout, reason about occlusion, or mentally rotate a scene before answering.

The paper’s core idea is simple but important: instead of forcing the model to do all of that reasoning in plain text, it gives the model a new kind of intermediate supervision called Imaginative Perception Tokens, or IPT. These tokens are meant to represent what the model would perceive from an alternative spatial viewpoint while still staying consistent with the observed image.

What problem this paper is trying to fix

Multimodal language models can look at an image and answer questions about it, but they often stumble when the necessary information is hidden, occluded, or only implied by the visible scene. The abstract calls this “imaginative perception”: inferring what would be seen from another viewpoint, tracing a path through blocked space, or combining partial observations into a coherent spatial picture.

That is a real limitation for developers building assistants that need to reason about rooms, objects, maps, diagrams, or robot environments. If the model can only talk about what is directly visible, it will miss questions that depend on geometry, viewpoint, and unobserved structure.

Rather than treating this as a pure language problem, the paper argues that spatial computation should not always be forced through text. The authors suggest that this can create a modality mismatch, especially when the reasoning target is visual or spatial rather than linguistic.

How IPT works in plain English

IPT stands for Imaginative Perception Tokens, and the role of these tokens is to externalize a model’s intermediate spatial imagination. In practice, that means the model is trained to represent what it would “see” under a different spatial configuration before it commits to an answer.

The important part is that IPT is not just a free-form explanation. It is a supervision signal designed to stay grounded in the input while exposing the hidden spatial structure the model needs for reasoning. That makes it different from a standard chain-of-thought style prompt, which may encourage the model to reason in language even when the task is fundamentally visual.

The paper studies this idea through three tasks: Perspective Taking, Path Tracing, and Multiview Counting. These tasks are meant to probe whether the model can infer unseen viewpoints, follow routes through occluded space, and count objects across multiple views.

Perspective Taking (PET): infer what would be visible from another viewpoint.
Path Tracing (PT): reason through routes in partially blocked space.
Multiview Counting (MVC): integrate multiple views to count consistently.
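One way to picture the supervision behind these three tasks is as a record that pairs each question with an intermediate imagination target as well as the final answer. The sketch below is a hypothetical schema, not the paper's data format: the class name `IPTExample`, its field names, the `validate` helper, the two-view rule for counting, and the sample values are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Task(Enum):
    # Abbreviations follow the task list above.
    PERSPECTIVE_TAKING = "PET"
    PATH_TRACING = "PT"
    MULTIVIEW_COUNTING = "MVC"


@dataclass
class IPTExample:
    """Hypothetical training record; every field name here is an assumption."""
    task: Task
    observed_views: List[str]   # paths to the input image(s)
    question: str
    imagination: str            # path to the ground-truth imagined view
    answer: str                 # final answer label


def validate(example: IPTExample) -> List[str]:
    """Return a list of problems; an empty list means the record looks usable."""
    problems = []
    if not example.observed_views:
        problems.append("no observed views")
    # Assumption: counting across views needs at least two input images.
    if example.task is Task.MULTIVIEW_COUNTING and len(example.observed_views) < 2:
        problems.append("multiview counting needs at least two views")
    if not example.imagination:
        problems.append("missing imagination target")
    if not example.answer:
        problems.append("missing answer")
    return problems


example = IPTExample(
    task=Task.MULTIVIEW_COUNTING,
    observed_views=["scene_front.png", "scene_side.png"],  # made-up file names
    question="How many chairs are in the room?",
    imagination="scene_top_down.png",
    answer="4",
)
print(validate(example))  # prints []
```

The point of the shape is that the imagination is a first-class supervised field rather than something folded into the answer text.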

To support those tasks, the authors built datasets of about 20K examples with ground-truth imaginations and answers, plus evaluation benchmarks. The abstract does not provide the full construction details, but it does make clear that the supervision is explicit and task-specific rather than borrowed from generic captioning or QA data.

What the paper actually shows

The experiments use the unified VLM BAGEL as the backbone. With IPT supervision, the model consistently improves on spatial reasoning tasks and often beats textual chain-of-thought training, even when it does not generate images at inference time.

That last part is worth pausing on. The model is not required to synthesize images during deployment to benefit from IPT. The supervision happens during training, but the learned representation still helps at inference, which makes the approach more practical for real systems.
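The contrast between supervision styles can be sketched as a choice of training target. The code below is a toy illustration, not BAGEL's actual tokenization or the authors' training code: the `<ipt>` and `</ipt>` markers, the `Mode` names, the function `build_training_target`, and the placeholder tokens are all hypothetical.

```python
from enum import Enum
from typing import List


class Mode(Enum):
    LABEL_ONLY = "label_only"  # supervise the answer alone
    TEXT_COT = "text_cot"      # supervise a written rationale, then the answer
    IPT = "ipt"                # supervise an imagined-perception span, then the answer


def build_training_target(mode: Mode, answer: List[str],
                          rationale: List[str],
                          imagination: List[str]) -> List[str]:
    """Assemble the sequence the model is trained to produce for one example."""
    if mode is Mode.LABEL_ONLY:
        return list(answer)
    if mode is Mode.TEXT_COT:
        return list(rationale) + list(answer)
    # Hypothetical markers delimit the intermediate perceptual span.
    return ["<ipt>"] + list(imagination) + ["</ipt>"] + list(answer)


answer = ["left"]
rationale = ["the", "cup", "is", "behind", "the", "box"]
imagination = ["v17", "v3", "v42"]  # stand-ins for perceptual tokens

for mode in Mode:
    print(mode.value, build_training_target(mode, answer, rationale, imagination))
```

In this framing, the imagination span exists only as a training target. At inference the model can be asked for the answer directly, which matches the article's point that no image generation is needed at deployment.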

The abstract gives a few concrete results. On Multiview Counting, IPT improves accuracy by 3.4%. On Path Tracing, it is competitive with strong closed-source models. The abstract does not include a full benchmark table, so those are the only numeric results available here.

The authors also report that combining IPT with label-only supervision gives additional gains. That suggests the method is complementary to ordinary answer supervision rather than a replacement for it.
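The abstract does not say how the two signals are combined. One common pattern, assumed here purely for illustration, is a weighted sum of an answer loss and an intermediate-target loss. The sketch below also assumes the imagination targets are discrete tokens scored with cross-entropy, which may not match the paper; the names `combined_loss` and `ipt_weight` are hypothetical.

```python
import math
from typing import Sequence


def cross_entropy(probs: Sequence[float], target_index: int) -> float:
    """Negative log-likelihood of the target class under a probability vector."""
    return -math.log(probs[target_index])


def combined_loss(answer_probs: Sequence[float], answer_index: int,
                  ipt_probs: Sequence[Sequence[float]],
                  ipt_targets: Sequence[int],
                  ipt_weight: float = 1.0) -> float:
    """Answer loss plus a weighted mean loss over imagination tokens.

    Setting ipt_weight to 0 recovers label-only supervision.
    """
    answer_term = cross_entropy(answer_probs, answer_index)
    if ipt_targets:
        per_token = [cross_entropy(p, t) for p, t in zip(ipt_probs, ipt_targets)]
        ipt_term = sum(per_token) / len(per_token)
    else:
        ipt_term = 0.0  # example carries no imagination target
    return answer_term + ipt_weight * ipt_term


# Toy numbers, not results from the paper.
label_only = combined_loss([0.5, 0.5], 0, [[0.25, 0.75]], [1], ipt_weight=0.0)
with_ipt = combined_loss([0.5, 0.5], 0, [[0.25, 0.75]], [1], ipt_weight=1.0)
print(label_only, with_ipt)
```

Treating the weight as a tunable knob makes the "complementary, not a replacement" reading concrete: the answer term is always present, and the intermediate term is added on top of it.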

By contrast, textual chain of thought can substantially degrade performance in this setting. That is a useful warning for practitioners: more verbal reasoning is not automatically better, especially when the underlying task is spatial and the model may be better served by an intermediate perceptual representation.

Why developers should care

If you are building multimodal agents, robotics helpers, diagram readers, or any system that has to reason about hidden geometry, IPT points to a training strategy that targets the failure mode directly. Instead of asking the model to narrate its way through a spatial problem, you supervise a structured intermediate representation that is closer to the actual computation the task requires.

That could matter in cases where chain-of-thought style prompting looks helpful on the surface but leads the model away from the right modality. The paper’s results suggest that for spatial tasks, language can become a bottleneck rather than a benefit.

There is also an interpretability angle. Because IPT produces intermediate perceptual representations, it may be easier to inspect than a purely textual rationale. The abstract does not claim full transparency, but it does frame the tokens as interpretable intermediate structure.

What the paper does not prove

This is still a focused study, not a general solution to multimodal reasoning. The abstract only mentions one backbone model, BAGEL, so it is not yet clear how broadly IPT transfers across architectures or datasets.

The paper also does not claim that IPT solves all forms of spatial reasoning, only that it improves a specific class of problems involving unobserved spatial structure. And while the datasets are sizable for a research benchmark at roughly 20K examples, they are still task-shaped datasets rather than broad real-world deployment data.

Another limitation is that the abstract gives only a small slice of the evaluation detail. We know IPT improves MVC by 3.4% and is competitive on PT, but we do not get the full benchmark breakdown here. So the safest reading is that IPT is a promising training signal, not a settled standard.

Bottom line

The useful takeaway is that multimodal models may need better intermediate representations, not just bigger prompts or more textual reasoning. IPT gives the model a way to “imagine” hidden spatial structure during training, and that seems to help when the answer depends on what cannot be directly seen.

For engineers, the paper is a reminder to match the supervision to the computation. If the task is spatial, forcing everything through language may be the wrong abstraction. IPT is one concrete attempt to fix that, and the early results suggest the idea is worth watching.
