Thinking in Space

How Multimodal Large Language Models See, Remember and Recall Spaces

  • VSI-Bench: We introduce a high-quality benchmark for evaluating the 3D, video-based visual-spatial intelligence of MLLMs.
  • Evaluation: We evaluate VSI-Bench on open- and closed-source MLLMs and find that MLLMs exhibit competitive, though subhuman, visual-spatial intelligence.
  • Linguistic Analysis: We attribute VSI-Bench performance to spatial intelligence capabilities and show the differences between spatial and linguistic intelligence.
  • Visual Analysis: We illuminate how MLLMs remember spaces via cognitive maps and show how explicitly remembering spaces improves spatial capabilities.

Compare your spatial intelligence abilities with Gemini!

Example task: Relative Direction

Question: If I am standing by the refrigerator and facing the washer, is the stove to my left, right, or back? An object is to my back if I would have to turn at least 135 degrees in order to face it.

Options:

  • A. Back
  • B. Right
  • C. Left


Figure 1: Can Multimodal LLMs “think spatially” when presented with a video recording of an environment? Can they build an accurate, implicit “cognitive map” that allows them to answer questions about a space? What are the strengths and limitations of using MLLMs to enhance spatial intelligence? We dig into these questions by setting up video data for MLLMs to watch, building a VQA benchmark to check their recall, and examining what the MLLMs actually remember and understand.

We present VSI-Bench, a novel video-based visual-spatial intelligence benchmark with over 5,000 question-answer pairs, and find that MLLMs exhibit competitive, though subhuman, visual-spatial intelligence. To understand this behavior, we probe the models to express how they think in space, both linguistically and visually, and find that while spatial reasoning remains the primary bottleneck to higher benchmark performance, local world models and spatial awareness do emerge within these models.



VSI-Bench

Benchmark Overview: We develop VSI-Bench, a benchmark to evaluate the visual-spatial intelligence of Multimodal LLMs (MLLMs) using over 5,000 question-answer pairs derived from 288 egocentric videos sourced from the validation sets of public indoor 3D scene reconstruction datasets ScanNet, ScanNet++, and ARKitScenes. VSI-Bench includes eight tasks under three task types: configurational, measurement estimation, and spatiotemporal. See Fig. 2 for an overview of the tasks in VSI-Bench and Fig. 3 for dataset statistics. Iteratively refined for quality, VSI-Bench provides a foundation to study the connection between MLLMs and 3D reconstruction.

Figure 2: Task demonstrations in VSI-Bench. Note: the questions above are slightly simplified for clarity and brevity.
Figure 3: Benchmark statistics. Left: The distribution of tasks across the three main categories. Right: The distribution of video lengths.

VSI-Bench Construction: We develop a robust pipeline to construct VSI-Bench that enables high-quality question-answer (QA) pair generation at scale. Starting with data collection and unification, we standardize diverse 3D indoor scene datasets into a unified meta-information format, incorporating object categories, bounding boxes, and video specifications to support dataset-agnostic QA generation. QA pairs are generated using automated annotations from meta-information and task-specific question templates, with route planning tasks manually annotated. To ensure quality, we implement a human-in-the-loop review process, iteratively refining question templates, annotations, and QA generation rules by addressing ambiguities and errors flagged by evaluators.


Figure 4: Benchmark curation pipeline. The pipeline unifies datasets into a standardized format and semantic space for consistent processing. QA pairs are then generated through both human annotation and question templates. To ensure quality, human verification is implemented at all key stages for filtering low-quality videos, annotations, and ambiguous QA pairs.
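As a concrete (and much simplified) illustration of the template-driven step of this pipeline, the sketch below instantiates a QA pair from a unified meta-information record. The field names, the counting template, and the generate_count_qa helper are our own illustrative assumptions, not the benchmark's actual schema or code.

```python
# Minimal sketch of template-based QA generation from unified meta-information.
# All field names and the template wording are assumptions for illustration.

scene_meta = {
    "scene_id": "scannet_0011",                      # hypothetical scene identifier
    "video": {"path": "scannet_0011.mp4"},
    "objects": [
        {"category": "chair", "bbox_center": [1.2, 0.4, 0.5]},
        {"category": "chair", "bbox_center": [2.1, 0.4, 1.7]},
        {"category": "table", "bbox_center": [1.6, 0.4, 1.1]},
    ],
}

# One illustrative question template for an object-counting task.
COUNT_TEMPLATE = "How many {category}(s) are in this room?"

def generate_count_qa(meta: dict, category: str) -> dict:
    """Instantiate a counting QA pair by reading the answer off the annotations."""
    answer = sum(obj["category"] == category for obj in meta["objects"])
    return {
        "scene_id": meta["scene_id"],
        "question": COUNT_TEMPLATE.format(category=category),
        "answer": answer,
    }

print(generate_count_qa(scene_meta, "chair"))
# {'scene_id': 'scannet_0011', 'question': 'How many chair(s) are in this room?', 'answer': 2}
```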

Evaluation on VSI-Bench

Evaluation Setup: We benchmark 15 video-supporting MLLMs from diverse model families. For proprietary models, we consider Gemini-1.5 and GPT-4o. For open-source models, we evaluate models from InternVL2, VILA, LongVILA, LongVA, LLaVA-OneVision, and LLaVA-NeXT-Video. All evaluations are conducted in a zero-shot setting with default prompts and greedy decoding for reproducibility. Tasks are evaluated using either Multiple-Choice Answer (MCA) accuracy or our proposed Mean Relative Accuracy (MRA) for Numerical Answer (NA) tasks.

$$\text{MRA} = \frac{1}{10} \sum_{\theta \in C} \mathbb{1}\left(\frac{| \hat{y} - y |}{y} < 1 - \theta\right),$$

where $\hat{y}$ is the model's numerical prediction, $y$ the ground-truth value, and $C = \{0.5, 0.55, \ldots, 0.95\}$ the set of ten confidence thresholds.
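A direct implementation of this metric for a single question might look like the following minimal sketch (our own code, not the official evaluation script); the threshold set matches the 1/10 normalization above.

```python
import numpy as np

# Confidence thresholds C = {0.50, 0.55, ..., 0.95}; the 1/10 factor in the
# formula corresponds to averaging the indicator over these ten thresholds.
THRESHOLDS = np.linspace(0.5, 0.95, 10)

def mean_relative_accuracy(y_pred: float, y_true: float) -> float:
    """Mean Relative Accuracy (MRA) for a single numerical-answer question."""
    relative_error = abs(y_pred - y_true) / abs(y_true)
    return float(np.mean(relative_error < (1.0 - THRESHOLDS)))

# Example: predicting 3.4 m when the ground-truth distance is 3.0 m.
print(mean_relative_accuracy(3.4, 3.0))  # relative error ~0.133 -> MRA = 0.8
```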

Baselines include random selection and frequency-based answer selection to identify performance gains due to distribution biases. Additionally, human performance is assessed on a randomly sampled subset of 400 questions (VSI-Bench tiny), with metrics compared to Gemini-1.5 Pro.
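For multiple-choice tasks, the frequency baseline can be computed directly from the answer distribution; a minimal sketch (our own, with hypothetical example data) is shown below.

```python
from collections import Counter

def frequency_baseline(ground_truth_answers: list[str]) -> tuple[str, float]:
    """Chance Level (frequency): always pick the most common ground-truth
    option for a task and report the resulting accuracy."""
    counts = Counter(ground_truth_answers)
    best_option, hits = counts.most_common(1)[0]
    return best_option, hits / len(ground_truth_answers)

# Hypothetical answer distribution for one multiple-choice task.
print(frequency_baseline(["A", "C", "B", "C", "C", "D"]))  # ('C', 0.5)
```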

Main Results: Human evaluators achieve an average accuracy of 79%, outperforming the best model by 33%, with near-perfect performance (94%-100%) on configurational and spatiotemporal tasks. The gap narrows on measurement tasks that require precise estimation, where MLLMs show relative strength. Among proprietary models, Gemini-1.5 Pro stands out: it significantly exceeds the chance baselines and approaches human performance on tasks such as absolute distance and room size estimation, despite being trained only on 2D digital data. Top-performing open-source models, such as LLaVA-NeXT-Video-72B and LLaVA-OneVision-72B, achieve competitive results, trailing Gemini-1.5 Pro by just 4%-5%. However, most open-source models (7/12) fall below the chance baselines, revealing notable deficiencies in visual-spatial intelligence.


Table 1: Evaluation on VSI-Bench. Left: Dark gray indicates the best result among all models and light gray indicates the best result among open-source models. † indicates results on VSI-Bench (tiny) set. Right: Results including the top-3 open-source models.

Blind Evaluation: We compare MLLMs' performance against "Chance Level (frequency)" and "Vision Disabled" (blind) results, averaged across six top models (three open-source and three closed-source). The consistent gains of "Vision Enabled" over "Vision Disabled" and the general degradation of "Vision Disabled" relative to chance level highlight the importance of video input for VSI-Bench: blind models perform worse than chance. However, MLLMs struggle to surpass chance level on tasks such as absolute distance estimation, route planning, and relative direction, reflecting the inherent difficulty of these tasks. Interestingly, "Vision Disabled" models significantly outperform chance on object size tasks, likely because common-sense knowledge acquired during language-model training carries over.

Figure 5: Performance comparisons between Vision Enabled (w/ video), Vision Disabled (w/o video) and Chance Level (Freq.).

How MLLMs Think in Space Linguistically

To better understand when and why models succeed or fail and to elucidate the facets of visual-spatial intelligence they possess, we examine how MLLMs think in space linguistically.

Case Studies: In the success example, the model demonstrates advanced video understanding with accurate timestamped descriptions and a correct step-by-step reasoning process. The use of a global coordinate system suggests that MLLMs may construct implicit world models by integrating spatial context and reasoning. In the error case, the model fails in egocentric-allocentric transformation, incorrectly interpreting a video sequence due to reliance on the egocentric view, leading to a flawed spatial inference.

Figure 6: Examples of how an MLLM thinks, as seen in its self-explanations.

Error Analysis: Analysis of errors from the best-performing MLLM on VSI-Bench (tiny) identifies four main error types: visual perception, linguistic intelligence, relational reasoning, and egocentric-allocentric transformation. Figure 7 reveals that 71% of errors stem from spatial reasoning, particularly in understanding distance, size, and direction. This indicates that spatial reasoning remains the key bottleneck for improving MLLM performance on VSI-Bench.

Figure 7: Human-conducted analysis of errors by type.
Finding 1: Spatial reasoning is the primary bottleneck for MLLM performance on VSI-Bench.

Limits of CoT Methods in Visuospatial Tasks: We investigate three prompting techniques to improve MLLM reasoning on VSI-Bench: Zero-Shot Chain-of-Thought (CoT), Self-Consistency with CoT, and Tree-of-Thoughts (ToT). Surprisingly, all three methods lead to performance degradation (see Fig. 8): Zero-Shot CoT and ToT reduce average performance by 4%, and Self-Consistency falls 1.1% below the baseline. While the appearance order and absolute distance estimation tasks see slight improvements due to reduced linguistic errors, the room size and object size tasks suffer large decreases of 8% to 21%, showing that encouraging a model to "think more" can be not only unreliable but actively harmful here.

Figure 8: Relative improvement of CoT, Self-Consistency, and Tree-of-Thoughts prompting over the baseline.
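For concreteness, the sketch below shows what the Zero-Shot CoT and Self-Consistency variants look like in code; the prompt wording and the query_model placeholder are assumptions rather than the exact prompts used in our evaluation, and Tree-of-Thoughts is omitted for brevity.

```python
from collections import Counter

def query_model(prompt: str, temperature: float = 0.0) -> str:
    """Placeholder for an MLLM call (video + text in, answer string out)."""
    raise NotImplementedError("Plug in an MLLM API call here.")

def zero_shot_cot(question: str) -> str:
    """Zero-Shot CoT: append a step-by-step reasoning instruction."""
    return query_model(question + "\nLet's think step by step.", temperature=0.0)

def self_consistency(question: str, n_samples: int = 5) -> str:
    """Self-Consistency with CoT: sample several reasoning paths at a higher
    temperature and take a majority vote over the final answers."""
    answers = [
        query_model(question + "\nLet's think step by step.", temperature=0.7)
        for _ in range(n_samples)
    ]
    return Counter(answers).most_common(1)[0][0]
```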

Meanwhile, as shown in Tab. 2, Zero-Shot CoT improves Gemini-1.5 Pro's performance on the general video understanding benchmark VideoMME (from 77.2 to 79.8 on a 500-question subset).

Case | Performance
Gemini-1.5 Pro (w/o CoT) | 77.2
Gemini-1.5 Pro (w/ CoT) | 79.8
Table 2: Gemini-1.5 Pro performance with and without Zero-Shot CoT on a 500-question subset of VideoMME.
Finding 2: Linguistic prompting techniques, although effective in language reasoning and general visual tasks, are primarily harmful for spatial reasoning.

How MLLMs Think in Space Visually

Since humans subconsciously build mental representations of space when reasoning spatially, we explore how MLLMs remember spaces.

Probing via Cognitive Maps: We evaluate MLLMs' ability to create cognitive maps, a framework for spatial representation, by prompting Gemini-1.5 Pro to predict object center positions within a 10 x 10 grid based on video input. Accuracy is measured by comparing pairwise object distances in the predicted map against the ground-truth map, counting deviations within one grid unit as correct. The model achieves 64% accuracy when positioning nearby objects, demonstrating strong local spatial awareness, but it struggles at larger distances, reflecting the difficulty of forming a global spatial representation from discrete video frames.

Figure 9: Left: Visualizations of cognitive maps from the MLLM and the GT. Right: Locality of the MLLM's predicted cognitive maps.
Finding 3: When remembering spaces, an MLLM forms a series of local world models in its mind from a given video, rather than a unified global model.
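Under our reading of the grid-comparison protocol described above, the scoring could be sketched as follows; the map data structure and example values are hypothetical, and the one-grid-unit tolerance follows the description above.

```python
import math
from itertools import combinations

def pairwise_distance_accuracy(predicted: dict, ground_truth: dict, tol: float = 1.0) -> float:
    """For every object pair present in both maps, compare the predicted pairwise
    distance to the ground-truth one; a pair counts as correct if the deviation
    is within `tol` grid units (a 10 x 10 canvas is assumed)."""
    shared = sorted(predicted.keys() & ground_truth.keys())
    pairs = list(combinations(shared, 2))
    if not pairs:
        return 0.0
    correct = sum(
        abs(math.dist(predicted[a], predicted[b]) - math.dist(ground_truth[a], ground_truth[b])) <= tol
        for a, b in pairs
    )
    return correct / len(pairs)

# Hypothetical maps: object name -> (row, col) grid coordinate.
pred = {"bed": (2, 3), "lamp": (2, 5), "door": (9, 1)}
gt   = {"bed": (2, 4), "lamp": (3, 5), "door": (6, 2)}
print(pairwise_distance_accuracy(pred, gt))  # 1/3 of object pairs within tolerance
```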

Better Distance Reasoning via Cognitive Maps: We explore whether cognitive maps can enhance MLLMs' spatial reasoning by prompting Gemini-1.5 Pro to generate a map from the video input and then use it to answer relative distance questions (a minimal sketch of this two-stage prompting follows Table 3). Results show a 10% accuracy improvement when the model uses its own map and a 20%-32% gain when using ground-truth maps, highlighting the value of accurate mental imagery that captures the global scene topology. This suggests cognitive mapping is a promising approach to improving MLLMs' visual-spatial reasoning.

(a) Cognitive map prompting.
Case | Rel. Dist. Acc.
w/o Cog. map | 46.0
w/ Cog. map | 56.0
w/ Cog. map (GT) | 66.0

(b) Cognitive map canvas size.
Cog. Map Src. | Size | Rel. Dist. Acc.
MLLM | 10 × 10 | 56.0
MLLM | 20 × 20 | 54.0
GT | 10 × 10 | 66.0
GT | 20 × 20 | 78.0

Table 3: Relative distance task with cognitive maps.
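The two-stage prompting referenced above can be sketched as follows; the prompt wording and the query_model placeholder are assumptions, not the exact prompts used with Gemini-1.5 Pro.

```python
def query_model(prompt: str) -> str:
    """Placeholder for an MLLM call that also receives the video."""
    raise NotImplementedError("Plug in an MLLM (video + text) API call here.")

MAP_PROMPT = (
    "Watch the video and place the center of each object on a 10 x 10 grid. "
    "Return one 'object: (row, col)' entry per object."
)

def answer_with_cognitive_map(question: str) -> str:
    """Stage 1: ask the model to remember the space as a cognitive map.
    Stage 2: ask it to answer the relative-distance question using that map."""
    cognitive_map = query_model(MAP_PROMPT)
    return query_model(
        "Here is a cognitive map of the scene:\n"
        f"{cognitive_map}\n"
        "Using this map and the video, answer the question.\n"
        f"{question}"
    )
```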

Conclusion

We study how models see, remember, and recall spaces by building VSI-Bench and investigating the performance and behavior of MLLMs on it. Our analysis of how MLLMs think in space linguistically and visually identifies existing strengths (e.g., prominent perceptual, temporal, and linguistic abilities) and bottlenecks for visual-spatial intelligence (e.g., egocentric-allocentric transformation and relational reasoning). While prevailing linguistic prompting methods fail to improve spatial reasoning, building explicit cognitive maps does enhance the spatial distance reasoning of MLLMs.

BibTeX

@article{yang2024think,
    title={{Thinking in Space: How Multimodal Large Language Models See, Remember and Recall Spaces}},
    author={Yang, Jihan and Yang, Shusheng and Gupta, Anjali W. and Han, Rilyn and Fei-Fei, Li and Xie, Saining},
    year={2024},
    journal={arXiv preprint arXiv:2412.14171},
}