VSTAT

Benchmarking Visual State Tracking in Multimodal Video Understanding

Sihyun Yu^1,2†* Nanye Ma^1†* Pinzhi Huang^1†* Hyunseok Lee^2* Shusheng Yang¹ June Suk Choi² Ellis Brown¹ Oscar Michel¹ Boyang Zheng¹ Jinwoo Shin² Saining Xie¹

¹New York University ²KAIST

† Project lead. ∗ Equal technical contribution.

Leaderboard

Ranking

Top models settle in the low 40s while humans nearly solve the benchmark — a gap of 46 points that current MLLMs have yet to close.

Human 90.5 ceiling, n=92

Best model 44.4 Gemini-3.1 Pro (low)

Chance level 37.8 frequency baseline

To include your model in the leaderboard, please email pinzhihuang23@gmail.com with evaluation logs and setups.

Benchmark

What is VSTAT?

We introduce Visual STAte Tracking benchmark (VSTAT), a video-based benchmark designed to diagnose visual state tracking in MLLMs. VSTAT consists of 834 video clips paired with 1,500 questions drawn from synthetic, self-recorded, and real-world videos in the wild that contain procedural processes.

Each task is constructed so that the answer cannot be read off any single keyframe or a few salient moments: critical events may be hidden, visually similar to each other, or distributed across multiple entities and moments. Thus, models must continuously perceive and integrate events throughout the entire video stream. The tasks in VSTAT vary in the complexity of state to be tracked, and exhibit various perceptual challenges in extracting state from video; for instance, in the Rubik’s cube task, the model must track a specific cubie even when it is occluded in some frames.

Surprisingly, while these questions can be easily answered by humans, state-of-the-art MLLMs perform far below humans and only modestly above answer-prior baselines.

834 video clips

1,500 questions

3 state elements

4 state structures

Examples

A glimpse across 39 tasks.

One representative clip per task, drawn from synthetic Blender renders, in-the-wild YouTube footage, and our own recordings. Click a card to see the question, options, and the ground-truth answer.

Why current MLLMs struggle on VSTAT.

Beyond reporting numbers, we localize where the bottleneck lies. Across two complementary studies, visual perception — not textual reasoning — emerges as the dominant cause of failure.

This gap stems from visual perception, not reasoning.

We compare the same model under two parallel conditions: the rendered video versus a hand-written textual transcription of the same visible states and events. A large gap implies the bottleneck is perception, not reasoning.

Question

How many times did the Pink face touch the floor?

Input A

Video

The model watches the actual rendered video.

Input B

Text Transcription

The model reads a hand-written transcription of the same events.

The die starts with the following three visible faces: (Red, Green, Blue).

The die is then moved as follows: roll up, roll down, roll left, …

After each move, the three visible faces become: (Green, Pink, Blue), (Red, Green, Blue), (Blue, Green, Pink) …

(a) A controlled die-rolling task. The model is asked the same question, but receives either the rendered video or a hand-written textual transcription of the same visible states and events — never both. This isolates visual perception from textual reasoning.

Accuracy on video vs text input across durations — **(b)** Result on the die-rolling task. State-of-the-art MLLMs solve the task **perfectly from text**, yet drop toward chance on the matching video — even at 5 seconds, where context length is negligible. The gap widens as the video grows longer.

State-of-the-art MLLMs solve these tasks perfectly from text, yet drop toward random-guess on video — even at 5 seconds, where context length is negligible. The perceptual gap is so severe that we had to provide the transcriptions by hand — state-of-the-art MLLMs themselves fail to transcribe even these synthetic videos into text.

Visual perception appears to be a main bottleneck for models on VSTAT.

What perceptual challenges does VSTAT contain?

We consider six categories tied to factors that make video perception difficult, illustrated below with one representative clip per category.

When do current MLLMs fail to solve VSTAT?

Comparing each video with the model’s thinking trace, we identify three recurring failure modes. More than 50% of failures stem from event recognition, suggesting the dominant bottleneck of current MLLMs lies in low-level perception, rather than visual reasoning.

Error breakdown by state type and overall error breakdown — Error breakdown of the best-performing model (Gemini-3.1 Pro) by state type (left) and overall (right). Most failures are due to errors in either **event recognition** or **entity association**; **state-update** errors are relatively limited, likely due to the advanced reasoning of state-of-the-art MLLMs.

01 · Event Recognition

The model misreads what just happened.

Even for relatively straightforward events, the model can fail to recognize them. In a cup-shuffle case, a person swaps the center and right cups, but the model identifies the event as “the Left and Right cups are swapped”, leading to an incorrect final answer. In harder cases, it may hallucinate the entire event trace.

02 · Entity Association

It loses track of who did what.

The model often fails when the state requires consistent association with a specific entity among visually similar objects. In a volleyball example, all players wear identical uniforms; the model correctly identifies each ball-touch event but assigns a new random jersey number each time, even when the same player handles the ball repeatedly.

03 · State Update

It forgets to apply the change.

The model correctly recognizes events and tracks entities, yet fails to update the state needed for the answer. During a third swap, it correctly notes that “Center and Right cups are swapped” but keeps the target cup at Center instead of moving it to Right. Common when tracking continuous trajectories.

Neither thinking nor agents fix it.

Two natural fixes — more thinking and agentic frameworks — both fail to rescue performance. Thinking budgets typically hurt on VSTAT; agents stay near chance level.

More thinking, less accuracy.

Increasing the thinking budget from low to high (or enabling thinking for instruct models) consistently degrades accuracy on VSTAT. With higher perceptual complexity, a larger thinking budget appears to increase the likelihood of hallucination.

Model	Setting	Score	Δ
Gemini-3.1 Pro	low → high	44.4 → 43.9	−1.1%
Gemini-3.0 Flash	low → high	39.8 → 38.8	−2.5%
Qwen3VL-8B	w/o → w/	33.2 → 28.2	−15.1%
InternVL3.5-8B	w/o → w/	30.6 → 30.2	−1.3%

Agents stay near chance.

We evaluate one video agent (AVP) and two coding agents (Codex with GPT-5; Claude Code with Opus 4.7) on a small n=39 subset. All sit at near-chance level, despite their strong performance on text-based tasks — reinforcing that the bottleneck is visual perception.

Method	Avg.
Chance level (Frequency, n=39)	50.8
Gemini-3.1 Pro	52.6
Gemini-3.1 Pro + AVP (video agent)	43.6
Claude Code (Opus 4.7, coding agent)	37.6
Codex GPT-5 (xhigh, coding agent)	53.4

Coding agents typically spend considerable time and tokens to answer a question — a single question takes ~30 minutes on average, largely because they produce inconsistent intermediate results that confuse the model itself. Video agents commit too early to their initial observation, sampling the video at a fixed low frame-rate (typically 1 FPS) and synthesizing an answer from a single round of evidence collection without verification.

Citation

Reference the benchmark.

@article{vstat2026,
  title  = {Benchmarking Visual State Tracking in Multimodal Video Understanding}
  author = {Sihyun Yu and Nanye Ma and Pinzhi Huang and Hyunseok Lee and Shusheng Yang and June Suk Choi and Ellis Brown and Oscar Michel and Boyang Zheng and Jinwoo Shin and Saining Xie},
  year   = {2026},
  journal = {arXiv preprint arXiv:2606.03920},
}