...to expose and mitigate exploitable non-visual shortcuts.
Robust multimodal benchmarks are the foundation of progress in Multimodal Large Language Models (MLLMs). Yet we show that many modern “vision-centric” benchmarks can be aced by models that ignore the visual input entirely, simply by exploiting non-visual shortcuts encoded in the test set's questions, answer distributions, and templates.
We advocate a simple principle for benchmark design: if a benchmark can be gamed, it will be. Rather than hoping blind baselines remain low, we propose that benchmark designers should actively train on the test set to probe its intrinsic vulnerabilities.
We introduce the Test-set Stress-Test (TsT), which uses \(k\)-fold cross-validation on non-visual test-set information to (i) estimate a benchmark's global non-visual solvability and (ii) assign sample-level bias scores \(s(x)\). We instantiate TsT with a powerful LLM-based diagnostic (TsT-LLM) and an interpretable Random Forest diagnostic (TsT-RF). Finally, we propose Iterative Bias Pruning (IBP), which uses TsT's bias scores to iteratively filter high-bias samples and construct more robust datasets.
Applied to four widely used benchmarks—VSI-Bench, CV-Bench, MMMU, and VideoMME—TsT reveals pervasive non-visual shortcuts, including gains of more than +30 percentage points in blind accuracy by learning test-set patterns alone. As a case study, we use TsT + IBP to create a debiased variant of VSI-Bench that substantially widens the vision-blind performance gap and better compels genuine visual reasoning.
Many multimodal benchmarks hide a dirty secret: models can achieve high scores without even using the visual input. We call this "blind" evaluation—giving the model the question without the image.
This is particularly problematic for vision-centric benchmarks, which are explicitly designed so that questions cannot be answered without the visual input. If a model can answer correctly by guessing "2" simply because 50% of the answers are "2", are we really measuring visual understanding?
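As a toy illustration (the items and field names below are invented, not from any real benchmark), a fully blind baseline that always outputs the globally most frequent answer needs nothing but the test set's answer distribution to look competent:

```python
from collections import Counter

# Invented multiple-choice items: no image is ever loaded, so this is a "blind" baseline.
test_set = [
    {"question": "How many chairs are in the room?",  "answer": "2"},
    {"question": "How many doors are visible?",       "answer": "2"},
    {"question": "How many windows are there?",       "answer": "3"},
    {"question": "How many tables are in the scene?", "answer": "2"},
]

# Always guess the single most common answer in the test set.
majority_answer, count = Counter(item["answer"] for item in test_set).most_common(1)[0]
blind_accuracy = count / len(test_set)

print(f"Guessing '{majority_answer}' every time scores {blind_accuracy:.0%} without seeing a single image.")
```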
"If a benchmark CAN be gamed, it WILL be."
Therefore, multimodal benchmark designers should proactively try to "game" their own benchmarks first as a key step in the development lifecycle.
The Test-set Stress-Test (TsT) is a diagnostic framework that quantifies a benchmark's non-visual exploitability.
We run \(k\)-fold cross-validation DIRECTLY on the test set, using only non-visual text inputs. This allows us to train diagnostic models that:

1. Estimate the benchmark's global non-visual solvability.
2. Assign a sample-level bias score \(s(x)\) to each question.
This effectively "trains on the test set" not to overfit, but to adversarially probe for intrinsic vulnerabilities.
We propose two complementary diagnostic approaches, both trained on non-visual features of the test set:

- **TsT-LLM**: a powerful LLM-based diagnostic.
- **TsT-RF**: an interpretable Random Forest diagnostic.
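To make the procedure concrete, here is a minimal sketch in the spirit of TsT-RF, assuming the caller has already encoded each question into a vector of non-visual features (question length, question type, answer-option statistics, and so on); the feature set, fold count, and hyperparameters here are illustrative placeholders, not the paper's exact settings:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

def tst_rf(features: np.ndarray, answers: np.ndarray, k: int = 5, seed: int = 0):
    """k-fold cross-validation run directly on the test set, using only
    non-visual features. Returns (global blind accuracy, per-sample bias scores)."""
    bias_scores = np.zeros(len(answers))
    correct = np.zeros(len(answers), dtype=bool)

    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
    for train_idx, held_idx in skf.split(features, answers):
        # Fit a blind diagnostic on k-1 folds of the *test set*...
        clf = RandomForestClassifier(n_estimators=300, random_state=seed)
        clf.fit(features[train_idx], answers[train_idx])

        # ...and score the held-out fold it has never seen.
        probs = clf.predict_proba(features[held_idx])
        preds = clf.classes_[probs.argmax(axis=1)]
        correct[held_idx] = preds == answers[held_idx]

        # Bias score s(x): out-of-fold probability assigned to the true answer.
        col = {c: i for i, c in enumerate(clf.classes_)}
        bias_scores[held_idx] = [
            p[col[a]] if a in col else 0.0
            for p, a in zip(probs, answers[held_idx])
        ]

    return correct.mean(), bias_scores
```

The out-of-fold accuracy estimates global non-visual solvability, and samples whose score \(s(x)\) is close to 1 are the ones a blind model recovers most confidently from textual patterns alone; these become the natural targets for pruning.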
You might ask: "Is it actually crucial to train on the test set specifically? Why not just use a held-out training set?"
YES! Once a benchmark is released, models can exploit the SPECIFIC artifacts present in THAT test set—idiosyncrasies of sampling, templates, or curation.
Training on a held-out "train" set only captures general domain biases (blue region). To find the shortcuts that break your specific benchmark, you must stress-test the actual test instrument itself (pink region).
We applied TsT to four widely used multimodal benchmarks (VSI-Bench, CV-Bench, MMMU, and VideoMME), exposing pervasive non-visual shortcuts:
- Blind accuracy jumped from 40% to 73% using TsT.
- Blind accuracy jumped from 25% to 56% using TsT.
- Blind accuracy improved from 35% to 44% using TsT.
- Blind accuracy improved from 35% to 42% using TsT.
Identifying the problem isn't enough. We need to fix it. We use TsT's bias scores to systematically identify and remove the most broken samples.
Iterative Bias Pruning (IBP) alternates between two steps:

1. **Score**: re-run the TsT diagnostic on the current dataset to obtain sample-level bias scores \(s(x)\).
2. **Prune**: remove the samples with the highest bias scores.
The result? A debiased benchmark subset that is much harder to game via shortcuts.
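For intuition, here is a rough sketch of the loop; the fixed round count, 5% pruning fraction, and the `score_fn` hook (e.g. the `tst_rf` sketch above) are illustrative assumptions rather than the paper's exact settings:

```python
import numpy as np

def iterative_bias_pruning(features, answers, score_fn, rounds=5, prune_fraction=0.05):
    """Alternate between (1) scoring samples with a TsT diagnostic and
    (2) pruning the highest-bias samples. Returns indices of the retained subset.

    `score_fn(features, answers)` must return (blind_accuracy, bias_scores).
    """
    keep = np.arange(len(answers))
    for round_idx in range(rounds):
        blind_acc, bias = score_fn(features[keep], answers[keep])
        print(f"round {round_idx}: retained={len(keep)}, blind accuracy={blind_acc:.1%}")

        # Drop the samples the blind diagnostic solves most confidently.
        n_prune = max(1, int(prune_fraction * len(keep)))
        keep = np.delete(keep, np.argsort(bias)[-n_prune:])
    return keep
```

Because the diagnostic is re-fit after every pruning step, patterns that remain exploitable on the smaller, shifted answer distribution are caught in later rounds, which is why pruning is done iteratively rather than in a single pass.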
By applying the full TsT + IBP procedure to VSI-Bench, we created VSI-Bench-Debiased, a variant on which scoring well depends far more heavily on genuine visual understanding.
VSI-Bench-Debiased shows a much larger vision-blind performance gap.
| Model Configuration | Vision (%) | Blind (%) | Gap (V - B) |
|---|---|---|---|
| **Original VSI-Bench** | | | |
| LLaVA-Video 7B (Base) | 36.7 | 25.9 | 10.8 |
| + VSI-Train-10k FT | 57.1 | 44.7 | 12.4 |
| **VSI-Bench-Debiased (Ours)** | | | |
| LLaVA-Video 7B (Base) | 31.3 | 20.3 | 11.0 |
| + VSI-Train-10k FT | 48.7 | 32.0 | 16.6 |
On the debiased benchmark, the fine-tuned blind model scores substantially lower (44.7% → 32.0%) and the vision-blind gap widens from 12.4 to 16.6 points, indicating that many of the non-visual shortcuts have been removed.