TsT helps multimodal benchmark designers diagnose and mitigate non-visual shortcuts in their benchmarks.
Robust multimodal benchmarks are the foundation of progress in Multimodal Large Language Models (MLLMs). Yet we show that many modern “vision-centric” benchmarks can be aced by models that ignore the visual input entirely, simply by exploiting non-visual shortcuts encoded in the test set’s questions, answer distributions, and templates.
We advocate a simple principle for benchmark design: if a benchmark can be gamed, it will be. Rather than hoping blind baselines remain low, we propose that benchmark designers should actively train on the test set to probe its intrinsic vulnerabilities.
We introduce the Test-set Stress-Test (TsT), which uses \(k\)-fold cross-validation on non-visual test-set information to (i) estimate a benchmark’s global non-visual solvability and (ii) assign sample-level bias scores \(s(x)\). We instantiate TsT with a powerful LLM-based diagnostic (TsT-LLM) and an interpretable Random Forest diagnostic (TsT-RF). Finally, we propose Iterative Bias Pruning (IBP), which uses TsT’s bias scores to iteratively filter high-bias samples and construct more robust datasets.
Applied to four widely used benchmarks—VSI-Bench, CV-Bench, MMMU, and VideoMME—TsT reveals pervasive non-visual shortcuts, including gains of more than +30 percentage points in blind accuracy by learning test-set patterns alone. As a case study, we use TsT + IBP to create a debiased variant of VSI-Bench that substantially widens the vision–blind performance gap and more reliably demands genuine visual reasoning.
Modern multimodal benchmarks have evolved from tightly controlled visual tasks to open-ended question-answering over images and videos. This increased expressivity comes with a hidden cost: it is much harder to know what is actually being measured. A model can score highly by exploiting world knowledge and textual regularities, without truly understanding the visual content.
We focus on non-visual shortcuts: cases where questions can be answered correctly without using the visual input at all. These shortcuts can come from natural world knowledge (e.g., "fridges are usually around 170–180cm tall"), or from statistical quirks of the benchmark (e.g., certain answers appearing disproportionately often, or specific templates almost always mapping to the same label). Either way, when the goal is to measure visual understanding, such patterns undermine evaluation.
To make this concrete, we discovered four types of statistical biases across real benchmarks; these patterns enable models to achieve high accuracy without any visual reasoning.
The first category of shortcuts comes from world knowledge embedded in LLMs during pretraining. As shown below, benchmarks like MMMU benefit more from scaling the LLM backbone than from enabling vision, suggesting they rely heavily on linguistic knowledge. In contrast, VSI-Bench shows negligible gains from LLM scaling in blind settings but substantial improvements when vision is enabled—demonstrating greater robustness to knowledge-based shortcuts.
Not every pattern in a dataset is a shortcut. The key question is not where a pattern comes from (world statistics vs. annotation artifacts), but what effect it has on the evaluation. If a model can exploit a pattern to answer correctly without using the visual signal, then for a vision-centric benchmark, that pattern is a problem.
For example, consider questions like "Which item is closest to the bed?" where "lamp" happens to be the correct answer far more often than chance. Even if this reflects some real-world regularity, in a benchmark that is supposed to probe spatial reasoning, it lets models answer correctly by leaning on text-only priors rather than the actual image.
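This kind of answer-distribution skew is easy to surface directly from the test set's question–answer pairs. Below is a minimal sketch; the sample fields (`template`, `question`, `options`, `answer`) are illustrative placeholders, not any benchmark's actual schema.

```python
from collections import Counter, defaultdict

def flag_skewed_templates(samples, factor=2.0):
    """Flag question templates where one answer wins far more often than chance.

    Each sample is assumed to look like:
        {"template": str, "question": str, "options": [str, ...], "answer": str}
    """
    by_template = defaultdict(list)
    for s in samples:
        by_template[s["template"]].append(s)

    for template, group in by_template.items():
        counts = Counter(s["answer"] for s in group)
        top_answer, top_count = counts.most_common(1)[0]
        share = top_count / len(group)
        chance = 1.0 / max(len(group[0]["options"]), 1)
        if share > factor * chance:
            print(f"{template}: '{top_answer}' is correct {share:.0%} of the time "
                  f"(chance is roughly {chance:.0%})")
```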
Litmus test: if a blind model can reach high accuracy on a “vision” benchmark, its scores no longer reliably reflect visual understanding.
We distinguish between two failure modes: training failures, where a model picks up shortcuts during its own training, and evaluation failures, where the benchmark itself can be solved without the visual input. TsT specifically targets evaluation failures: it asks how much of the test set can be solved by learning patterns in the test questions and answers alone.
At a high level, TsT performs \(k\)-fold cross-validation directly on the benchmark’s test set, using only non-visual information (text, metadata, templates). For each fold, we train a blind diagnostic model on the remaining folds and evaluate it on the held-out fold. Every test example is thus predicted by a model that has not seen that example during training, but has seen the rest of the test set.
TsT produces two key outputs: (1) a global estimate of the benchmark's non-visual solvability, i.e., how much of it a blind model can solve, and (2) a sample-level bias score \(s(x)\) for every test question, ranking how vulnerable each one is to non-visual shortcuts.
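A minimal sketch of this procedure is shown below. The `fit_blind_diagnostic` routine, the `predict_with_confidence` method, and the `answer` field are placeholders standing in for whichever blind diagnostic and data schema you use; they are not the paper's exact interface.

```python
import numpy as np
from sklearn.model_selection import KFold

def tst(samples, fit_blind_diagnostic, k=5, seed=0):
    """Test-set Stress-Test: k-fold cross-validation over the *test set* itself,
    using only non-visual inputs (question text, options, metadata).

    Returns (global blind accuracy, per-sample bias scores s(x)).
    """
    samples = np.asarray(samples, dtype=object)
    correct = np.zeros(len(samples), dtype=bool)
    bias_scores = np.zeros(len(samples))

    for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=seed).split(samples):
        # Train a blind diagnostic on k-1 folds (questions and answers only, no vision).
        model = fit_blind_diagnostic(samples[train_idx])
        for i in test_idx:
            # Every example is predicted by a model that never saw it during training.
            pred, p_correct = model.predict_with_confidence(samples[i])
            correct[i] = pred == samples[i]["answer"]
            bias_scores[i] = p_correct  # one natural choice of s(x): confidence in the gold answer

    return float(correct.mean()), bias_scores
```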
We instantiate TsT with two complementary diagnostics.
TsT-LLM uses a strong language model (e.g., Qwen2.5-7B) as the diagnostic. For each fold, we LoRA-tune the LLM on question-only inputs from the training folds and evaluate on held-out questions. This requires no hand-designed features and can capture both simple statistical patterns and complex knowledge-based shortcuts.
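For concreteness, here is a rough sketch of how a single fold's diagnostic could be set up with Hugging Face `transformers` and `peft`. The model checkpoint, LoRA hyperparameters, and prompt format are assumptions rather than the exact recipe used in the paper, and the fine-tuning loop itself is omitted.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "Qwen/Qwen2.5-7B-Instruct"  # assumed diagnostic backbone
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype="auto", device_map="auto")

# Wrap the LLM in LoRA adapters so each fold can be tuned cheaply and then discarded.
lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)

def to_prompt(sample):
    # Question-only input: the visual content is never shown to the diagnostic.
    options = "\n".join(sample["options"])  # field names are illustrative
    return f"Question: {sample['question']}\nOptions:\n{options}\nAnswer:"

# Train on prompts plus gold answers from the k-1 training folds, then score the
# held-out fold by comparing generated answers to the gold labels.
```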
On template-based benchmarks like CV-Bench and VSI-Bench, TsT-LLM dramatically increases blind accuracy: from 40.1 → 73.4 on CV-Bench and 25.0 → 56.4 on VSI-Bench, revealing +33.3 and +31.4 point gains purely from learning test-set text. Even on more heterogeneous benchmarks like MMMU and VideoMME, TsT-LLM finds sizeable gains of +8.6 and +6.4 points.
TsT-RF uses a Random Forest classifier trained on lightweight, human-interpretable features (e.g., answer frequencies, template IDs, question length, lexical indicators). While less expressive than TsT-LLM, it is CPU-friendly and provides direct insight into which patterns the diagnostic is exploiting, via feature importances.
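A minimal sketch of a TsT-RF-style diagnostic with scikit-learn is below; the feature set is an illustrative subset rather than the paper's exact feature list, and the field names are placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

def tst_rf(samples, k=5):
    """Blind Random-Forest diagnostic over lightweight, non-visual features."""
    def featurize(s):
        return [
            hash(s["template_id"]) % 10_000,   # which question template (hashed to an int)
            len(s["question"].split()),        # question length in words
            len(s["options"]),                 # number of answer options
            int("closest" in s["question"]),   # example lexical indicator
        ]

    X = np.array([featurize(s) for s in samples])
    y = np.array([s["answer"] for s in samples])

    clf = RandomForestClassifier(n_estimators=300, random_state=0)
    # Out-of-fold predictions: each example is scored by a forest trained on the other folds.
    preds = cross_val_predict(clf, X, y, cv=k)
    blind_acc = float((preds == y).mean())

    # Feature importances explain *which* non-visual patterns the forest exploits.
    importances = clf.fit(X, y).feature_importances_
    return blind_acc, dict(zip(["template", "q_len", "n_options", "has_closest"], importances))
```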
Together, TsT-LLM and TsT-RF deliver both strong detection of shortcut behavior and actionable explanations of how benchmark structure contributes to non-visual solvability.
TsT does more than say “your benchmark has shortcuts”. Its sample-level bias scores \(s(x)\) provide a ranking of which questions are most vulnerable. Iterative Bias Pruning (IBP) turns this into a systematic procedure for improving a benchmark.
Concretely, IBP alternates between (1) running TsT to compute bias scores \(s(x)\) on the current dataset, (2) removing or otherwise mitigating the highest-scoring samples, and (3) re-running the diagnostic on the reduced set, repeating until the remaining non-visual solvability is acceptably low.
IBP is agnostic to the specific diagnostic (TsT-LLM or TsT-RF) and to the mitigation action (pruning, rewriting, rebalancing). In this work, we focus on pruning as a proof-of-concept.
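Under those choices, a minimal sketch of the IBP loop might look like the following, reusing the `tst` routine sketched above; the pruning fraction and stopping threshold are illustrative, not prescribed values.

```python
def iterative_bias_pruning(samples, fit_blind_diagnostic, prune_frac=0.05,
                           target_blind_acc=0.30, max_rounds=10):
    """Iteratively drop the most bias-prone samples until the blind diagnostic
    can no longer exploit the remaining test set."""
    samples = list(samples)
    for _ in range(max_rounds):
        blind_acc, scores = tst(samples, fit_blind_diagnostic)  # re-diagnose each round
        if blind_acc <= target_blind_acc:
            break
        # Rank by bias score s(x) and prune the top fraction of offenders.
        n_prune = max(1, int(prune_frac * len(samples)))
        keep = sorted(range(len(samples)), key=lambda i: scores[i])[:-n_prune]
        samples = [samples[i] for i in keep]
    return samples
```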
As a concrete demonstration, we apply TsT + IBP to VSI-Bench, a spatial reasoning benchmark. TsT-LLM shows that a blind model can gain over 30 points of accuracy by training on test-set questions alone, indicating strong non-visual shortcuts.
IBP uses TsT-RF bias scores to prune shortcut-prone questions, producing a VSI-Bench-Debiased variant. We then re-evaluate LLaVA-Video-7B before and after fine-tuning on additional in-distribution data (all numbers are accuracy in %; \(\Delta_{V-B}\) is the vision–blind gap):
| Model | Vis. | Blind | \(\Delta_{V-B}\) | Vis. (Debiased) | Blind (Debiased) | \(\Delta_{V-B}\) (Debiased) |
|---|---|---|---|---|---|---|
| LLaVA-Video 7B (base) | 36.7 | 25.9 | 10.8 | 31.3 | 20.3 | 11.0 |
| + VSI-Train-10k FT | 57.1 | 44.7 | 12.4 | 48.7 | 32.0 | 16.6 |
This case study illustrates TsT’s full lifecycle: diagnose non-visual shortcuts, compute sample-level bias scores, prune the worst offenders, and re-evaluate to confirm that visual reasoning, not text-only priors, drives progress.
Beyond VSI-Bench, TsT reveals pervasive shortcuts across three additional benchmarks: CV-Bench, MMMU, and VideoMME. In each case, TsT-LLM significantly improves blind accuracy simply by training on the test questions and answers.
TsT-LLM results: blind accuracy climbs from 40.1 → 73.4 on CV-Bench and 25.0 → 56.4 on VSI-Bench, with additional gains of +8.6 on MMMU and +6.4 on VideoMME — all without using any visual input.
These findings highlight that shortcut behavior is not an isolated issue in any single dataset, but a structural risk across diverse benchmark designs, including template-based, human-authored, and LLM-generated questions.
TsT is meant to be a practical tool for anyone designing or maintaining multimodal benchmarks. Our analysis distills into a set of actionable guidelines for benchmark design, all grounded in one principle:
Benchmark designers should “train on the test set” — not to inflate scores, but to adversarially audit evaluation instruments and ensure that reported progress reflects genuine multimodal understanding.
@article{brown2025shortcuts,
author = {Brown, Ellis and Yang, Jihan and Yang, Shusheng and Fergus, Rob and Xie, Saining},
title = {Benchmark Designers Should ``Train on the Test Set'' to Expose Exploitable Non-Visual Shortcuts},
journal = {arXiv preprint arXiv:2511.04655},
year = {2025}
}