Test-set Stress-Test (TsT)

Benchmark Designers Should "Train on the Test Set"
to Expose Exploitable Non-Visual Shortcuts

TsT helps multimodal benchmark designers diagnose and mitigate non-visual shortcuts in their benchmarks.

Principle. If a benchmark can be gamed, it will be, so designers should proactively try to "game" their own benchmarks first.
TsT Diagnostic. Fine-tune a powerful LLM via \(k\)-fold cross-validation on only the non-visual, textual inputs of the test set to reveal the performance attainable from shortcuts alone.
Interpretable Audit. A lightweight Random Forest version (TsT-RF) with hand-crafted features shows which lexical and structural patterns drive shortcuts.
Iterative Bias Pruning. Per-sample bias scores \(s(x)\) enable iterative pruning of shortcut-prone questions.
Real Benchmarks. On VSI-Bench, CV-Bench, MMMU, and VideoMME, TsT exposes blind gains of more than +30 percentage points and yields a debiased VSI-Bench with a much larger vision-blind gap.
Figure: Test-set Stress-Test illustration.

Abstract

Robust multimodal benchmarks are the foundation of progress in Multimodal Large Language Models (MLLMs). Yet we show that many modern “vision-centric” benchmarks can be aced by models that ignore the visual input entirely, simply by exploiting non-visual shortcuts encoded in the test set’s questions, answer distributions, and templates.

We advocate a simple principle for benchmark design: if a benchmark can be gamed, it will be. Rather than hoping blind baselines remain low, we propose that benchmark designers should actively train on the test set to probe its intrinsic vulnerabilities.

We introduce the Test-set Stress-Test (TsT), which uses \(k\)-fold cross-validation on non-visual test-set information to (i) estimate a benchmark’s global non-visual solvability and (ii) assign sample-level bias scores \(s(x)\). We instantiate TsT with a powerful LLM-based diagnostic (TsT-LLM) and an interpretable Random Forest diagnostic (TsT-RF). Finally, we propose Iterative Bias Pruning (IBP), which uses TsT’s bias scores to iteratively filter high-bias samples and construct more robust datasets.

Applied to four widely used benchmarks—VSI-Bench, CV-Bench, MMMU, and VideoMME—TsT reveals pervasive non-visual shortcuts, including gains of more than +30 percentage points in blind accuracy by learning test-set patterns alone. As a case study, we use TsT + IBP to create a debiased variant of VSI-Bench that substantially widens the vision–blind performance gap and better compels genuine visual reasoning.

Overview

Modern multimodal benchmarks have evolved from tightly controlled visual tasks to open-ended question-answering over images and videos. This increased expressivity comes with a hidden cost: it is much harder to know what is actually being measured. A model can score highly by exploiting world knowledge and textual regularities, without truly understanding the visual content.

We focus on non-visual shortcuts: cases where questions can be answered correctly without using the visual input at all. These shortcuts can come from natural world knowledge (e.g., "fridges are usually around 170–180cm tall"), or from statistical quirks of the benchmark (e.g., certain answers appearing disproportionately often, or specific templates almost always mapping to the same label). Either way, when the goal is to measure visual understanding, such patterns undermine evaluation.

Figure: The Evolving Landscape of Visual Understanding Benchmarks. As benchmarks evolved from controlled, narrow tasks to open-ended VQA, they gained expressivity but became vulnerable to non-visual shortcuts. Language-driven evaluation enables flexible querying but risks models exploiting linguistic patterns rather than visual understanding.

Statistical Biases Create Non-Visual Shortcuts

To make this concrete, here are four types of statistical biases we discovered across real benchmarks. These patterns enable models to achieve high accuracy without visual reasoning:

Figure: (a) Counting: VSI-Bench shows severe answer skew—over 50% of questions have ground truth ≤3, enabling high accuracy by always guessing "2". (b) Spatial Relation: In CV-Bench depth, certain categories like "keyboard" and "clothes" appear as the correct answer 100% of the time. (c) Appearance Order: "Clock" appears in the 4th position in 100% of VSI-Bench questions where it appears. (d) Size Estimation: Room sizes cluster around typical dimensions (log μ ≈ 17m²), making them predictable without seeing the room.
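To see how cheaply such skews translate into blind accuracy, here is a minimal sketch that scores a majority-guess baseline per question template. The column names (`template`, `answer`) are illustrative placeholders, not any benchmark's actual schema.

```python
# Minimal sketch: estimate how far answer skew alone gets a blind model.
# Assumes a pandas DataFrame with hypothetical columns "template" and "answer".
import pandas as pd

def majority_baseline_accuracy(df: pd.DataFrame) -> float:
    """Accuracy of always guessing the most frequent answer within each template."""
    hits = 0
    for _, group in df.groupby("template"):
        hits += group["answer"].value_counts().iloc[0]  # count of the modal answer
    return hits / len(df)

# Toy counting split skewed toward small counts, mimicking the pattern in panel (a).
toy = pd.DataFrame({
    "template": ["object_count"] * 10,
    "answer":   ["2", "2", "2", "3", "2", "1", "2", "3", "2", "4"],
})
print(f"Blind majority-guess accuracy: {majority_baseline_accuracy(toy):.0%}")  # 60%
```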

Non-Visual Shortcuts Undermine Multimodal Evaluation

Knowledge-Based Shortcuts

The first category of shortcuts comes from world knowledge embedded in LLMs during pretraining. As shown below, benchmarks like MMMU benefit more from scaling the LLM backbone than from enabling vision, suggesting they rely heavily on linguistic knowledge. In contrast, VSI-Bench shows negligible gains from LLM scaling in blind settings but substantial improvements when vision is enabled—demonstrating greater robustness to knowledge-based shortcuts.

Figure: Knowledge-based shortcuts in multimodal benchmarks. Blind (red squares) vs. vision-enabled (blue circles) performance across LLaVA-OneVision model scales. MMMU shows substantial gains from scaling the LLM backbone (x-axis) but minimal improvement from enabling vision (y-axis), indicating reliance on linguistic knowledge. VSI-Bench demonstrates the opposite pattern—large vision gains with negligible blind scaling—confirming robustness to knowledge-based shortcuts. VideoMME shows roughly equal gains from both sources, while CV-Bench benefits more from vision but still exhibits significant gains from LLM scaling.

Statistical Shortcuts

Not every pattern in a dataset is a shortcut. The key question is not where a pattern comes from (world statistics vs. annotation artifacts), but what effect it has on the evaluation. If a model can exploit a pattern to answer correctly without using the visual signal, then for a vision-centric benchmark, that pattern is a problem.

For example, consider questions like "Which item is closest to the bed?" where "lamp" happens to be the correct answer far more often than chance. Even if this reflects some real-world regularity, in a benchmark that is supposed to probe spatial reasoning, it lets models answer correctly by leaning on text-only priors rather than the actual image.

Litmus test: if a blind model can reach high accuracy on a “vision” benchmark, its scores no longer reliably reflect visual understanding.

We distinguish between two failure modes: modeling failures, where a model leans on shortcuts even though the visual evidence is available, and evaluation failures, where the benchmark itself rewards non-visual strategies.

TsT specifically targets evaluation failures: it asks how much of the test set can be solved by learning patterns in the test questions and answers alone.

The Test-set Stress-Test (TsT) Framework

At a high level, TsT performs \(k\)-fold cross-validation directly on the benchmark’s test set, using only non-visual information (text, metadata, templates). For each fold, we train a blind diagnostic model on the remaining folds and evaluate it on the held-out fold. Every test example is thus predicted by a model that has not seen that example during training, but has seen the rest of the test set.

Figure: (Left) TsT directly probes biases intrinsic to the specific test set (pink region), rather than approximating them via external training data. (Right) The test set is partitioned into \(k\) folds, a blind diagnostic model is trained on \(k{-}1\) folds and evaluated on the held-out fold, and this is repeated until all samples are covered. Aggregating across folds yields both a global non-visual solvability estimate and per-sample bias scores \(s(x)\).

TsT produces two key outputs: a global estimate of how much of the benchmark is solvable without any visual input, and a per-sample bias score \(s(x)\) that ranks individual questions by how shortcut-prone they are.
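The loop itself is straightforward. Below is a schematic sketch of the cross-validated procedure using scikit-learn; `make_diagnostic` stands in for either blind diagnostic, and the bias score shown (the out-of-fold probability the diagnostic assigns to the ground-truth answer) is one plausible instantiation of \(s(x)\), assumed here for illustration.

```python
# Schematic TsT loop: k-fold cross-validation on the test set's non-visual inputs only.
import numpy as np
from sklearn.model_selection import StratifiedKFold

def tst(features, labels, make_diagnostic, k=5, seed=0):
    features, labels = np.asarray(features), np.asarray(labels)
    bias_scores = np.zeros(len(labels))           # per-sample s(x)
    correct = np.zeros(len(labels), dtype=bool)   # out-of-fold correctness
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
    for train_idx, held_idx in skf.split(features, labels):
        model = make_diagnostic()
        model.fit(features[train_idx], labels[train_idx])    # train blind on k-1 folds
        proba = model.predict_proba(features[held_idx])      # evaluate on the held-out fold
        pred = model.classes_[proba.argmax(axis=1)]
        correct[held_idx] = pred == labels[held_idx]
        # s(x): confidence the blind diagnostic places on the true answer of x
        # (assumes every answer class occurs in each training split)
        col = {c: j for j, c in enumerate(model.classes_)}
        truth_cols = [col[y] for y in labels[held_idx]]
        bias_scores[held_idx] = proba[np.arange(len(held_idx)), truth_cols]
    return correct.mean(), bias_scores   # global non-visual solvability, per-sample s(x)
```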

TsT-LLM and TsT-RF: Two Complementary Diagnostics

We instantiate TsT with two complementary diagnostics.

TsT-LLM: Power from the Same Model Class

TsT-LLM uses a strong language model (e.g., Qwen2.5-7B) as the diagnostic. For each fold, we LoRA-tune the LLM on question-only inputs from the training folds and evaluate on held-out questions. This requires no hand-designed features and can capture both simple statistical patterns and complex knowledge-based shortcuts.
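A minimal sketch of what a single TsT-LLM fold could look like with Hugging Face transformers and peft is shown below; the prompt format, LoRA hyperparameters, and target modules are illustrative assumptions rather than the exact recipe.

```python
# One fold of a TsT-LLM-style diagnostic: LoRA-tune a blind LLM on question -> answer
# text from the training folds. Hyperparameters and prompt format are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "Qwen/Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")

lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)

def format_example(question: str, options: list[str], answer: str) -> str:
    # Text-only prompt: the image/video is deliberately withheld.
    opts = "\n".join(f"{chr(65 + i)}. {o}" for i, o in enumerate(options))
    return f"Question: {question}\n{opts}\nAnswer: {answer}"

# From here, train on the k-1 folds with any causal-LM trainer (e.g. transformers.Trainer
# with the standard next-token loss), then score the held-out fold by generating answers
# or by comparing option log-likelihoods.
```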

On template-based benchmarks like CV-Bench and VSI-Bench, TsT-LLM dramatically increases blind accuracy: from 40.1 → 73.4 on CV-Bench and 25.0 → 56.4 on VSI-Bench, revealing +33.3 and +31.4 point gains purely from learning test-set text. Even on more heterogeneous benchmarks like MMMU and VideoMME, TsT-LLM finds sizeable gains of +8.6 and +6.4 points.

TsT-RF: Fast and Interpretable

TsT-RF uses a Random Forest classifier trained on lightweight, human-interpretable features (e.g., answer frequencies, template IDs, question length, lexical indicators). While less expressive than TsT-LLM, it is CPU-friendly and provides direct insight into which patterns the diagnostic is exploiting, via feature importances.
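A sketch of a TsT-RF-style fold with scikit-learn; the specific features below are examples of the kind of lightweight signals mentioned above (template identity, question length, lexical cues), not the exact feature set.

```python
# TsT-RF-style diagnostic: a Random Forest over lightweight, hand-crafted features.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline

def featurize(sample: dict) -> dict:
    # Illustrative features; a real audit would tailor these to the benchmark.
    question = sample["question"]
    return {
        "template_id": sample.get("template_id", "unknown"),
        "question_length": len(question.split()),
        "mentions_clock": "clock" in question.lower(),   # lexical indicator
        "first_option": sample["options"][0],
    }

pipeline = make_pipeline(
    DictVectorizer(sparse=False),
    RandomForestClassifier(n_estimators=300, random_state=0),
)
# pipeline.fit([featurize(s) for s in train_folds], [s["answer"] for s in train_folds])
# Feature importances then reveal which patterns the blind diagnostic exploits:
# rf, vec = pipeline.named_steps["randomforestclassifier"], pipeline.named_steps["dictvectorizer"]
# top = sorted(zip(vec.get_feature_names_out(), rf.feature_importances_),
#              key=lambda t: -t[1])[:10]
```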

Together, TsT-LLM and TsT-RF deliver both strong detection of shortcut behavior and actionable explanations of how benchmark structure contributes to non-visual solvability.

Iterative Bias Pruning (IBP)

TsT does more than say “your benchmark has shortcuts”. Its sample-level bias scores \(s(x)\) provide a ranking of which questions are most vulnerable. Iterative Bias Pruning (IBP) turns this into a systematic procedure for improving a benchmark.

Figure: IBP loop. At each iteration, TsT is re-run to compute \(s(x)\) on the current dataset, a batch of the most shortcut-prone samples is removed or revised, and the process repeats until non-visual solvability drops below a threshold or a removal budget is reached.

Concretely, IBP (1) runs TsT on the current dataset to compute per-sample bias scores \(s(x)\), (2) removes or revises a batch of the most shortcut-prone samples, and (3) repeats until non-visual solvability drops below a target threshold or a removal budget is exhausted.

IBP is agnostic to the specific diagnostic (TsT-LLM or TsT-RF) and to the mitigation action (pruning, rewriting, rebalancing). In this work, we focus on pruning as a proof-of-concept.
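For concreteness, here is a sketch of the pruning loop, assuming a `run_tst` callable that returns the global non-visual solvability and per-sample scores as in the TsT sketch above; the threshold, batch size, and removal budget are illustrative knobs.

```python
# Iterative Bias Pruning, as a sketch. `run_tst(samples)` is assumed to return
# (global_non_visual_solvability, per_sample_bias_scores); knobs are illustrative.
import numpy as np

def iterative_bias_pruning(samples, run_tst, solvability_target=0.30,
                           batch_size=50, removal_budget=0.25):
    samples = list(samples)
    max_removed = int(removal_budget * len(samples))
    removed = 0
    while True:
        solvability, scores = run_tst(samples)   # re-run TsT on the current dataset
        if solvability <= solvability_target or removed >= max_removed:
            break
        # Drop the current batch of most shortcut-prone samples (highest s(x)).
        worst = np.argsort(scores)[::-1][:batch_size]
        keep = sorted(set(range(len(samples))) - set(worst.tolist()))
        samples = [samples[i] for i in keep]
        removed += len(worst)
    return samples
```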

Case Study: Debiasing VSI-Bench

As a concrete demonstration, we apply TsT + IBP to VSI-Bench, a spatial reasoning benchmark. TsT-LLM shows that a blind model can gain over 30 points of accuracy by training on test-set questions alone, indicating strong non-visual shortcuts.

IBP uses TsT-RF bias scores to prune shortcut-prone questions and produces a VSI-Bench-Debiased variant. We then re-evaluate LLaVA-Video-7B before and after fine-tuning on additional in-distribution data:

Model | Vis. | Blind | \(\Delta_{V-B}\) | Vis. (Debiased) | Blind (Debiased) | \(\Delta_{V-B}\) (Debiased)
LLaVA-Video 7B (base) | 36.7 | 25.9 | 10.8 | 31.3 | 20.3 | 11.0
+ VSI-Train-10k FT | 57.1 | 44.7 | 12.4 | 48.7 | 32.0 | 16.6
Table: On the original VSI-Bench, fine-tuning boosts both vision and blind scores, masking how much progress is due to text-only shortcuts. On VSI-Bench-Debiased, the blind score falls more sharply, creating a significantly larger vision–blind gap and better reflecting true visual gains.

This case study illustrates TsT’s full lifecycle: diagnose non-visual shortcuts, compute sample-level bias scores, prune the worst offenders, and re-evaluate to confirm that visual reasoning, not text-only priors, drives progress.

Diagnostics Across Four Benchmarks

Beyond VSI-Bench, TsT reveals pervasive shortcuts across three additional benchmarks: CV-Bench, MMMU, and VideoMME. In each case, TsT-LLM significantly improves blind accuracy simply by training on the test questions and answers.

TsT-LLM results: blind accuracy climbs from 40.1 → 73.4 on CV-Bench and 25.0 → 56.4 on VSI-Bench, with additional gains of +8.6 on MMMU and +6.4 on VideoMME — all without using any visual input.

These findings highlight that shortcut behavior is not an isolated issue in any single dataset, but a structural risk across diverse benchmark designs, including template-based, human-authored, and LLM-generated questions.

Takeaways for Benchmark Designers

TsT is meant to be a practical tool for anyone designing or maintaining multimodal benchmarks. From our analysis, we propose a set of actionable guidelines:

Benchmark designers should “train on the test set” — not to inflate scores, but to adversarially audit evaluation instruments and ensure that reported progress reflects genuine multimodal understanding.

BibTeX

@article{brown2025shortcuts,
  author = {Brown, Ellis and Yang, Jihan and Yang, Shusheng and Fergus, Rob and Xie, Saining},
  title = {Benchmark Designers Should ``Train on the Test Set'' to Expose Exploitable Non-Visual Shortcuts},
  journal = {arXiv preprint arXiv:2511.04655},
  year = {2025}
}