pith. machine review for the scientific record.

arxiv: 2604.14799 · v1 · submitted 2026-04-16 · 💻 cs.CL · cs.CV

Recognition: unknown

Knowing When Not to Answer: Evaluating Abstention in Multimodal Reasoning Systems

Authors on Pith no claims yet

Pith reviewed 2026-05-10 11:39 UTC · model grok-4.3

classification 💻 cs.CL cs.CV
keywords abstention · multimodal reasoning · vision-language models · multi-agent systems · unanswerable questions · MM-AQA · evidence sufficiency

The pith

Multimodal models rarely abstain from answering when evidence is insufficient; closing this gap requires abstention-aware training rather than better prompting or additional agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how vision-language models and multi-agent systems handle cases where multimodal questions lack sufficient visual or textual evidence. It introduces the MM-AQA benchmark to generate unanswerable questions from answerable ones by altering visual dependency and evidence levels. Tests on frontier models show that standard prompting produces almost no abstentions, while multi-agent setups increase abstention but reduce accuracy. Sequential agent designs perform similarly to iterative ones, pointing to confidence miscalibration as the main issue. The authors conclude that dedicated training for recognizing evidence gaps is needed instead of relying on prompting or scaling agents.

Core claim

Under standard prompting, vision-language models rarely abstain even when image or text evidence is absent or contradictory. Multi-agent systems raise abstention rates but create an accuracy trade-off, with sequential architectures matching or exceeding iterative ones. Models abstain readily when evidence is missing yet attempt to reconcile degraded or conflicting evidence. Effective multimodal abstention therefore depends on abstention-aware training rather than improved prompting or more agents.

What carries the argument

The MM-AQA benchmark, which creates unanswerable instances from answerable ones through controlled transformations along visual modality dependency and evidence sufficiency.
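As a concrete illustration of the evidence-sufficiency axis, one such controlled transformation (corrupting numeric tokens in the evidence so the question's target value becomes unrecoverable) can be sketched as follows. The function names, parameters, and data layout are assumptions for illustration, not the paper's implementation:

```python
# Hedged sketch of an MM-AQA-style transformation: turn an answerable
# instance into an unanswerable counterpart by degrading its evidence.
import random

def corrupt_numeric_tokens(text: str, rate: float = 0.8, seed: int = 0) -> str:
    """Perturb most numeric tokens so the original values are unrecoverable."""
    rng = random.Random(seed)
    out = []
    for tok in text.split():
        if tok.replace(".", "", 1).isdigit() and rng.random() < rate:
            out.append(str(rng.randint(0, 999)))  # swap in an unrelated number
        else:
            out.append(tok)
    return " ".join(out)

def make_unanswerable(sample: dict) -> dict:
    """Keep the question intact but corrupt the evidence it depends on."""
    return {
        "question": sample["question"],
        "evidence": corrupt_numeric_tokens(sample["evidence"]),
        "label": "unanswerable",  # ground truth: the model should abstain
    }

sample = {"question": "What is the 2019 revenue?",
          "evidence": "Revenue was 412.5 in 2019 and 398.1 in 2018.",
          "label": "answerable"}
print(make_unanswerable(sample)["label"])  # unanswerable
```

The paper's pipeline additionally filters candidates through a VLM quality-control module and human annotators; this sketch covers only the transformation step.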

If this is right

  • Frontier VLMs will continue to answer rather than abstain on unanswerable questions under ordinary prompting.
  • Multi-agent systems can raise abstention rates but will lower overall accuracy unless the trade-off is addressed.
  • Sequential multi-agent designs are sufficient for abstention gains, so added iterative reasoning depth is not required.
  • Models will abstain when evidence is absent but will attempt answers when evidence is present yet degraded or contradictory.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Calibration techniques developed for text-only abstention could be tested directly on multimodal inputs to reduce the observed trade-off.
  • Deployed systems could track abstention frequency on live data as a practical reliability signal.
  • The benchmark approach could be applied to other modalities such as audio or video to check whether the same abstention patterns hold.
  • Hybrid training objectives that jointly optimize accuracy and appropriate abstention might lessen the need for post-hoc prompting adjustments.
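The live-monitoring suggestion above can be sketched as a rolling-window abstention-rate tracker. The class design and alert thresholds here are illustrative assumptions, not anything the paper specifies:

```python
# Sketch: track how often a deployed model abstains on live traffic and
# flag drift in either direction. Thresholds are assumed, not from the paper.
from collections import deque

class AbstentionMonitor:
    def __init__(self, window: int = 1000, low: float = 0.02, high: float = 0.30):
        self.events = deque(maxlen=window)  # rolling window of abstain/answer events
        self.low, self.high = low, high

    def record(self, abstained: bool) -> None:
        self.events.append(bool(abstained))

    def rate(self) -> float:
        return sum(self.events) / len(self.events) if self.events else 0.0

    def status(self) -> str:
        r = self.rate()
        if r < self.low:
            return "suspiciously-low"   # model may be answering everything
        if r > self.high:
            return "suspiciously-high"  # model may be over-refusing
        return "ok"

mon = AbstentionMonitor(window=100)
for _ in range(90):
    mon.record(False)
for _ in range(10):
    mon.record(True)
print(mon.rate(), mon.status())  # 0.1 ok
```

A near-zero rate on traffic known to contain unanswerable queries would mirror the paper's finding that standard prompting yields almost no abstentions.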

Load-bearing premise

The synthetic transformations that turn answerable questions into unanswerable ones produce failure modes that match how models actually encounter insufficient evidence in practice.

What would settle it

Run the same models on a collection of naturally occurring unanswerable multimodal questions gathered from real-world sources and compare their abstention rates and error patterns to the results on the transformed benchmark.
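One hedged way to operationalize that comparison: measure abstention rates on both collections and test whether they differ, for instance with a two-proportion z-test. The record format and counts below are fabricated purely for illustration:

```python
# Sketch: compare a model's abstention rate on the transformed benchmark
# against naturally occurring unanswerable questions. Field names and the
# z-test formulation are assumptions, not the paper's protocol.
import math

def abstention_rate(records):
    return sum(r["abstained"] for r in records) / len(records)

def two_proportion_z(p1, n1, p2, n2):
    """z statistic for the difference between two abstention rates."""
    p = (p1 * n1 + p2 * n2) / (n1 + n2)       # pooled rate
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se if se else 0.0

synthetic = [{"abstained": i < 30} for i in range(100)]   # 30% abstain (made up)
natural   = [{"abstained": i < 10} for i in range(100)]   # 10% abstain (made up)
z = two_proportion_z(abstention_rate(synthetic), 100,
                     abstention_rate(natural), 100)
print(round(z, 2))  # 3.54 — a large |z| suggests the failure modes differ
```

A per-transformation breakdown of error patterns (not just rates) would be the stronger version of this check, as the referee report below also notes.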

Figures

Figures reproduced from arXiv: 2604.14799 by Alexandre Lacoste, Nishanth Madhusudhan, Vikas Yadav.

Figure 1
Figure 1. Overview of MM-AQA: (A) Benchmark construction: answerable instances are transformed into unanswerable counterparts along two axes, then filtered by a Dual-Consensus VLM QC module and human annotators, yielding 2079 samples. (B) Evaluation framework: standalone VLM and MAS are evaluated under a 3 × 2 condition × clause design; responses are categorised by a five-way confusion matrix and four metrics are … view at source ↗
Figure 2
Figure 2. Taxonomy and transformation distribution for unanswerable samples across both benchmarks. Left: Abstain-MMMU; Right: Abstain-MMLongBench-Doc. view at source ↗
Figure 3
Figure 3. A walkthrough example of the MM-AQA pipeline, illustrated via Abstain-MMMU curation; Abstain-MMLongBench-Doc follows the same process. view at source ↗
Figure 4
Figure 4. Abstention performance across all evaluated configurations on MM-AQA (avg MCC across A-MMMU and A-MMLBD; baselines at oracle τ). Markers denote system type: circles = standalone VLM/baseline, squares = MAS-Sequential, diamonds = MAS-Iterative. Colors denote model: green = Qwen 2.5-32B-VL, orange = GPT-5, purple = Claude Sonnet 4.5, blue = confidence and reasoning baselines, gray = degenerate anchors. The … view at source ↗
Figure 5
Figure 5. Transformation Category - Missing Visual Info. Transformation type - Aggressive Multi Mask. Left: Original MMMU Sample; Right: Transformed Sample. view at source ↗
Figure 6
Figure 6. Transformation Category - Occlusion Ambiguity. Transformation type - Strong Blur with Edge Mask. Left: Original MMMU Sample; Right: Transformed Sample. view at source ↗
Figure 7
Figure 7. Transformation Category - Evidence Removal. Transformation type - Cross Page Evidence. Original question: “What is the performance of the InstructGPT model with Self-Ask in the closed-book setting on the dataset with the highest ProgramFC retrieval recall at 10? Please write down the answer in float format with 1 decimal.” view at source ↗
Figure 8
Figure 8. Transformation Category - Evidence Corruption. Transformation type - Corrupt Numeric Tokens. Left: Original MMLBD Sample; Right: Transformed Sample. view at source ↗
read the original abstract

Effective abstention (EA), recognizing evidence insufficiency and refraining from answering, is critical for reliable multimodal systems. Yet existing evaluation paradigms for vision-language models (VLMs) and multi-agent systems (MAS) assume answerability, pushing models to always respond. Abstention has been studied in text-only settings but remains underexplored multimodally; current benchmarks either ignore unanswerability or rely on coarse methods that miss realistic failure modes. We introduce MM-AQA, a benchmark that constructs unanswerable instances from answerable ones via transformations along two axes: visual modality dependency and evidence sufficiency. Evaluating three frontier VLMs spanning closed and open-source models and two MAS architectures across 2079 samples, we find: (1) under standard prompting, VLMs rarely abstain; even simple confidence baselines outperform this setup, (2) MAS improves abstention but introduces an accuracy-abstention trade-off, (3) sequential designs match or exceed iterative variants, suggesting the bottleneck is miscalibration rather than reasoning depth, and (4) models abstain when image or text evidence is absent, but attempt reconciliation with degraded or contradictory evidence. Effective multimodal abstention requires abstention-aware training rather than better prompting or more agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MM-AQA, a benchmark that generates unanswerable multimodal QA instances from answerable ones via transformations along visual modality dependency and evidence sufficiency axes. It evaluates three frontier VLMs (closed and open-source) and two MAS architectures on 2079 samples, reporting that standard prompting yields rare abstention (outperformed by confidence baselines), MAS improves abstention at the cost of an accuracy trade-off, sequential MAS matches or exceeds iterative variants, and models abstain on absent evidence but reconcile on degraded/contradictory evidence. The central claim is that effective multimodal abstention requires abstention-aware training rather than improved prompting or additional agents.

Significance. If the benchmark transformations produce realistic failure modes, the work offers a useful empirical demonstration of current limitations in VLM and MAS abstention behavior, supported by concrete numbers across multiple models and 2079 samples. This strengthens the case for shifting focus toward training-based solutions in reliable multimodal systems. The broad model coverage and explicit comparison of prompting vs. MAS vs. baselines are positive empirical contributions.

major comments (2)
  1. [§3 (Benchmark Construction)] The transformations along the two axes are used to create the unanswerable instances that underpin all four findings and the abstract's central claim, yet no validation is provided (e.g., human annotation of real user queries, comparison to deployment logs, or distributional analysis) showing these synthetic cases match naturally occurring insufficient-evidence scenarios. This directly affects the generalizability of the conclusion that prompting and MAS are inadequate.
  2. [§4–5 (Results and Analysis)] The reported findings lack per-transformation error breakdowns or detailed analysis of the 2079 samples, making it difficult to verify support for claims such as 'models abstain when image or text evidence is absent, but attempt reconciliation with degraded or contradictory evidence' and the accuracy-abstention trade-off in MAS.
minor comments (2)
  1. [Abstract and §1] The abstract and §1 would benefit from explicit mention of the exact model names, the answerable/unanswerable split within the 2079 samples, and the precise definition of 'abstention' used in scoring.
  2. [Figures/Tables (MAS evaluation)] Figure or table captions for the MAS architectures could clarify the distinction between sequential and iterative designs to aid readers in interpreting the result that sequential variants perform comparably.
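To make the scoring question in the minor comments concrete, here is a hedged sketch of what a five-way outcome label, an MCC on the abstain/answer decision, and an oracle-τ confidence baseline could look like. The category names, metric choice, and threshold grid are assumptions; the paper's exact definitions are not reproduced here:

```python
# Sketch of abstention scoring: five-way outcome labels per response, plus a
# confidence-threshold baseline evaluated at the MCC-maximizing (oracle) tau.
# All names and the grid are illustrative assumptions.

def outcome(answerable: bool, abstained: bool, correct: bool) -> str:
    if answerable:
        if abstained:
            return "over-abstain"
        return "correct-answer" if correct else "wrong-answer"
    return "correct-abstain" if abstained else "hallucinated-answer"

def mcc(tp, tn, fp, fn):
    """Matthews correlation for the binary abstain/answer decision."""
    num = tp * tn - fp * fn
    den = ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5
    return num / den if den else 0.0

def oracle_tau(confidences, answerable_flags, grid=None):
    """Abstain when confidence < tau; return the tau maximizing MCC,
    treating abstain-on-unanswerable as the positive class."""
    grid = grid or [i / 20 for i in range(21)]
    best = (0.0, -1.0)
    for tau in grid:
        tp = tn = fp = fn = 0
        for c, ans in zip(confidences, answerable_flags):
            abstain = c < tau
            if not ans and abstain: tp += 1      # correctly abstained
            elif ans and not abstain: tn += 1    # correctly answered
            elif ans and abstain: fp += 1        # over-abstained
            else: fn += 1                        # answered the unanswerable
        score = mcc(tp, tn, fp, fn)
        if score > best[1]:
            best = (tau, score)
    return best

confs = [0.9, 0.8, 0.2, 0.3, 0.85, 0.1]
answerable = [True, True, False, False, True, False]
tau, score = oracle_tau(confs, answerable)
print(tau, round(score, 2))  # 0.35 1.0 on this toy, perfectly separable split
```

On real model confidences the split is not separable, which is why even this simple baseline reportedly outperforms standard prompting only at an oracle-chosen τ.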

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We address each major comment point by point below, indicating where revisions will be made to strengthen the work.

read point-by-point responses
  1. Referee: §3 (Benchmark Construction): The transformations along the two axes are used to create the unanswerable instances that underpin all four findings and the abstract's central claim, yet no validation is provided (e.g., human annotation of real user queries, comparison to deployment logs, or distributional analysis) showing these synthetic cases match naturally occurring insufficient-evidence scenarios. This directly affects generalizability of the conclusion that prompting and MAS are inadequate.

    Authors: We acknowledge that the benchmark relies on synthetic transformations without direct empirical validation against real-world unanswerable queries or deployment data. The two axes (visual modality dependency and evidence sufficiency) were chosen to systematically instantiate common multimodal failure modes documented in prior VLM literature, such as missing visual evidence or insufficient/contradictory support. In the revised manuscript, we will expand Section 3 with an explicit limitations subsection discussing the synthetic nature of the data, the design rationale for each transformation, and the need for future human-validated or log-based benchmarks. This qualifies the generalizability of our conclusions without altering the reported empirical results on the constructed instances. revision: partial

  2. Referee: §4–5 (Results and Analysis): The reported findings lack per-transformation error breakdowns or detailed analysis of the 2079 samples, making it difficult to verify support for claims such as 'models abstain when image or text evidence is absent, but attempt reconciliation with degraded or contradictory evidence' and the accuracy-abstention trade-off in MAS.

    Authors: We agree that finer-grained breakdowns would improve verifiability of the claims. The current analysis reports aggregate statistics and provides qualitative examples illustrating abstention on absent evidence versus reconciliation attempts on degraded or contradictory evidence. In the revision, we will add per-transformation tables (in the main text or appendix) showing abstention rates, accuracy, and error types for each level of the two axes across the 2079 samples. This will directly support the specific behavioral claims and provide additional detail on the MAS accuracy-abstention trade-off. The underlying data supports these computations. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark with direct evaluations

full rationale

The paper is an empirical study that constructs the MM-AQA benchmark via explicit transformations along visual dependency and evidence sufficiency axes, then reports direct model evaluations on 2079 samples across VLMs and MAS architectures. No equations, derivations, fitted parameters, or predictions appear in the provided text. Central claims rest on observed abstention rates, accuracy-abstention trade-offs, and qualitative behaviors under different prompting and agent setups. These are measured outcomes, not reductions of any result to its own inputs by construction. Self-citations, if present, are not load-bearing for the benchmark construction or findings. The work therefore stands on direct measurement rather than circular construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper rests on standard assumptions in AI benchmarking and evaluation; no free parameters, ad-hoc axioms, or invented entities are introduced.

axioms (1)
  • [standard math] Standard prompting and evaluation metrics for accuracy and abstention rate apply to multimodal models.
    Used to compare model behavior under different conditions.

pith-pipeline@v0.9.0 · 5522 in / 1128 out tokens · 45102 ms · 2026-05-10T11:39:53.582055+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MedVIGIL: Evaluating Trustworthy Medical VLMs Under Broken Visual Evidence

    cs.CV 2026-05 unverdicted novelty 6.0

    MedVIGIL introduces a clinician-supervised benchmark showing medical VLMs frequently give fluent answers on broken visual evidence, with top models 14 points below human radiologists on the composite score.

Reference graph

Works this paper leans on

24 extracted references · 2 canonical work pages · cited by 1 Pith paper
