Knowing When Not to Answer: Evaluating Abstention in Multimodal Reasoning Systems
Pith reviewed 2026-05-10 11:39 UTC · model grok-4.3
The pith
Multimodal models rarely abstain from answering when evidence is insufficient and require abstention-aware training rather than better prompting or additional agents.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under standard prompting, vision-language models rarely abstain even when image or text evidence is absent or contradictory. Multi-agent systems raise abstention rates but create an accuracy trade-off, with sequential architectures matching or exceeding iterative ones. Models abstain readily when evidence is missing yet attempt to reconcile degraded or conflicting evidence. Effective multimodal abstention therefore depends on abstention-aware training rather than improved prompting or more agents.
What carries the argument
The MM-AQA benchmark, which creates unanswerable instances from answerable ones through controlled transformations along visual modality dependency and evidence sufficiency.
If this is right
- Frontier VLMs will continue to answer rather than abstain on unanswerable questions under ordinary prompting.
- Multi-agent systems can raise abstention rates but will lower overall accuracy unless the trade-off is addressed.
- Sequential multi-agent designs are sufficient for abstention gains, so added iterative reasoning depth is not required.
- Models will abstain when evidence is absent but will attempt answers when evidence is present yet degraded or contradictory.
Where Pith is reading between the lines
- Calibration techniques developed for text-only abstention could be tested directly on multimodal inputs to reduce the observed trade-off.
- Deployed systems could track abstention frequency on live data as a practical reliability signal.
- The benchmark approach could be applied to other modalities such as audio or video to check whether the same abstention patterns hold.
- Hybrid training objectives that jointly optimize accuracy and appropriate abstention might lessen the need for post-hoc prompting adjustments.
Load-bearing premise
The synthetic transformations that turn answerable questions into unanswerable ones produce failure modes that match how models actually meet insufficient evidence in practice.
What would settle it
Run the same models on a collection of naturally occurring unanswerable multimodal questions gathered from real-world sources and compare their abstention rates and error patterns to the results on the transformed benchmark.
Original abstract
Effective abstention (EA), recognizing evidence insufficiency and refraining from answering, is critical for reliable multimodal systems. Yet existing evaluation paradigms for vision-language models (VLMs) and multi-agent systems (MAS) assume answerability, pushing models to always respond. Abstention has been studied in text-only settings but remains underexplored multimodally; current benchmarks either ignore unanswerability or rely on coarse methods that miss realistic failure modes. We introduce MM-AQA, a benchmark that constructs unanswerable instances from answerable ones via transformations along two axes: visual modality dependency and evidence sufficiency. Evaluating three frontier VLMs spanning closed and open-source models and two MAS architectures across 2079 samples, we find: (1) under standard prompting, VLMs rarely abstain; even simple confidence baselines outperform this setup, (2) MAS improves abstention but introduces an accuracy-abstention trade-off, (3) sequential designs match or exceed iterative variants, suggesting the bottleneck is miscalibration rather than reasoning depth, and (4) models abstain when image or text evidence is absent, but attempt reconciliation with degraded or contradictory evidence. Effective multimodal abstention requires abstention-aware training rather than better prompting or more agents.
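The abstract does not say how its "simple confidence baselines" are built. As a minimal sketch of one common form, assuming each model output carries a scalar confidence in [0, 1], the snippet below answers only when that confidence clears a threshold. The `Prediction` type, the `ABSTAIN` sentinel, and the 0.5 threshold are all hypothetical and not taken from the paper.

```python
from dataclasses import dataclass

ABSTAIN = "ABSTAIN"  # hypothetical sentinel for "refuse to answer"


@dataclass
class Prediction:
    answer: str
    confidence: float  # assumed to lie in [0, 1]; the paper's signal is not specified


def answer_or_abstain(pred: Prediction, threshold: float = 0.5) -> str:
    """Return the answer only when confidence clears the (assumed) threshold."""
    return pred.answer if pred.confidence >= threshold else ABSTAIN


preds = [Prediction("B", 0.91), Prediction("C", 0.32)]
decisions = [answer_or_abstain(p, threshold=0.5) for p in preds]
print(decisions)
```

Raising the threshold trades answered questions for abstentions, which is one simple way to see the accuracy-abstention trade-off the findings describe.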
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MM-AQA, a benchmark that generates unanswerable multimodal QA instances from answerable ones via transformations along visual modality dependency and evidence sufficiency axes. It evaluates three frontier VLMs (closed and open-source) and two MAS architectures on 2079 samples, reporting that standard prompting yields rare abstention (outperformed by confidence baselines), MAS improves abstention at the cost of an accuracy trade-off, sequential MAS matches or exceeds iterative variants, and models abstain on absent evidence but reconcile on degraded/contradictory evidence. The central claim is that effective multimodal abstention requires abstention-aware training rather than improved prompting or additional agents.
Significance. If the benchmark transformations produce realistic failure modes, the work offers a useful empirical demonstration of current limitations in VLM and MAS abstention behavior, supported by concrete numbers across multiple models and 2079 samples. This strengthens the case for shifting focus toward training-based solutions in reliable multimodal systems. The broad model coverage and explicit comparison of prompting vs. MAS vs. baselines are positive empirical contributions.
major comments (2)
- [§3 (Benchmark Construction)] The transformations along the two axes are used to create the unanswerable instances that underpin all four findings and the abstract's central claim, yet no validation is provided (e.g., human annotation of real user queries, comparison to deployment logs, or distributional analysis) showing these synthetic cases match naturally occurring insufficient-evidence scenarios. This directly affects generalizability of the conclusion that prompting and MAS are inadequate.
- [§4–5 (Results and Analysis)] The reported findings lack per-transformation error breakdowns or detailed analysis of the 2079 samples, making it difficult to verify support for claims such as 'models abstain when image or text evidence is absent, but attempt reconciliation with degraded or contradictory evidence' and the accuracy-abstention trade-off in MAS.
minor comments (2)
- [Abstract and §1] The abstract and §1 would benefit from explicit mention of the exact model names, the answerable/unanswerable split within the 2079 samples, and the precise definition of 'abstention' used in scoring.
- [Figures/Tables (MAS evaluation)] Figure or table captions for the MAS architectures could clarify the distinction between sequential and iterative designs to aid readers in interpreting the result that sequential variants perform comparably.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback on our manuscript. We address each major comment point by point below, indicating where revisions will be made to strengthen the work.
- Referee: §3 (Benchmark Construction): The transformations along the two axes are used to create the unanswerable instances that underpin all four findings and the abstract's central claim, yet no validation is provided (e.g., human annotation of real user queries, comparison to deployment logs, or distributional analysis) showing these synthetic cases match naturally occurring insufficient-evidence scenarios. This directly affects generalizability of the conclusion that prompting and MAS are inadequate.
  Authors: We acknowledge that the benchmark relies on synthetic transformations without direct empirical validation against real-world unanswerable queries or deployment data. The two axes (visual modality dependency and evidence sufficiency) were chosen to systematically instantiate common multimodal failure modes documented in prior VLM literature, such as missing visual evidence or insufficient/contradictory support. In the revised manuscript, we will expand Section 3 with an explicit limitations subsection discussing the synthetic nature of the data, the design rationale for each transformation, and the need for future human-validated or log-based benchmarks. This qualifies the generalizability of our conclusions without altering the reported empirical results on the constructed instances.
  Revision: partial
- Referee: §4–5 (Results and Analysis): The reported findings lack per-transformation error breakdowns or detailed analysis of the 2079 samples, making it difficult to verify support for claims such as 'models abstain when image or text evidence is absent, but attempt reconciliation with degraded or contradictory evidence' and the accuracy-abstention trade-off in MAS.
  Authors: We agree that finer-grained breakdowns would improve verifiability of the claims. The current analysis reports aggregate statistics and provides qualitative examples illustrating abstention on absent evidence versus reconciliation attempts on degraded or contradictory evidence. In the revision, we will add per-transformation tables (in the main text or appendix) showing abstention rates, accuracy, and error types for each level of the two axes across the 2079 samples. This will directly support the specific behavioral claims and provide additional detail on the MAS accuracy-abstention trade-off. The underlying data supports these computations.
  Revision: yes
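The per-transformation breakdown requested by the referee could be computed along the following lines. This is a sketch only: the record layout, the `ABSTAIN` sentinel, and the choice to count a correct abstention as a correct prediction are assumptions, since the paper's exact scoring definition is not given here.

```python
from collections import defaultdict

ABSTAIN = "ABSTAIN"  # hypothetical label; also used as gold for unanswerable items


def breakdown(records):
    """Group records by transformation and report abstention rate and accuracy.

    Each record is a dict with keys "transformation", "gold", "prediction".
    """
    stats = defaultdict(lambda: {"n": 0, "abstained": 0, "correct": 0})
    for r in records:
        s = stats[r["transformation"]]
        s["n"] += 1
        s["abstained"] += r["prediction"] == ABSTAIN
        s["correct"] += r["prediction"] == r["gold"]
    return {
        name: {
            "abstention_rate": s["abstained"] / s["n"],
            "accuracy": s["correct"] / s["n"],
        }
        for name, s in stats.items()
    }


records = [
    {"transformation": "none", "gold": "A", "prediction": "A"},
    {"transformation": "none", "gold": "B", "prediction": ABSTAIN},
    {"transformation": "aggressive_crop", "gold": ABSTAIN, "prediction": ABSTAIN},
    {"transformation": "aggressive_crop", "gold": ABSTAIN, "prediction": "C"},
]
table = breakdown(records)
print(table)
```

Reporting both columns per transformation makes it possible to see where a system over-abstains on answerable items and where it answers unanswerable ones.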
Circularity Check
No circularity: empirical benchmark with direct evaluations
The paper is an empirical study that constructs the MM-AQA benchmark via explicit transformations along visual dependency and evidence sufficiency axes, then reports direct model evaluations on 2079 samples across VLMs and MAS architectures. No equations, derivations, fitted parameters, or predictions appear in the provided text. Central claims rest on observed abstention rates, accuracy-abstention trade-offs, and qualitative behaviors under different prompting and agent setups. These are measured outcomes, not reductions of any result to its own inputs by construction. Self-citations, if present, are not load-bearing for the benchmark construction or findings. The work is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math: Standard prompting and evaluation metrics for accuracy and abstention rate apply to multimodal models.
Forward citations
Cited by 1 Pith paper
- MedVIGIL: Evaluating Trustworthy Medical VLMs Under Broken Visual Evidence
  MedVIGIL introduces a clinician-supervised benchmark showing medical VLMs frequently give fluent answers on broken visual evidence, with top models 14 points below human radiologists on the composite score.
Reference graph
Works this paper leans on
-
[1]
Association for Computational Linguistics. URL https://aclanthology.org/2025.coling-main.627/.
Minesh Mathew, Dimosthenis Karatzas, and C. V. Jawahar. DocVQA: A dataset for VQA on document images, 2021. URL https://arxiv.org/abs/2007.00398.
Pranav Rajpurkar, Robin Jia, and Percy Liang. Know what you don’t know: Unanswerable questions for SQuAD. In ACL, 20...
-
[2]
Aggressive crop: removes the outermost 50% from a randomly selected edge (top/bottom/left/right), eliminating axis labels, legends, or annotations depending on selection
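A minimal sketch of the aggressive crop, assuming the image is held as a plain list of pixel rows so the example has no dependencies. The function name and the `fraction` and `rng` parameters are illustrative, not the authors' implementation.

```python
import random


def aggressive_crop(image, edge=None, fraction=0.5, rng=random):
    """Drop `fraction` of the image from one edge; `image` is a list of pixel rows."""
    height, width = len(image), len(image[0])
    # Pick the edge at random unless the caller fixes it.
    edge = edge or rng.choice(["top", "bottom", "left", "right"])
    cut_rows, cut_cols = int(height * fraction), int(width * fraction)
    if edge == "top":
        return [row[:] for row in image[cut_rows:]]
    if edge == "bottom":
        return [row[:] for row in image[: height - cut_rows]]
    if edge == "left":
        return [row[cut_cols:] for row in image]
    if edge == "right":
        return [row[: width - cut_cols] for row in image]
    raise ValueError(f"unknown edge: {edge}")


image = [[r * 4 + c for c in range(4)] for r in range(4)]
cropped = aggressive_crop(image, edge="left")
print(cropped)
```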
-
[3]
Aggressive multi mask: 20 randomly placed black rectangles, each covering approximately 15% of image area; enforces ≥70% total coverage
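The coverage constraint can be sketched as follows. The source does not say how the ≥70% floor is enforced, so this example assumes extra rectangles are added until the floor is met; the rectangle aspect ratio (matching the image) and the fixed seed are likewise assumptions.

```python
import random


def multi_mask(height, width, n_rects=20, rect_area=0.15, min_coverage=0.70, seed=0):
    """Return (mask, coverage); mask[y][x] is True where the pixel is blacked out."""
    rng = random.Random(seed)
    scale = rect_area ** 0.5  # each side scaled so the area is ~rect_area of the image
    rect_h = max(1, round(height * scale))
    rect_w = max(1, round(width * scale))
    mask = [[False] * width for _ in range(height)]
    covered, placed, total = 0, 0, height * width
    # Assumption: keep placing rectangles past n_rects if coverage is still too low.
    while placed < n_rects or covered / total < min_coverage:
        top = rng.randint(0, height - rect_h)
        left = rng.randint(0, width - rect_w)
        for y in range(top, top + rect_h):
            for x in range(left, left + rect_w):
                if not mask[y][x]:
                    mask[y][x] = True
                    covered += 1
        placed += 1
    return mask, covered / total


mask, coverage = multi_mask(40, 60)
print(f"coverage = {coverage:.2f}")
```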
-
[4]
Full overlay + Label mask: semi-transparent gray overlay (α≈0.6) applied to the full image, followed by targeted edge masking of detected titles, captions, axis labels, and legends.
(2) Occlusion Ambiguity. Preserves image structure while blocking interpretability.
-
[5]
Partial occlusion: 9 solid black bars (alternating horizontal and vertical, width W/10 to W/5) placed at regular intervals
-
[6]
Adaptive darkness: all pixel values divided by 15, corresponding to an exposure reduction of approximately 3.9 stops; image remains non-empty but visually irrecoverable
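The 3.9-stop figure follows from the definition of a stop as a factor of two in exposure: dividing by 15 is a reduction of log2(15) stops. The sketch below checks that arithmetic and applies the division to a toy grayscale image; the use of integer floor division is an assumption about how pixel values are rounded.

```python
import math


def darken(image, divisor=15):
    """Divide every pixel value by `divisor` (floor division assumed)."""
    return [[pixel // divisor for pixel in row] for row in image]


stops = math.log2(15)  # one stop halves the exposure
print(f"{stops:.2f} stops")
print(darken([[255, 30, 0]]))
```

With 8-bit input, the brightest possible output value is 255 // 15 = 17, which is why the image is non-empty yet effectively unreadable.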
-
[7]
Blur with edge mask: Gaussian blur (radius 18) applied to ≈85% of the image, followed by edge masking targeting axis labels and figure titles.
(3) Semantic Unanswerability. Modifies the question while leaving all images unchanged; requires non-trivial reasoning to detect because image content is intact.
-
[8]
Image-question mismatch: question is rewritten to ask about an entirely different phenomenon that the image cannot address
-
[9]
Missing context: question is rewritten to require comparative or extrinsic information unavailable from the image
-
[10]
Impossible visual task: question is rewritten to require precise measurements, hidden properties, or sub-pixel distinctions not feasible from the image
-
[11]
Nonexistent reference: a specific entity not present in the image is introduced into the question.
All variants use constrained LLM rewriting with three validation gates: (i) output length within 120% of input; (ii) semantic divergence confirmed via embedding distance; (iii) grammatical structure check via dependency parse.
Preprint. Under review.
(4) ...
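The three validation gates for rewritten questions can be sketched as a single predicate. Everything beyond the gate descriptions is assumed: length is measured in characters, divergence is cosine distance with a hypothetical 0.2 floor, and the embedding model and dependency-parse check are injected as callables because the source does not name them.

```python
import math


def cosine_distance(u, v):
    """1 minus cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def passes_gates(original, rewritten, embed, is_grammatical,
                 max_length_ratio=1.2, min_distance=0.2):
    """Apply gates (i) length, (ii) semantic divergence, (iii) grammaticality."""
    length_ok = len(rewritten) <= max_length_ratio * len(original)
    divergence_ok = cosine_distance(embed(original), embed(rewritten)) >= min_distance
    return length_ok and divergence_ok and is_grammatical(rewritten)


# Toy stand-ins for a real embedding model and a dependency-parse check.
def toy_embed(text):
    return [1.0, 0.0] if "peak" in text else [0.0, 1.0]


def toy_grammar(text):
    return text.endswith("?")


original = "What is the peak value?"
accepted = passes_gates(original, "What is the mean value?", toy_embed, toy_grammar)
rejected = passes_gates(original, original * 2, toy_embed, toy_grammar)
print(accepted, rejected)
```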
-
[12]
Multiple correct answers: question is reworded to introduce a conflicting constraint or alternative interpretive convention, such as specifying that a given parameter may be treated under two distinct standards, such that two or more answer options become simultaneously defensible
-
[13]
Ambiguous reference: question is reinvented to ask about non-existent or fabricated visual elements within the same domain
-
[14]
Contradictory premise: question is reworded to introduce a premise that is logically inconsistent with one or more of the given conditions or constraints. The resulting contradiction makes it impossible to derive a self-consistent solution.
D.2 A-MMLBD Transformations
Nine unique transformations organized into four families targeting evidence-structure di...
-
[15]
Remove visual element: random black rectangular masks are placed in and around the center of the evidence pages where visual evidence density is highest
-
[16]
Blur figure region: a heavy Gaussian blur is applied to the figure or chart region on the evidence page, rendering its visual features, including axis labels, data trends, and color distinctions, unrecognizable. If OCR-based feature detection fails, a fallback blur is applied to the entire page
-
[17]
Hide evidence page: 50% of the evidence pages of a question are hidden and the remaining 50% are vertically cropped (at midpoint).
(2) Evidence Corruption. Preserves document structure while altering the semantic content required for answering.
-
[18]
Remove numeric token: numeric values appearing in a table, chart, or text region of evidence pages are surgically erased (along with random text tokens), leaving the surrounding context intact
-
[19]
Corrupt numeric token: numeric values in a table, chart, or text region of evidence pages are surgically corrupted (along with erasure of random text tokens), introducing a subtle factual error that cannot be detected from the question text alone
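The two numeric-token transformations can be illustrated on plain text. Note the assumptions: the source applies them to regions of evidence pages, whereas this sketch works on already-extracted text, omits the accompanying erasure of random text tokens, and uses a hypothetical one-digit perturbation as the "subtle" corruption.

```python
import random
import re

NUMBER = re.compile(r"\d+(?:\.\d+)?")  # integers and simple decimals only


def remove_numeric_tokens(text, placeholder=""):
    """Erase every numeric token, leaving the surrounding words untouched."""
    return NUMBER.sub(placeholder, text)


def corrupt_numeric_tokens(text, seed=0):
    """Change exactly one digit in every numeric token (assumed corruption rule)."""
    rng = random.Random(seed)

    def perturb(match):
        token = match.group()
        positions = [i for i, ch in enumerate(token) if ch.isdigit()]
        i = rng.choice(positions)
        new_digit = rng.choice([d for d in "0123456789" if d != token[i]])
        return token[:i] + new_digit + token[i + 1:]

    return NUMBER.sub(perturb, text)


text = "Revenue was 12.5 million in 2021."
removed = remove_numeric_tokens(text, placeholder="___")
corrupted = corrupt_numeric_tokens(text)
print(removed)
print(corrupted)
```

Because only one digit per token changes, the corrupted text keeps its length and layout, which mirrors the point that the error cannot be spotted from the question alone.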
-
[20]
Column merge ambiguity: two or more table columns are merged into a single column with an ambiguous composite header, obscuring the schema structure required to correctly interpret the tabular data.
(3) Semantic Contradiction. Injects conflicting information to create ambiguity.
-
[21]
Contradictory caption: an LLM-generated note containing information that directly contradicts the question content is appended at the end.
(4) Inferential Impossibility. Modifies the question scope to require information outside the document’s coverage.
-
[22]
Counterfactual Distractor Rows: One or more plausible but factually incorrect rows are inserted into a table on the evidence pages, introducing distractor entries that share structural similarity with the ground-truth row
-
[23]
Temporal Drift: the temporal reference in the question, such as a fiscal year or reporting period, is shifted to a time period outside the range covered by the document.
E Transformation Selection, Balancing, and Preliminary Verification
Scoring and routing. From both benchmark subsets, each transformation is scored against the q...
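A temporal drift of the kind described can be sketched with a year substitution. The regular expression, the `covered_years` argument, and the one-year offset are illustrative assumptions; the source does not say how the out-of-range period is chosen.

```python
import re

YEAR = re.compile(r"\b(?:19|20)\d{2}\b")  # four-digit years from 1900 to 2099


def temporal_drift(question, covered_years, offset=1):
    """Replace each year in the question with one just past the document's range."""
    out_of_range = max(covered_years) + offset
    return YEAR.sub(str(out_of_range), question)


question = "What was total revenue in fiscal year 2021?"
drifted = temporal_drift(question, covered_years=range(2019, 2023))
print(drifted)
```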
-
[24]
How many claims are with the highest percentage of reasoning steps in the author’s proposed dataset?
The provided dynamics are not applicable to the moment being asked about.
Category 4: Adversarial Ambiguity. In this category, images are not transformed.
Subject: Geography. Original question: “There is a square five-pile cap as shown in Figure 4-36. The known conditions are as follows: the cushion cap thickness is 1.2 m, the effective height is h0 = 1050 ...