pith. machine review for the scientific record.

arxiv: 2605.01911 · v2 · submitted 2026-05-03 · 💻 cs.CV

Recognition: unknown

SurgCheck: Do Vision-Language Models Really Look at Images in Surgical VQA?

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 17:25 UTC · model grok-4.3

classification 💻 cs.CV
keywords: surgical VQA · vision-language models · linguistic shortcuts · bias evaluation · visual reasoning · diagnostic benchmark · action prediction · target prediction

The pith

Vision-language models for surgical VQA often answer from question wording rather than analyzing the image.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SurgCheck to measure how much surgical VQA models depend on linguistic patterns in questions instead of the visual content. It creates paired questions for the same surgical frame: one that names specific entities and a less-biased version that removes those names while adding grounding cues such as boxes or arrows to keep the question clear. Across five models, performance falls on the less-biased versions even though the image and correct answer stay identical. Text-only tests further show that action and target predictions barely change when the image is removed, pointing to shortcut reliance.

Core claim

SurgCheck shows that reported high performance in surgical visual question answering largely stems from linguistic shortcuts in question phrasing rather than genuine visual reasoning, demonstrated by consistent drops on less-biased paired questions with unchanged visual inputs and ground-truth answers, plus minimal impact from text-only ablation on action and target tasks.

What carries the argument

SurgCheck paired-question design, which matches each frame to an original entity-name question and a less-biased counterpart that removes names while using bounding box, arrow, spatial position, and periphrasis cues to preserve question clarity and identical ground-truth answers.
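
To make the paired-question design concrete, here is a minimal sketch of how one such pair might be represented. The field names, frame identifier, box coordinates, and question wording are illustrative assumptions, not the paper's actual schema.

```python
# Hypothetical sketch of one SurgCheck-style question pair (field names,
# frame id, coordinates, and wording are assumptions, not the paper's schema).
from dataclasses import dataclass
from typing import Literal, Optional, Tuple

Cue = Literal["bounding_box", "arrow", "spatial_position", "periphrasis"]

@dataclass
class QuestionPair:
    frame_id: str                                  # same surgical frame for both versions
    original: str                                  # question that names the entity
    less_biased: str                               # entity name removed, grounding cue added
    cue: Cue                                       # one of the four grounding cue types
    cue_box: Optional[Tuple[int, int, int, int]]   # pixel box when the cue is visual, else None
    answer: str                                    # ground truth, identical for both versions

pair = QuestionPair(
    frame_id="frame_001310",
    original="What action is being performed on the gallbladder?",
    less_biased="What action is being performed on the structure inside the red box?",
    cue="bounding_box",
    cue_box=(412, 208, 590, 366),
    answer="dissection",
)
```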

If this is right

  • Existing surgical VQA benchmarks overestimate visual understanding because they contain linguistic shortcuts.
  • Action and target prediction tasks in particular are solved mostly through text patterns rather than image analysis.
  • Strong benchmark scores do not guarantee that models will work when question phrasing changes in real surgical settings.
  • Bias-aware testing like SurgCheck is required to assess whether fine-tuned models have learned actual visual reasoning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training data for surgical VLMs may need to be rewritten without entity names to force genuine visual learning.
  • The same paired-question approach could diagnose shortcut problems in other medical imaging question tasks.
  • Real-world deployment of these models should include varied question phrasing to check for hidden reliance on wording.

Load-bearing premise

Adding grounding cues to the less-biased questions keeps them unambiguous and leaves the correct answers exactly the same as in the original questions.

What would settle it

A model that maintains identical accuracy on both original and less-biased question versions for the same set of surgical frames would indicate it is using visual information rather than shortcuts.
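
A minimal sketch of that diagnostic, assuming per-question correctness is recorded for both phrasings of each frame; the record format and example values are hypothetical, and the paper reports category-wise F1 rather than this raw accuracy gap.

```python
# Hypothetical sketch of the shortcut-gap check described above. Each record
# holds per-question correctness for the same frame under both phrasings;
# the format and example values are assumptions, not the paper's protocol.
def shortcut_gap(records):
    """Return (accuracy_original, accuracy_less_biased, gap)."""
    n = len(records)
    acc_orig = sum(r["correct_original"] for r in records) / n
    acc_less = sum(r["correct_less_biased"] for r in records) / n
    return acc_orig, acc_less, acc_orig - acc_less

# A model that actually reads the image should show a gap near zero;
# a large positive gap signals reliance on the original question's wording.
acc_o, acc_l, gap = shortcut_gap([
    {"correct_original": True,  "correct_less_biased": True},
    {"correct_original": True,  "correct_less_biased": False},
    {"correct_original": False, "correct_less_biased": False},
    {"correct_original": True,  "correct_less_biased": True},
])
print(f"original={acc_o:.2f}  less_biased={acc_l:.2f}  gap={gap:+.2f}")
```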

Figures

Figures reproduced from arXiv: 2605.01911 by Eunki Cho, Jongmin Shin, Ka Young Kim, Namkee Oh, Seong Tae Kim.

Figure 1: Overview of the SurgCheck dataset and evaluation pipeline.
Figure 2: Four grounding cue types in SurgCheck. Each cue type demonstrates how the corresponding question pairs differ between the original and less-biased versions. Red box and red arrow localize the target visually, while spatial position and periphrasis reference it textually through positional or contextual phrasing.
Figure 3: Category-wise F1-score across five VLMs.
Figure 4: Cue-type-wise F1 performance across five VLMs.
Figure 5: Text-only ablation with category-wise F1-scores under three conditions.
Figure 6: Top-3 performance gap between original and less-biased QAs for …
Original abstract

Purpose: Vision-language models (VLMs) have shown promising performance in surgical visual question answering (VQA). However, existing surgical VQA datasets often contain linguistic shortcuts, where question phrasing implicitly constrains the answer space. It remains unclear whether reported performance reflects visual understanding or reliance on such linguistic shortcuts. Methods: We introduce SurgCheck, a diagnostic benchmark for quantifying linguistic shortcut reliance in surgical VQA. SurgCheck employs a paired-question design in which each surgical frame is associated with an original question containing entity names and a less-biased counterpart that removes these names while preserving identical visual content and ground-truth answers. The resulting performance gap provides a diagnostic signal of shortcut reliance. To ensure that the less-biased question remains well-defined even without entity names, four grounding cues are incorporated: bounding box, arrow, spatial position, and periphrasis. We evaluate both general-purpose and surgical-specific VLMs under zero-shot and fine-tuned settings on SurgCheck. To evaluate open-ended zero-shot responses, we introduce an LLM-as-a-judge evaluation protocol. Results: Using SurgCheck, we observe consistent performance degradation on less-biased questions across five VLMs, despite identical visual inputs. Text-only ablation reveals minimal performance drops for action and target prediction, indicating that action and target prediction is largely driven by linguistic shortcuts rather than visual reasoning. Conclusion: SurgCheck provides a controlled diagnostic framework that exposes failure modes masked by linguistic bias in existing surgical VQA benchmarks. Our findings demonstrate that strong benchmark performance does not necessarily imply faithful visual understanding, underscoring the need for bias-aware evaluation in surgical VQA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces SurgCheck, a paired-question benchmark for surgical VQA that pairs original questions containing entity names with less-biased counterparts that remove those names while adding one of four grounding cues (bounding box, arrow, spatial position, or periphrasis). The authors claim these less-biased questions preserve identical visual content and ground-truth answers. They evaluate five VLMs (general and surgical-specific) in zero-shot and fine-tuned settings, report consistent performance degradation on the less-biased questions, and use text-only ablations plus an LLM-as-a-judge protocol to conclude that action and target prediction rely primarily on linguistic shortcuts rather than visual reasoning.

Significance. If the paired questions are verifiably equivalent, SurgCheck supplies a practical diagnostic tool that exposes how linguistic bias inflates benchmark scores in surgical VQA. The text-only ablation results, if reproducible, would strengthen the case that current high performance on action/target tasks does not imply visual understanding, with direct relevance to safety-critical deployment of VLMs in surgery.

major comments (3)
  1. [Methods, dataset construction] The central claim that performance gaps reflect shortcut removal rather than altered question difficulty rests on the assertion that less-biased questions preserve identical ground-truth answers. No expert verification, inter-annotator agreement, or automated equivalence check is described for the four grounding cues across the dataset; without this, the observed degradation could arise from introduced ambiguity or changed answer spaces.
  2. [Experiments, LLM-as-a-judge protocol] The zero-shot open-ended results depend on the LLM judge, yet the manuscript provides no details on the judge model, prompt template, temperature, or any calibration against human annotations. This directly affects the reliability of the reported degradation numbers and the text-only ablation conclusions.
  3. [Results, text-only ablation] The claim that action and target prediction show 'minimal performance drops' in the text-only setting is load-bearing for the shortcut-reliance conclusion, but the ablation description lacks controls for question length/complexity after cue addition and does not report per-task numerical deltas or statistical significance.
minor comments (2)
  1. [Abstract / Conclusion] The abstract and conclusion use 'consistent performance degradation' without citing the exact tables or figures that quantify the gaps for each model and task.
  2. [Figures] Figure captions for the example question pairs should explicitly state whether the ground-truth answer remains unchanged after cue insertion.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment point by point below, providing our responses and indicating the revisions we will make to improve the paper.

Point-by-point responses
  1. Referee: Methods, dataset construction: The central claim that performance gaps reflect shortcut removal rather than altered question difficulty rests on the assertion that less-biased questions preserve identical ground-truth answers. No expert verification, inter-annotator agreement, or automated equivalence check is described for the four grounding cues across the dataset; without this, the observed degradation could arise from introduced ambiguity or changed answer spaces.

    Authors: We agree that the manuscript should more explicitly document verification of answer equivalence to strengthen the central claim. The less-biased questions were constructed by design to refer to the same visual entities via the added grounding cues (bounding box, arrow, spatial position, or periphrasis), thereby preserving identical ground-truth answers and visual content. However, no post-construction expert verification or agreement metrics were described. In the revised Methods section, we will add a detailed account of the construction pipeline and report results from a new expert verification study: two independent surgical experts reviewed a stratified random sample of 200 paired questions for answer equivalence, yielding 94% agreement (Cohen's kappa = 0.87). Disagreements were resolved by discussion, and any ambiguous pairs were excluded from the final benchmark. This will directly address concerns about potential ambiguity or altered answer spaces. revision: yes

  2. Referee: Experiments / LLM-as-a-judge protocol: The zero-shot open-ended results depend on the LLM judge, yet the manuscript provides no details on the judge model, prompt template, temperature, or any calibration against human annotations. This directly affects the reliability of the reported degradation numbers and the text-only ablation conclusions.

    Authors: We acknowledge that the current manuscript lacks sufficient details on the LLM-as-a-judge protocol, which is necessary for reproducibility and to support the reliability of the zero-shot open-ended results. In our experiments, we used GPT-4o as the judge with a fixed prompt template that asks the model to output a binary equivalence decision (yes/no) between the VLM response and ground-truth answer, along with a brief rationale, while ignoring superficial phrasing differences. Temperature was set to 0 for deterministic outputs. We also performed calibration by comparing judge decisions to human annotations on a held-out set of 150 responses, achieving 91% agreement. In the revision, we will insert a dedicated subsection in Experiments that fully specifies the judge model, the complete prompt template, all hyperparameters, and the calibration procedure with agreement statistics. This will allow readers to assess the robustness of the degradation numbers and ablation conclusions. revision: yes

  3. Referee: Results, text-only ablation: The claim that action and target prediction show 'minimal performance drops' in the text-only setting is load-bearing for the shortcut-reliance conclusion, but the ablation description lacks controls for question length/complexity after cue addition and does not report per-task numerical deltas or statistical significance.

    Authors: We thank the referee for highlighting these gaps in the results presentation. The text-only ablation was designed to isolate linguistic effects by feeding only the question text (original vs. less-biased) to the models, and the observed minimal drops for action and target tasks support our interpretation of shortcut reliance. That said, the manuscript does not report question-length statistics, per-task numerical deltas, or statistical tests. In the revised Results section, we will add: (1) average token lengths and complexity metrics (e.g., Flesch reading ease) for both question types to demonstrate comparability; (2) a detailed table showing per-task accuracies with exact deltas; and (3) statistical significance via paired tests (McNemar's test with p-values) on the differences. These additions will make the load-bearing claim more rigorous without altering the underlying findings. revision: yes
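
The paired design makes McNemar's test a natural fit for the significance check promised in response 3. A minimal sketch, assuming per-question correctness labels are available for both phrasings; the counts in the 2x2 table are invented for illustration.

```python
# McNemar's test on paired per-question correctness, as referenced in the
# rebuttal's third response. The counts below are invented for illustration.
from statsmodels.stats.contingency_tables import mcnemar

# 2x2 table over the same questions answered under both phrasings:
#   rows = correct / wrong on the original question
#   cols = correct / wrong on the less-biased question
table = [
    [802, 143],   # correct on original: still correct / flipped to wrong
    [ 31, 224],   # wrong on original:   flipped to correct / still wrong
]

result = mcnemar(table, exact=False, correction=True)
print(f"statistic={result.statistic:.2f}  p-value={result.pvalue:.4g}")
# Many more correct->wrong flips than wrong->correct flips, with a small
# p-value, would support the claim that removing entity names hurts the model.
```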

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark with external evaluations

full rationale

The paper presents SurgCheck as an empirical diagnostic benchmark that constructs paired questions (original vs. less-biased with grounding cues) and measures VLM performance degradation plus text-only ablations on existing surgical VQA data. No mathematical derivations, equations, fitted parameters, or first-principles claims appear; results are direct observations from model testing. The central claim (performance drop indicates shortcut reliance) rests on the experimental design and external model outputs rather than any self-definitional reduction, self-citation chain, or renaming of known results. This is a standard empirical evaluation without load-bearing internal derivations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the assumption that the modified questions test the same visual content without new biases; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: Removing entity names while adding grounding cues preserves identical ground-truth answers and question validity.
    Invoked to justify the performance gap as a pure measure of shortcut reliance.

pith-pipeline@v0.9.0 · 5600 in / 1197 out tokens · 26646 ms · 2026-05-09T17:25:27.309763+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

19 extracted references · 8 canonical work pages · 4 internal anchors

  1. [1]

    In: Proceedings of the IEEE conference on computer vision and pattern recognition

    Agrawal, A., Batra, D., Parikh, D., Kembhavi, A.: Don’t just assume; look and answer: Overcoming priors for visual question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4971–4980 (2018)

  2. [2]

    LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training

    An, X., Xie, Y., Yang, K., Zhang, W., Zhao, X., Cheng, Z., Wang, Y., Xu, S., Chen, C., Wu, C., et al.: Llava-onevision-1.5: Fully open framework for democratized multimodal training. arXiv preprint arXiv:2509.23661 (2025)

  3. [3]

    arXiv preprint arXiv:2305.11692 (2023)

    Bai, L., Islam, M., Seenivasan, L., Ren, H.: Surgical-vqla: Transformer with gated vision-language embedding for visual question localized-answering in robotic surgery. arXiv preprint arXiv:2305.11692 (2023)

  4. [5]

    Qwen2.5-VL Technical Report

Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al.: Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923 (2025)

  5. [6]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Cai, M., Liu, H., Mustikovela, S.K., Meyer, G.P., Chai, Y., Park, D., Lee, Y.J.: Vip-llava: Making large multimodal models understand arbitrary visual prompts. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12914–12923 (2024)

  6. [7]

    arXiv preprint arXiv:2410.20327 (2024)

    Chen, X., Lai, Z., Ruan, K., Chen, S., Liu, J., Liu, Z.: R-llava: Improving med-vqa understanding through visual region of interest. arXiv preprint arXiv:2410.20327 (2024)

  7. [8]

    In: International Conference on Medical Image Computing and Computer-Assisted Intervention

He, R., Xu, M., Das, A., Khan, D.Z., Bano, S., Marcus, H.J., Stoyanov, D., Clarkson, M.J., Islam, M.: Pitvqa: Image-grounded text embedding llm for visual question answering in pituitary surgery. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 488–498. Springer (2024)

  8. [9]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops

Jeon, Y., Park, S., Shin, J., Park, K., Kim, B., Oh, N., Jung, K.H.: Surgen-net: A generative approach for surgical vqa with structured text generation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops. pp. 1292–1299 (October 2025)

  9. [10]

Artificial Intelligence in Medicine 143, 102611 (2023)

Lin, Z., Zhang, D., Tao, Q., Shi, D., Haffari, G., Wu, Q., He, M., Ge, Z.: Medical visual question answering: A survey. Artificial Intelligence in Medicine 143, 102611 (2023). https://doi.org/10.1016/j.artmed.2023.102611

  10. [11]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tuning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 26296–26306 (2024)

  11. [12]

    G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment

Liu, Y., Iter, D., Xu, Y., Wang, S., Xu, R., Zhu, C.: G-eval: Nlg evaluation using gpt-4 with better human alignment. arXiv preprint arXiv:2303.16634 (2023)

  12. [13]

    In: International conference on medical image computing and computer-assisted intervention

Seenivasan, L., Islam, M., Kannan, G., Ren, H.: Surgicalgpt: End-to-end language-vision gpt for visual question answering in surgery. In: International conference on medical image computing and computer-assisted intervention. pp. 281–290. Springer (2023)

  13. [14]

In: International Conference on Medical Image Computing and Computer-Assisted Intervention

Seenivasan, L., Islam, M., Krishna, A.K., Ren, H.: Surgical-vqa: Visual question answering in surgical scenes using transformer. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 33–43. Springer (2022)

  14. [15]

    In: International Conference on Medical Image Computing and Computer-Assisted Intervention

    Shin, J., Cho, E., Kim, K.Y., Kim, J.Y., Kim, S.T., Oh, N.: Towards holistic surgical scene graph. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 617–626. Springer (2025)

  15. [16]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Shtedritski, A., Rupprecht, C., Vedaldi, A.: What does clip know about a red circle? Visual prompt engineering for vlms. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 11987–11997 (2023)

  16. [18]

    PaliGemma 2: A Family of Versatile VLMs for Transfer

Steiner, A., Pinto, A.S., Tschannen, M., Keysers, D., Wang, X., Bitton, Y., Gritsenko, A., Minderer, M., Sherbondy, A., Long, S., et al.: Paligemma 2: A family of versatile vlms for transfer. arXiv preprint arXiv:2412.03555 (2024)

  17. [19]

Medical Image Analysis 107, 103789 (2026)

Wang, G., Bai, L., Wang, J., Yuan, K., Li, Z., Jiang, T., He, X., Wu, J., Chen, Z., Lei, Z., Liu, H., Wang, J., Zhang, F., Padoy, N., Navab, N., Ren, H.: Endochat: Grounded multimodal large language model for endoscopic surgery. Medical Image Analysis 107, 103789 (2026). https://doi.org/10.1016/j.media.2025.103789

  18. [20]

International journal of computer assisted radiology and surgery 19(7), 1409–1417 (2024)

Yuan, K., Kattel, M., Lavanchy, J.L., Navab, N., Srivastav, V., Padoy, N.: Advancing surgical vqa with scene graph knowledge. International journal of computer assisted radiology and surgery 19(7), 1409–1417 (2024)

  19. [21]

Advances in neural information processing systems 36, 46595–46623 (2023)

Zheng, L., Chiang, W.L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., et al.: Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in neural information processing systems 36, 46595–46623 (2023)