pith. sign in

arxiv: 2606.16494 · v2 · pith:U3DWW6Z4new · submitted 2026-06-15 · 💻 cs.CL · cs.AI· cs.CV

Lost at the End: Primacy Bias in Multimodal Retrieval-Augmented Question Answering

Pith reviewed 2026-06-29 05:20 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.CV
keywords multimodal QAretrieval-augmented generationposition biasprimacy effectKB-VQAvision-language modelslong contextlost at the end
0
0 comments X

The pith

Multimodal KB-VQA readers exhibit a strong primacy bias where the gold passage at the first position outperforms the same passage at the last by 16-26 accuracy points.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether the U-shaped lost-in-the-middle effect from text long-context models carries over to multimodal retrieval-augmented visual question answering. Using a controlled gold-position protocol on three 7B/8B vision-language readers and two benchmarks, it finds the performance curve instead follows a primacy pattern: information placed first is used far more reliably than information placed last. Ablations show the multimodal setting amplifies an existing text-mode primacy effect by 2.2 to 4.5 times and localize the problem to the first prompt slot of the instruction-tuned reader. Retrieval-side interventions such as MMR, oracle reranking, and rank reordering leave the gap unchanged, indicating that standard recall@k metrics fail to capture actual reader behavior.

Core claim

In multimodal KB-VQA the position dependence of retrieved passages is not U-shaped but primacy-driven: across every reader-by-benchmark combination, placing the single gold passage in the first slot yields 16-26 points higher accuracy than placing it in the final slot. The effect is driven by the reader itself rather than by image encoding or distractor content, and standard retrieval fixes do not close it.

What carries the argument

The gold-position protocol, which holds every element fixed except the prompt slot occupied by the single gold passage.

If this is right

  • Recall@k ceases to be a reliable proxy for end-to-end KB-VQA performance once position bias is present.
  • Any deployed multimodal system that concatenates multiple retrieved passages will under-use passages that land near the end of the context.
  • Reader-side architectural or fine-tuning changes are required to mitigate the bias, because retrieval reordering alone does not help.
  • The same protocol can be reused as a diagnostic instrument to measure whether future reader interventions reduce the first-versus-last gap.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Systems that dynamically reorder passages based on reader attention patterns could be tested with this protocol to see whether they restore more uniform position use.
  • The amplification from text to multimodal may point to competition between visual tokens and late text tokens for the reader's limited attention budget.
  • If the bias is strongest at the very first slot, instruction-tuning data that always places key facts first could be a cheap source of the observed behavior.

Load-bearing premise

Varying only the gold passage's slot inside the prompt is enough to isolate reader-side position effects without leftover confounds from image features or question phrasing.

What would settle it

Running the same gold-position sweeps on the same readers and benchmarks and observing no consistent 16-plus-point gap between first and last placement.

Figures

Figures reproduced from arXiv: 2606.16494 by Jefferson Chen, Jianyang Gu, Jieyuan Liu, Shijie Chen, Zhen Wang.

Figure 1
Figure 1. Figure 1: Pipeline and position-permutation protocol. The image and question are encoded by a frozen PreFLMR ViT-G retriever, which returns 50 candidate passages. From these we construct three 10-passage prompts that are bit-identical except for the index of the gold passage g (red): first (index 0), middle (index 4), last (index 9). Nine distractors d0, . . . , d8 (grey) are identical across cells per question. A f… view at source ↗
Figure 2
Figure 2. Figure 2: Position effect across k on six reader×dataset cells. Top row: INFOSEEK (Qwen2.5-VL-7B, InternVL3- 8B, Qwen3-VL-8B, left to right); bottom row: E-VQA (same column ordering). First decays gracefully on all six panels; middle and last drop steeply and collapse together below it. The middle-vs-last ordering on Qwen3-VL-8B E-VQA separably inverts at higher k (last > middle by 5.2 pp at k=20, paired-bootstrap C… view at source ↗
Figure 3
Figure 3. Figure 3: Reader follows prompt slot, not retrieval rank (INFOSEEK). Distractor-lift error breakdown for the three shuffle cells (gold fixed at slot 9). Slot-0 selec￾tion (blue) dominates in dissociable cells B+C; rank-1 selection (orange) runs at the dashed baseline. E-VQA replication is in App. I. tion runs at 1.01× baseline. Cell B is the smoking gun: with the least-similar (rank-9) distractor at slot 0, the read… view at source ↗
Figure 4
Figure 4. Figure 4: Diversification doesn’t help. MMR at three λ values fails to improve over vanilla top-k at k=5 or k=10. Every paired-bootstrap CI on ∆(MMR−vanilla) overlaps zero. not a characterized result: the experiment is under￾powered at n=254 (the gold-in-top-10 subset of our anchor cell) and a larger gold-in-top-10 subset would be needed to characterize the first-vs-native gain (App. O). Insight 4: Invest in reader-… view at source ↗
Figure 5
Figure 5. Figure 5: Failure-bucket distribution shifts with gold position. Per-cell breakdown of P17 INFOSEEK Qwen2.5-VL-7B error rows at k=10 (n=962; 272 first, 336 middle, 354 last). As gold moves first → last, the modal failure mode flips from extraction_failed (36.4% → 13.8%) to distractor_lift (30.1% → 56.5%). G Modality Amplification Visualization [PITH_FULL_IMAGE:figures/full_fig_p012_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Modality amplification on Qwen2.5-VL￾7B (illustrative). The multimodal setting amplifies the text-mode primacy effect by 2.2× on INFOSEEK and 3.6× on E-VQA for Qwen2.5-VL-7B; analogous ratios for InternVL3-8B and Qwen3-VL-8B are reported in [PITH_FULL_IMAGE:figures/full_fig_p013_6.png] view at source ↗
Figure 8
Figure 8. Figure 8: shows the cross-tabulation for the P24 shuffle replication on Qwen2.5-VL-7B × E-VQA gold-at-last (n=500). The qualitative pattern is identical to the INFOSEEK result ( [PITH_FULL_IMAGE:figures/full_fig_p013_8.png] view at source ↗
Figure 7
Figure 7. Figure 7: Image-token proximity is not the primary driver of first-vs-last. Per-cell accuracy under image￾at-start (P17 default) vs. image-at-end. First-vs-last gap preserved (1.12× anchor, GENERALIZATION); mid￾dle/last ordering inverts under image-at-end (App. H discusses). I Shuffle Replication on E-VQA (P24) [PITH_FULL_IMAGE:figures/full_fig_p013_7.png] view at source ↗
Figure 9
Figure 9. Figure 9: shows the suggestive but underpowered oracle-rerank result described in §4.4. native first last rerank configuration (n=254 gold-in-top-10 subset) 0 10 20 30 40 50 60 70 80 accuracy (%) 47.64 50.79 42.13 Δ +3.15 [-0.39, +6.69] Δ -5.51 [-9.84, -1.18] [PITH_FULL_IMAGE:figures/full_fig_p015_9.png] view at source ↗
read the original abstract

Knowledge-based visual question answering (KB-VQA) lets vision-language systems answer questions that exceed their parametric knowledge by conditioning a reader on passages retrieved from a Wikipedia-scale knowledge base. In pure-text long-context LLMs, retrieved-context use follows the U-shaped "lost-in-the-middle" effect of Liu et al. (2024): information at the start and end of context is used, the middle is lost. Whether this transfers to deployed multimodal KB-VQA is open. To close this gap, we design the first controlled probe of reader-side position dependence in multimodal KB-VQA: a gold-position protocol in which only the gold passage's prompt slot varies within question. We run it on three open-source 7B/8B VLM readers and two KB-VQA benchmarks at k up to 20. The shape flips from U to primacy: gold-at-first beats gold-at-last by 16 to 26 points on every reader-by-benchmark cell, an effect we call "Lost at the End". Three targeted ablations narrow the cause: a text-only control shows the multimodal setting amplifies an already-present text-mode primacy 2.2 to 4.5 times, and image-position and distractor-shuffle ablations together pin the locus to prompt slot 0 of the instruction-tuned reader. On a frozen reader, three retrieval-side fixes (MMR, oracle reranking, rank-based reordering) all leave the gap intact (no separable improvement). Our findings indicate that recall@k is the wrong metric for deployed KB-VQA and that closing the gap requires reader-side intervention; we release our protocol as a controlled instrument for evaluating such interventions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces a gold-position protocol for probing reader-side position dependence in multimodal KB-VQA, where only the slot of the gold passage varies while holding other elements fixed. Across three 7B/8B VLMs and two benchmarks at k≤20, it reports a primacy bias ('Lost at the End') in which gold-at-first outperforms gold-at-last by 16–26 points on every reader-by-benchmark cell, flipping the U-shape observed in text-only long-context models. Targeted ablations (text-only control, image-position, distractor-shuffle) attribute the effect to amplification of text-mode primacy in the multimodal setting and localization at prompt slot 0; retrieval-side interventions (MMR, oracle reranking, rank reordering) leave the gap intact on a frozen reader. The protocol is released for evaluating reader-side fixes.

Significance. If the primacy effect is shown to be isolated from template or encoding confounds, the result would be significant for KB-VQA deployment: it implies recall@k is an incomplete metric and that reader-side interventions are required. The controlled gold-position design and the three ablations that narrow the locus constitute a clear empirical contribution; releasing the protocol adds reproducibility value. The directional consistency across cells is a strength, but the lack of error bars or tests noted in the review limits assessment of effect reliability.

major comments (3)
  1. [Methods] Methods (gold-position protocol description): the protocol is claimed to vary only the gold passage's prompt slot while keeping prompt formatting, image encoding order, and distractor content fixed, yet the text does not explicitly state that a single fixed template with k slots is instantiated for every position. This is load-bearing for the central 16–26 point claim, because any position-dependent template artifact (e.g., special first-slot handling or image-token ordering) could contribute to the measured gap.
  2. [Results] Results (performance tables or figures reporting the 16–26 point gaps): no error bars, standard deviations across runs, or statistical significance tests are mentioned, despite the abstract claiming 'consistent directional results.' This undermines confidence that the reported effect sizes exceed noise, especially given the low soundness rating on exclusion criteria and post-hoc choices.
  3. [Ablations] Ablations (text-only control and image-position ablation): while these narrow the cause to the multimodal reader, the text-only control is reported to amplify the text-mode primacy by 2.2–4.5×, but without quantitative details on how the multimodal prompt template was matched to the text-only version, residual differences in tokenization or instruction formatting cannot be ruled out as partial contributors.
minor comments (2)
  1. Figure captions or tables should include the exact number of questions per benchmark and the precise k values used for each reported cell to allow direct replication.
  2. [Abstract] The abstract states 'three open-source 7B/8B VLM readers' but does not name the models; this should be added for clarity even if named later in the text.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our gold-position protocol and strengthen the empirical claims. We address each major comment below.

read point-by-point responses
  1. Referee: [Methods] Methods (gold-position protocol description): the protocol is claimed to vary only the gold passage's prompt slot while keeping prompt formatting, image encoding order, and distractor content fixed, yet the text does not explicitly state that a single fixed template with k slots is instantiated for every position. This is load-bearing for the central 16–26 point claim, because any position-dependent template artifact (e.g., special first-slot handling or image-token ordering) could contribute to the measured gap.

    Authors: We agree that the current Methods description would benefit from an explicit statement confirming use of a single fixed template. In the revised manuscript we will add the following sentence: 'Across all positions p we instantiate one fixed k-slot template; only the slot index receiving the gold passage changes while distractor content, formatting tokens, and image encoding order remain identical.' This directly rules out template artifacts as a source of the reported gap. revision: yes

  2. Referee: [Results] Results (performance tables or figures reporting the 16–26 point gaps): no error bars, standard deviations across runs, or statistical significance tests are mentioned, despite the abstract claiming 'consistent directional results.' This undermines confidence that the reported effect sizes exceed noise, especially given the low soundness rating on exclusion criteria and post-hoc choices.

    Authors: We accept that the absence of variability measures weakens the presentation. In revision we will add standard deviations computed over three random seeds for retrieval ordering (where stochastic) and include paired significance tests (McNemar) between gold-first and gold-last conditions in all tables and figures. These additions will be reported for every reader-by-benchmark cell. revision: yes

  3. Referee: [Ablations] Ablations (text-only control and image-position ablation): while these narrow the cause to the multimodal reader, the text-only control is reported to amplify the text-mode primacy by 2.2–4.5×, but without quantitative details on how the multimodal prompt template was matched to the text-only version, residual differences in tokenization or instruction formatting cannot be ruled out as partial contributors.

    Authors: We will expand the ablation subsection with the requested quantitative details: we will report the exact token counts for the multimodal versus text-only templates (differing only by image tokens), list all instruction tokens that were retained or removed, and note any tokenizer-specific special tokens. These numbers will be added to the text-only control paragraph so readers can evaluate residual formatting differences. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical measurements with no derivations or fitted predictions

full rationale

The paper reports direct experimental results from a gold-position protocol and targeted ablations on VLM readers. No equations, parameters fitted to subsets then re-predicted, self-citations used as load-bearing uniqueness theorems, or ansatzes appear in the provided text. The central claim (16-26 point primacy gap) is a measured difference under controlled variation of gold slot only; it does not reduce to any input quantity by construction. This is the normal case of an empirical study whose validity rests on experimental controls rather than algebraic identity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the validity of the gold-position protocol and the assumption that observed accuracy differences are attributable to prompt position rather than unmeasured interactions between vision and text encoders.

axioms (1)
  • domain assumption The gold-position protocol isolates reader-side position dependence without residual confounds from image encoding, distractor content, or benchmark-specific question phrasing.
    This assumption is required to attribute the 16-26 point gap specifically to primacy bias rather than other experimental factors.

pith-pipeline@v0.9.1-grok · 5851 in / 1311 out tokens · 32958 ms · 2026-06-29T05:20:40.849741+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

29 extracted references · 4 canonical work pages

  1. [1]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, and 8 others. 2025. https://arxiv.org/abs/2502.13923 Qwen2.5-vl technical report . arXiv preprint arXiv:2502.13923

  2. [2]

    Davide Caffagni, Federico Cocchi, Nicholas Moratelli, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara. 2024. Wiki-LLaVA : Hierarchical retrieval-augmented generation for multimodal LLMs . In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 1818--1826

  3. [3]

    Jaime Carbonell and Jade Goldstein. 1998. https://doi.org/10.1145/290941.291025 The use of MMR , diversity-based reranking for reordering documents and producing summaries . In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 335--336

  4. [4]

    Shijie Chen, Bernal Jim \'e nez Guti \'e rrez, and Yu Su. 2025. https://arxiv.org/abs/2410.02642 Attention in large language models yields efficient zero-shot re-rankers . In The Thirteenth International Conference on Learning Representations (ICLR)

  5. [5]

    Yang Chen, Hexiang Hu, Yi Luan, Haitian Sun, Soravit Changpinyo, Alan Ritter, and Ming-Wei Chang. 2023. Can pre-trained vision and language models answer visual information-seeking questions? In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 14948--14968

  6. [6]

    Federico Cocchi, Nicholas Moratelli, Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara. 2025. Augmenting multimodal LLMs with self-reflective tokens for knowledge-based visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9199--9209

  7. [7]

    Florin Cuconasu, Simone Filice, Guy Horowitz, Yoelle Maarek, and Fabrizio Silvestri. 2025. Do RAG systems really suffer from positional bias? In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 28022--28036, Suzhou, China

  8. [8]

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. 2023. https://arxiv.org/abs/2305.06500 InstructBLIP : Towards general-purpose vision-language models with instruction tuning . In Advances in Neural Information Processing Systems (NeurIPS)

  9. [9]

    Le, Abhishek Kumar, James Glass, Alexander Ratner, Chen-Yu Lee, Ranjay Krishna, and Tomas Pfister

    Cheng-Yu Hsieh, Yung-Sung Chuang, Chun-Liang Li, Zifeng Wang, Long T. Le, Abhishek Kumar, James Glass, Alexander Ratner, Chen-Yu Lee, Ranjay Krishna, and Tomas Pfister. 2024. Found in the middle: Calibrating positional attention bias improves long context utilization. In Findings of the Association for Computational Linguistics (ACL), pages 14982--14995

  10. [10]

    Chan-Wei Hu, Yueqi Wang, Shuo Xing, Chia-Ju Chen, Suofei Feng, Ryan Rossi, and Zhengzhong Tu. 2025. https://arxiv.org/abs/2505.24073 mRAG : Elucidating the design space of multi-modal retrieval-augmented generation . arXiv preprint arXiv:2505.24073

  11. [11]

    Jan Hutter, David Rau, Maarten Marx, and Jaap Kamps. 2025. Lost but not only in the middle: Positional bias in retrieval augmented generation. In Advances in Information Retrieval (ECIR), volume 15572 of LNCS, pages 247--261. Springer

  12. [12]

    Omar Khattab and Matei Zaharia. 2020. https://doi.org/10.1145/3397271.3401075 ColBERT : Efficient and effective passage search via contextualized late interaction over BERT . In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 39--48

  13. [13]

    Weizhe Lin, Jingbiao Mei, Jinghong Chen, and Bill Byrne. 2024. PreFLMR : Scaling up fine-grained late-interaction multi-modal retrievers. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), pages 5294--5316

  14. [14]

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. https://arxiv.org/abs/2304.08485 Visual instruction tuning . In Advances in Neural Information Processing Systems (NeurIPS)

  15. [15]

    Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang

    Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024. https://doi.org/10.1162/tacl_a_00638 Lost in the middle: How language models use long contexts . Transactions of the Association for Computational Linguistics, 12:157--173

  16. [16]

    Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. 2022. Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL), pages 8086--8098

  17. [17]

    Thomas Mensink, Jasper Uijlings, Lluis Castrejon, Arushi Goel, Felipe Cadar, Howard Zhou, Fei Sha, Andr \'e Araujo, and Vittorio Ferrari. 2023. Encyclopedic VQA : Visual questions about detailed properties of fine-grained categories. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 3113--3124

  18. [18]

    OpenAI . 2023. GPT-4V(ision) system card. https://openai.com/index/gpt-4v-system-card/

  19. [19]

    Smith, and Mike Lewis

    Ofir Press, Noah A. Smith, and Mike Lewis. 2022. https://arxiv.org/abs/2108.12409 Train short, test long: Attention with linear biases enables input length extrapolation . In The Tenth International Conference on Learning Representations (ICLR)

  20. [20]

    Qwen Team . 2025. https://arxiv.org/abs/2511.21631 Qwen3-vl technical report . arXiv preprint arXiv:2511.21631

  21. [21]

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. 2024. https://doi.org/10.1016/j.neucom.2023.127063 RoFormer : Enhanced transformer with rotary position embedding . Neurocomputing, 568:127063

  22. [22]

    Zhijie Tan, Xu Chu, Weiping Li, and Tong Mo. 2024. https://arxiv.org/abs/2410.16983 Order matters: Exploring order sensitivity in multimodal large language models . arXiv preprint arXiv:2410.16983

  23. [23]

    Raphael Tang, Crystina Zhang, Xueguang Ma, Jimmy Lin, and Ferhan Ture. 2024. https://aclanthology.org/2024.naacl-long.129/ Found in the middle: Permutation self-consistency improves listwise ranking in large language models . In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Languag...

  24. [24]

    Xinyu Tian, Shu Zou, Zhaoyuan Yang, and Jing Zhang. 2025. Identifying and mitigating position bias of multi-image vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10599--10609

  25. [25]

    Kebin Wu and Fatima Albreiki. 2025. https://arxiv.org/abs/2511.11216 Positional bias in multimodal embedding models: Do they favor the beginning, the middle, or the end? arXiv preprint arXiv:2511.11216

  26. [26]

    Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. 2024. https://arxiv.org/abs/2309.17453 Efficient streaming language models with attention sinks . In The Twelfth International Conference on Learning Representations (ICLR)

  27. [27]

    Yibin Yan and Weidi Xie. 2024. EchoSight : Advancing visual-language models with wiki knowledge. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 1538--1551, Miami, Florida, USA

  28. [28]

    Jiayu Yao, Shenghua Liu, Yiwei Wang, Lingrui Mei, Baolong Bi, Yuyao Ge, Zhecheng Li, and Xueqi Cheng. 2025. Who is in the spotlight: The hidden bias undermining multimodal retrieval-augmented generation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 15183--15193

  29. [29]

    Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, Zhangwei Gao, Erfei Cui, Xuehui Wang, Yue Cao, Yangzhou Liu, Xingguang Wei, Hongjie Zhang, Haomin Wang, Weiye Xu, and 32 others. 2025. https://arxiv.org/abs/2504.10479 InternVL3 : Exploring advanced training and test-time recipes for open-s...