A Systematic Evaluation of Positional Bias in Multi-Video Summarization with MLLMs
Pith reviewed 2026-06-28 05:55 UTC · model grok-4.3
The pith
The position of a video in a multi-video input affects summary quality in MLLMs, with effects that vary by domain and model.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Positional effects in multi-video summarization are domain- and model-dependent: signed directional bias can be small even when middle positions underperform, and increasing visual or generation budget does not uniformly remove the imbalance.
What carries the argument
Three complementary metrics—Coverage, Directional Positional Bias (DPB), and Middle-Edge Gap (MEG)—applied to a benchmark of two- and four-video inputs from ActivityNet and News videos to measure how input slot changes summary quality for unchanged content.
Load-bearing premise
The chosen ActivityNet and News video clips remain representative of real multi-video inputs and that the three metrics isolate positional bias without confounding effects from video length or content complexity.
What would settle it
An experiment that swaps video positions across identical content sets and checks whether measured differences in summary quality reverse or disappear for the tested models and domains.
Figures
read the original abstract
Multimodal Large Language Models (MLLMs) are increasingly used for video understanding, yet their reliability under multi-video inputs remains poorly understood. We study positional bias in multi-video summarization, where the quality of a per-video summary can change with the video's input slot even when the underlying content is unchanged. We construct a benchmark from ActivityNet and News videos, covering Cooking, Domestic, Leisure, and News settings with two- and four-video inputs. We evaluate nine open-source and proprietary MLLMs and measure position effects with three complementary metrics: Coverage, Directional Positional Bias (DPB), and Middle-Edge Gap (MEG). Our results show that positional effects are domain- and model-dependent: signed directional bias can be small even when middle positions underperform, and increasing visual or generation budget does not uniformly remove the imbalance. We further analyze prompt-level mitigation methods. Together, the results show that multi-video summarization remains sensitive to input protocol and position, motivating more robust order-invariant multimodal systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a systematic empirical evaluation of positional bias in multi-video summarization tasks performed by nine open-source and proprietary MLLMs. Using a benchmark constructed from ActivityNet and News video clips across Cooking, Domestic, Leisure, and News domains, the authors examine two- and four-video inputs and quantify position effects via three metrics (Coverage, Directional Positional Bias (DPB), and Middle-Edge Gap (MEG)). The central claim is that positional effects are domain- and model-dependent, that signed directional bias can remain small even when middle positions underperform, and that increasing visual or generation budgets does not uniformly eliminate the imbalance; prompt-level mitigation strategies are also analyzed.
Significance. If the results hold after addressing methodological gaps, the work is significant as a timely measurement study that documents concrete limitations of current MLLMs when processing multiple videos—an increasingly common input regime. The domain- and model-specific patterns, together with the complementary metrics, supply practitioners with actionable diagnostics and motivate the development of order-invariant multimodal architectures. The purely empirical design avoids circularity and offers falsifiable observations that can be directly replicated or extended.
major comments (2)
- [Abstract] Abstract and Methods: The abstract states results on domain- and model-dependence but supplies no statistical tests, error bars, sample sizes, or exclusion criteria. Without these, it is impossible to judge whether the reported effects are robust or whether observed middle-position underperformance reflects sampling variability.
- [Evaluation Metrics] Evaluation setup and metric definitions: The central claim requires that Coverage, DPB, and MEG isolate positional bias after permuting identical video sets. If ActivityNet/News clips vary in duration or scene density and position assignment correlates with these properties, the observed imbalances could be driven by content rather than slot. The manuscript must explicitly describe length normalization, content-matched controls, or permutation protocols to rule out this confound.
minor comments (1)
- [Abstract] Abstract: Adding a parenthetical note on the total number of video clips or input instances used would help readers gauge the scale of the benchmark.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to improve clarity on statistical reporting and methodological details.
read point-by-point responses
-
Referee: [Abstract] Abstract and Methods: The abstract states results on domain- and model-dependence but supplies no statistical tests, error bars, sample sizes, or exclusion criteria. Without these, it is impossible to judge whether the reported effects are robust or whether observed middle-position underperformance reflects sampling variability.
Authors: The abstract serves as a concise overview; full details on sample sizes (50 video sets per domain and input size, each evaluated over 5 random seeds), error bars (standard error in all figures/tables), and exclusion criteria (videos <10s or with transcription failures removed) appear in Section 3 and the appendix. We agree the abstract should signal robustness and will revise it to note 'results averaged over permutations with standard errors'. We will also add Wilcoxon signed-rank tests for position effects to the Methods section. revision: yes
-
Referee: [Evaluation Metrics] Evaluation setup and metric definitions: The central claim requires that Coverage, DPB, and MEG isolate positional bias after permuting identical video sets. If ActivityNet/News clips vary in duration or scene density and position assignment correlates with these properties, the observed imbalances could be driven by content rather than slot. The manuscript must explicitly describe length normalization, content-matched controls, or permutation protocols to rule out this confound.
Authors: The benchmark explicitly constructs sets from the same domain with comparable durations (clips selected within 20% length variance) and applies random permutations of video order for every trial, ensuring each video appears equally often in every position across the experiment. This isolates positional effects. We will add an expanded 'Benchmark Construction' paragraph detailing the exact permutation protocol, set-matching criteria, and confirmation that no post-hoc length normalization is applied beyond initial selection. revision: yes
Circularity Check
No circularity: purely empirical measurement study
full rationale
The paper constructs a benchmark from ActivityNet and News videos, evaluates nine MLLMs on two- and four-video inputs, and reports position effects via three metrics (Coverage, DPB, MEG). No equations, fitted parameters, uniqueness theorems, or self-citations appear in the derivation chain; all claims rest on direct experimental measurements rather than any reduction of outputs to inputs by construction. The central results (domain- and model-dependent positional effects) are therefore independent of the inputs and receive no circularity penalty.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Video content remains identical when its input position changes.
Reference graph
Works this paper leans on
-
[1]
Proceedings of the 2nd Workshop on User-Centric Narrative Summarization of Long Videos , pages =
Kansal, Kajal and Kansal, Nikita and Bavana, Sreevaatsav and Vamshi, Bodla Krishna and Goyal, Nidhi , title =. Proceedings of the 2nd Workshop on User-Centric Narrative Summarization of Long Videos , pages =. 2023 , isbn =. doi:10.1145/3607540.3617139 , abstract =
-
[2]
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
Zhang, Hang and Li, Xin and Bing, Lidong. Video- LL a MA : An Instruction-tuned Audio-Visual Language Model for Video Understanding. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. 2023. doi:10.18653/v1/2023.emnlp-demo.49
-
[3]
2025 , eprint=
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities , author=. 2025 , eprint=
2025
-
[4]
MLVU: Benchmarking Multi-task Long Video Understanding , year=
Zhou, Junjie and Shu, Yan and Zhao, Bo and Wu, Boya and Liang, Zhengyang and Xiao, Shitao and Qin, Minghao and Yang, Xi and Xiong, Yongping and Zhang, Bo and Huang, Tiejun and Liu, Zheng , booktitle=. MLVU: Benchmarking Multi-task Long Video Understanding , year=
-
[5]
Proceedings of the AAAI conference on artificial intelligence , volume=
Activitynet-qa: A dataset for understanding complex web videos via question answering , author=. Proceedings of the AAAI conference on artificial intelligence , volume=
-
[6]
Jung, Woojun and Kim, Junyeong. QEVA : A Reference-Free Evaluation Metric for Narrative Video Summarization with Multimodal Question Answering. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025. doi:10.18653/v1/2025.findings-emnlp.1340
-
[7]
Grenander, Matt and Dong, Yue and Cheung, Jackie Chi Kit and Louis, Annie. Countering the Effects of Lead Bias in News Summarization via Multi-Stage Training and Auxiliary Losses. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). ...
-
[8]
URLhttps://aclanthology.org/2024.tacl-1.9/
Liu, Nelson F. and Lin, Kevin and Hewitt, John and Paranjape, Ashwin and Bevilacqua, Michele and Petroni, Fabio and Liang, Percy. Lost in the Middle: How Language Models Use Long Contexts. Transactions of the Association for Computational Linguistics. 2024. doi:10.1162/tacl_a_00638
-
[9]
Proceedings of the Computer Vision and Pattern Recognition Conference , pages=
Identifying and mitigating position bias of multi-image vision-language models , author=. Proceedings of the Computer Vision and Pattern Recognition Conference , pages=
-
[10]
2025 , eprint=
Video-LevelGauge: Investigating Contextual Positional Bias in Large Video Language Models , author=. 2025 , eprint=
2025
-
[11]
2015 , volume=
Heilbron, Fabian Caba and Escorcia, Victor and Ghanem, Bernard and Niebles, Juan Carlos , booktitle=. 2015 , volume=
2015
-
[12]
Incorporating Background Knowledge into Video Description Generation
Whitehead, Spencer and Ji, Heng and Bansal, Mohit and Chang, Shih-Fu and Voss, Clare. Incorporating Background Knowledge into Video Description Generation. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 2018. doi:10.18653/v1/D18-1433
-
[13]
Schilcher, Patrick and Karasin, Dominik and Sch. Characterizing Positional Bias in Large Language Models: A Multi-Model Evaluation of Prompt Order Effects. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025. doi:10.18653/v1/2025.findings-emnlp.1124
-
[14]
P o S um-Bench: Benchmarking Position Bias in LLM -based Conversational Summarization
Sun, Xu and Delphin-Poulat, Lionel and Tarnec, Christ \`e le and Shimorina, Anastasia. P o S um-Bench: Benchmarking Position Bias in LLM -based Conversational Summarization. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025. doi:10.18653/v1/2025.emnlp-main.404
-
[15]
European Conference on Computer Vision , year=
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models , author=. European Conference on Computer Vision , year=
-
[16]
Zico and Morency, Louis-Philippe and Salakhutdinov, Ruslan
Tsai, Yao-Hung Hubert and Bai, Shaojie and Liang, Paul Pu and Kolter, J. Zico and Morency, Louis-Philippe and Salakhutdinov, Ruslan. Multimodal Transformer for Unaligned Multimodal Language Sequences. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019. doi:10.18653/v1/P19-1656
-
[17]
Large Language Models are not Fair Evaluators
Wang, Peiyi and Li, Lei and Chen, Liang and Cai, Zefan and Zhu, Dawei and Lin, Binghuai and Cao, Yunbo and Kong, Lingpeng and Liu, Qi and Liu, Tianyu and Sui, Zhifang. Large Language Models are not Fair Evaluators. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.ac...
-
[18]
ROUGE : A Package for Automatic Evaluation of Summaries
Lin, Chin-Yew. ROUGE : A Package for Automatic Evaluation of Summaries. Text Summarization Branches Out. 2004
2004
-
[19]
METEOR : An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments
Banerjee, Satanjeev and Lavie, Alon. METEOR : An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. 2005
2005
-
[20]
2020 , eprint=
BERTScore: Evaluating Text Generation with BERT , author=. 2020 , eprint=
2020
-
[21]
Proceedings of the 37th International Conference on Neural Information Processing Systems , articleno =
Dai, Wenliang and Li, Junnan and Li, Dongxu and Tiong, Anthony Meng Huat and Zhao, Junqi and Wang, Weisheng and Li, Boyang and Fung, Pascale and Hoi, Steven , title =. Proceedings of the 37th International Conference on Neural Information Processing Systems , articleno =. 2023 , publisher =
2023
-
[22]
Judging the Judges: A Systematic Study of Position Bias in LLM -as-a-Judge
Shi, Lin and Ma, Chiyu and Liang, Wenhua and Diao, Xingjian and Ma, Weicheng and Vosoughi, Soroush. Judging the Judges: A Systematic Study of Position Bias in LLM -as-a-Judge. Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguis...
2025
-
[23]
2024 , eprint=
Large Language Models are Zero-Shot Rankers for Recommender Systems , author=. 2024 , eprint=
2024
-
[24]
Zhang, Zhenyu and Chen, Runjin and Liu, Shiwei and Yao, Zhewei and Ruwase, Olatunji and Chen, Beidi and Wu, Xiaoxia and Wang, Zhangyang , booktitle =. Found in the Middle: How Language Models Use Long Contexts Better via Plug-and-Play Positional Encoding , url =. doi:10.52202/079017-1943 , editor =
-
[25]
Integrating Video and Text: A Balanced Approach to Multimodal Summary Generation and Evaluation
Pennec, Galann and Liu, Zhengyuan and Asher, Nicholas and Muller, Philippe and Chen, Nancy F. Integrating Video and Text: A Balanced Approach to Multimodal Summary Generation and Evaluation. Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Comput...
2025
-
[26]
On Positional Bias of Faithfulness for Long-form Summarization
Wan, David and Vig, Jesse and Bansal, Mohit and Joty, Shafiq. On Positional Bias of Faithfulness for Long-form Summarization. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.naacl-long.442
-
[27]
Split and Merge: Aligning Position Biases in LLM -based Evaluators
Li, Zongjie and Wang, Chaozheng and Ma, Pingchuan and Wu, Daoyuan and Wang, Shuai and Gao, Cuiyun and Liu, Yang. Split and Merge: Aligning Position Biases in LLM -based Evaluators. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.621
-
[28]
Findings of the Association for Computational Linguistics: ACL 2024 , month = aug, year =
Koo, Ryan and Lee, Minhwa and Raheja, Vipul and Park, Jong Inn and Kim, Zae Myung and Kang, Dongyeop. Benchmarking Cognitive Biases in Large Language Models as Evaluators. Findings of the Association for Computational Linguistics: ACL 2024. 2024. doi:10.18653/v1/2024.findings-acl.29
-
[29]
G -Eval: NLG Evaluation using Gpt-4 with Better Human Alignment
Liu, Yang and Iter, Dan and Xu, Yichong and Wang, Shuohang and Xu, Ruochen and Zhu, Chenguang. G -Eval: NLG Evaluation using Gpt-4 with Better Human Alignment. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. doi:10.18653/v1/2023.emnlp-main.153
-
[30]
GPTS core: Evaluate as You Desire
Fu, Jinlan and Ng, See-Kiong and Jiang, Zhengbao and Liu, Pengfei. GPTS core: Evaluate as You Desire. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.naacl-long.365
-
[31]
B leu: a Method for Automatic Evaluation of Machine Translation
Papineni, Kishore and Roukos, Salim and Ward, Todd and Zhu, Wei-Jing. B leu: a Method for Automatic Evaluation of Machine Translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. 2002. doi:10.3115/1073083.1073135
-
[32]
Evaluating the Factual Consistency of Abstractive Text Summarization
Kryscinski, Wojciech and McCann, Bryan and Xiong, Caiming and Socher, Richard. Evaluating the Factual Consistency of Abstractive Text Summarization. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020. doi:10.18653/v1/2020.emnlp-main.750
-
[33]
2025 , eprint=
InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling , author=. 2025 , eprint=
2025
-
[34]
Proceedings of the Computer Vision and Pattern Recognition Conference , pages=
Identifying and Mitigating Position Bias of Multi-image Vision-Language Models , author=. Proceedings of the Computer Vision and Pattern Recognition Conference , pages=
-
[35]
2023 , eprint=
VideoLLM: Modeling Video Sequence with Large Language Models , author=. 2023 , eprint=
2023
-
[36]
Multimodal Abstractive Summarization for How2 Videos
Palaskar, Shruti and Libovick \'y , Jind r ich and Gella, Spandana and Metze, Florian. Multimodal Abstractive Summarization for How2 Videos. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019. doi:10.18653/v1/P19-1659
-
[37]
2023 , eprint=
Token Merging: Your ViT But Faster , author=. 2023 , eprint=
2023
-
[38]
2025 , eprint=
Qwen3-VL Technical Report , author=. 2025 , eprint=
2025
-
[39]
2025 , eprint=
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling , author=. 2025 , eprint=
2025
-
[40]
VCSUM : A Versatile C hinese Meeting Summarization Dataset
Wu, Han and Zhan, Mingjie and Tan, Haochen and Hou, Zhaohui and Liang, Ding and Song, Linqi. VCSUM : A Versatile C hinese Meeting Summarization Dataset. Findings of the Association for Computational Linguistics: ACL 2023. 2023. doi:10.18653/v1/2023.findings-acl.377
-
[41]
Rahimi, Elahe and Sajjad, Hassan and Rosati, Domenic and Badawi, Abeer and Dolatabadi, Elham and Rudzicz, Frank. Not Lost After All: How Cross-Encoder Attribution Challenges Position Bias Assumptions in LLM Summarization. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025. doi:10.18653/v1/2025.findings-emnlp.846
-
[42]
2024 , eprint=
CIDER: Counterfactual-Invariant Diffusion-based GNN Explainer for Causal Subgraph Inference , author=. 2024 , eprint=
2024
-
[43]
Reducing Position Bias in Simultaneous Machine Translation with Length-Aware Framework
Zhang, Shaolei and Feng, Yang. Reducing Position Bias in Simultaneous Machine Translation with Length-Aware Framework. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2022. doi:10.18653/v1/2022.acl-long.467
-
[44]
Saito, Kuniaki and Lee, Chen-Yu and Sohn, Kihyuk and Ushiku, Yoshitaka. Where is the answer? An empirical study of positional bias for parametric knowledge extraction in language model. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Pap...
-
[45]
N ewsroom: A Dataset of 1.3 Million Summaries with Diverse Extractive Strategies
Grusky, Max and Naaman, Mor and Artzi, Yoav. N ewsroom: A Dataset of 1.3 Million Summaries with Diverse Extractive Strategies. Proceedings of the 2018 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). 2018. doi:10.18653/v1/N18-1065
-
[46]
2025 , howpublished =
GPT-5.1 Instant and GPT-5.1 Thinking System Card Addendum , author =. 2025 , howpublished =
2025
-
[47]
2018 , month = nov, url =
Many Turn to YouTube for Children's Content, News, How-To Lessons , author =. 2018 , month = nov, url =
2018
-
[48]
Covington, Paul and Adams, Jay and Sargin, Emre , title =. 2016 , isbn =. doi:10.1145/2959100.2959190 , booktitle =
-
[49]
2025 , url=
Tianhao Peng and Haochen Wang and Yuanxing Zhang and Zekun Moore Wang and Zili Wang and Ge Zhang and Jian Yang and Shihao Li and Yanghai Wang and Xintao Wang and Houyi Li and Wei Ji and Pengfei Wan and Wenhao Huang and Zhaoxiang Zhang and Jiaheng Liu , booktitle=. 2025 , url=
2025
-
[50]
2026 , eprint=
MVPBench: A Multi-Video Perception Evaluation Benchmark for Multi-Modal Video Understanding , author=. 2026 , eprint=
2026
-
[51]
Data- Q uest E val: A Referenceless Metric for Data-to-Text Semantic Evaluation
Rebuffel, Clement and Scialom, Thomas and Soulier, Laure and Piwowarski, Benjamin and Lamprier, Sylvain and Staiano, Jacopo and Scoutheeten, Geoffrey and Gallinari, Patrick. Data- Q uest E val: A Referenceless Metric for Data-to-Text Semantic Evaluation. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 2021. doi:10.1...
-
[52]
2025 , eprint=
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency , author=. 2025 , eprint=
2025
-
[53]
2026 , eprint=
MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction , author=. 2026 , eprint=
2026
-
[54]
2026 , eprint=
GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning , author=. 2026 , eprint=
2026
-
[55]
2025 , eprint=
MiMo-VL Technical Report , author=. 2025 , eprint=
2025
-
[56]
Gemini 3 Pro Model Card , year =
-
[57]
2026 , eprint=
OpenAI GPT-5 System Card , author=. 2026 , eprint=
2026
-
[58]
2026 , month = feb, url =
Gemini 3.1 Pro Model Card , author =. 2026 , month = feb, url =
2026
-
[59]
2026 , month = mar, day =
2026
-
[60]
Chang, Yupeng and Wang, Xu and Wang, Jindong and Wu, Yuan and Yang, Linyi and Zhu, Kaijie and Chen, Hao and Yi, Xiaoyuan and Wang, Cunxiang and Wang, Yidong and Ye, Wei and Zhang, Yue and Chang, Yi and Yu, Philip S. and Yang, Qiang and Xie, Xing , title =. 2024 , issue_date =. doi:10.1145/3641289 , journal =
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.