pith. sign in

arxiv: 2606.04596 · v1 · pith:27CQYAWUnew · submitted 2026-06-03 · 💻 cs.CL

A Systematic Evaluation of Positional Bias in Multi-Video Summarization with MLLMs

Pith reviewed 2026-06-28 05:55 UTC · model grok-4.3

classification 💻 cs.CL
keywords positional biasmulti-video summarizationmultimodal large language modelsvideo understandingbenchmark evaluationinput order effects
0
0 comments X

The pith

The position of a video in a multi-video input affects summary quality in MLLMs, with effects that vary by domain and model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates how the order of videos in multi-video inputs influences the quality of per-video summaries produced by MLLMs. It constructs a benchmark using ActivityNet and News videos across Cooking, Domestic, Leisure, and News settings with two- and four-video inputs. Nine open-source and proprietary models are tested using three metrics: Coverage, Directional Positional Bias, and Middle-Edge Gap. Results indicate that positional effects depend on the domain and model, that signed directional bias can remain small even when middle positions underperform, and that increasing visual or generation budget does not uniformly remove the imbalance. Prompt-level mitigation methods are also analyzed.

Core claim

Positional effects in multi-video summarization are domain- and model-dependent: signed directional bias can be small even when middle positions underperform, and increasing visual or generation budget does not uniformly remove the imbalance.

What carries the argument

Three complementary metrics—Coverage, Directional Positional Bias (DPB), and Middle-Edge Gap (MEG)—applied to a benchmark of two- and four-video inputs from ActivityNet and News videos to measure how input slot changes summary quality for unchanged content.

Load-bearing premise

The chosen ActivityNet and News video clips remain representative of real multi-video inputs and that the three metrics isolate positional bias without confounding effects from video length or content complexity.

What would settle it

An experiment that swaps video positions across identical content sets and checks whether measured differences in summary quality reverse or disappear for the tested models and domains.

Figures

Figures reproduced from arXiv: 2606.04596 by Huangchen Xu, Yi Chang, Yuan Wu.

Figure 1
Figure 1. Figure 1: Overview of our evaluation pipeline. We construct cyclic orderings for pairwise and listwise multi-video [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Visual-budget robustness check for Qwen3- [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Boundary-format comparison for InternVL3.5-8B. sition 2 improves while position 4 drops substan￾tially, suggesting that text-only boundaries flatten the curve rather than consistently improving cov￾erage. For Qwen3-VL-8B, removing black frames does not remove the middle-position weakness. As a slot-boundary sanity check, we also com￾pare each generated slot summary with its own reference and with the other… view at source ↗
Figure 5
Figure 5. Figure 5: Baseline Coverage–position curves for model group A. Each panel corresponds to one domain, duration, [PITH_FULL_IMAGE:figures/full_fig_p015_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Baseline Coverage–position curves for model group B. The panel order matches Figure [PITH_FULL_IMAGE:figures/full_fig_p016_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Equal-attention prompt Coverage curves across all short four-video domains. Solid lines are baseline prompt-first black-frame runs; dashed lines add the explicit equal-attention instruction. Domain Dur. P Joint Single Gain Cooking S 2 0.3635 0.3113 −5.22 Cooking S 4 0.3092 0.3181 +0.88 Domestic L 2 0.3551 0.3193 −3.59 Domestic L 4 0.3139 0.3148 +0.09 Domestic S 2 0.4591 0.4289 −3.02 Domestic S 4 0.3508 0.3… view at source ↗
Figure 8
Figure 8. Figure 8: Qwen3-VL-8B cyclic versus full-permutation Coverage curves. The full-permutation design averages over all order contexts for each target position. 18 [PITH_FULL_IMAGE:figures/full_fig_p018_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Example 1. The four responses consistently preserve the core ignition-related details (e.g., fire starter and catches fire), while the contextual detail about children watching appears only in Position 1. 19 [PITH_FULL_IMAGE:figures/full_fig_p019_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Manually verified prompt-placement leakage case. The automatic miner selected the highlighted concrete anchors because they are unsupported by the claimed target reference but supported by another video in the same input group. The human audit labels this case as confirmed cross-video leakage. 20 [PITH_FULL_IMAGE:figures/full_fig_p020_10.png] view at source ↗
read the original abstract

Multimodal Large Language Models (MLLMs) are increasingly used for video understanding, yet their reliability under multi-video inputs remains poorly understood. We study positional bias in multi-video summarization, where the quality of a per-video summary can change with the video's input slot even when the underlying content is unchanged. We construct a benchmark from ActivityNet and News videos, covering Cooking, Domestic, Leisure, and News settings with two- and four-video inputs. We evaluate nine open-source and proprietary MLLMs and measure position effects with three complementary metrics: Coverage, Directional Positional Bias (DPB), and Middle-Edge Gap (MEG). Our results show that positional effects are domain- and model-dependent: signed directional bias can be small even when middle positions underperform, and increasing visual or generation budget does not uniformly remove the imbalance. We further analyze prompt-level mitigation methods. Together, the results show that multi-video summarization remains sensitive to input protocol and position, motivating more robust order-invariant multimodal systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents a systematic empirical evaluation of positional bias in multi-video summarization tasks performed by nine open-source and proprietary MLLMs. Using a benchmark constructed from ActivityNet and News video clips across Cooking, Domestic, Leisure, and News domains, the authors examine two- and four-video inputs and quantify position effects via three metrics (Coverage, Directional Positional Bias (DPB), and Middle-Edge Gap (MEG)). The central claim is that positional effects are domain- and model-dependent, that signed directional bias can remain small even when middle positions underperform, and that increasing visual or generation budgets does not uniformly eliminate the imbalance; prompt-level mitigation strategies are also analyzed.

Significance. If the results hold after addressing methodological gaps, the work is significant as a timely measurement study that documents concrete limitations of current MLLMs when processing multiple videos—an increasingly common input regime. The domain- and model-specific patterns, together with the complementary metrics, supply practitioners with actionable diagnostics and motivate the development of order-invariant multimodal architectures. The purely empirical design avoids circularity and offers falsifiable observations that can be directly replicated or extended.

major comments (2)
  1. [Abstract] Abstract and Methods: The abstract states results on domain- and model-dependence but supplies no statistical tests, error bars, sample sizes, or exclusion criteria. Without these, it is impossible to judge whether the reported effects are robust or whether observed middle-position underperformance reflects sampling variability.
  2. [Evaluation Metrics] Evaluation setup and metric definitions: The central claim requires that Coverage, DPB, and MEG isolate positional bias after permuting identical video sets. If ActivityNet/News clips vary in duration or scene density and position assignment correlates with these properties, the observed imbalances could be driven by content rather than slot. The manuscript must explicitly describe length normalization, content-matched controls, or permutation protocols to rule out this confound.
minor comments (1)
  1. [Abstract] Abstract: Adding a parenthetical note on the total number of video clips or input instances used would help readers gauge the scale of the benchmark.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to improve clarity on statistical reporting and methodological details.

read point-by-point responses
  1. Referee: [Abstract] Abstract and Methods: The abstract states results on domain- and model-dependence but supplies no statistical tests, error bars, sample sizes, or exclusion criteria. Without these, it is impossible to judge whether the reported effects are robust or whether observed middle-position underperformance reflects sampling variability.

    Authors: The abstract serves as a concise overview; full details on sample sizes (50 video sets per domain and input size, each evaluated over 5 random seeds), error bars (standard error in all figures/tables), and exclusion criteria (videos <10s or with transcription failures removed) appear in Section 3 and the appendix. We agree the abstract should signal robustness and will revise it to note 'results averaged over permutations with standard errors'. We will also add Wilcoxon signed-rank tests for position effects to the Methods section. revision: yes

  2. Referee: [Evaluation Metrics] Evaluation setup and metric definitions: The central claim requires that Coverage, DPB, and MEG isolate positional bias after permuting identical video sets. If ActivityNet/News clips vary in duration or scene density and position assignment correlates with these properties, the observed imbalances could be driven by content rather than slot. The manuscript must explicitly describe length normalization, content-matched controls, or permutation protocols to rule out this confound.

    Authors: The benchmark explicitly constructs sets from the same domain with comparable durations (clips selected within 20% length variance) and applies random permutations of video order for every trial, ensuring each video appears equally often in every position across the experiment. This isolates positional effects. We will add an expanded 'Benchmark Construction' paragraph detailing the exact permutation protocol, set-matching criteria, and confirmation that no post-hoc length normalization is applied beyond initial selection. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical measurement study

full rationale

The paper constructs a benchmark from ActivityNet and News videos, evaluates nine MLLMs on two- and four-video inputs, and reports position effects via three metrics (Coverage, DPB, MEG). No equations, fitted parameters, uniqueness theorems, or self-citations appear in the derivation chain; all claims rest on direct experimental measurements rather than any reduction of outputs to inputs by construction. The central results (domain- and model-dependent positional effects) are therefore independent of the inputs and receive no circularity penalty.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Only abstract available; no explicit free parameters, axioms, or invented entities are stated. The implicit assumption that position is the sole manipulated variable while content is held constant is treated as a domain assumption.

axioms (1)
  • domain assumption Video content remains identical when its input position changes.
    Stated in the problem definition in the abstract.

pith-pipeline@v0.9.1-grok · 5709 in / 1139 out tokens · 22894 ms · 2026-06-28T05:55:58.083306+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

60 extracted references · 27 canonical work pages

  1. [1]

    Proceedings of the 2nd Workshop on User-Centric Narrative Summarization of Long Videos , pages =

    Kansal, Kajal and Kansal, Nikita and Bavana, Sreevaatsav and Vamshi, Bodla Krishna and Goyal, Nidhi , title =. Proceedings of the 2nd Workshop on User-Centric Narrative Summarization of Long Videos , pages =. 2023 , isbn =. doi:10.1145/3607540.3617139 , abstract =

  2. [2]

    Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

    Zhang, Hang and Li, Xin and Bing, Lidong. Video- LL a MA : An Instruction-tuned Audio-Visual Language Model for Video Understanding. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. 2023. doi:10.18653/v1/2023.emnlp-demo.49

  3. [3]

    2025 , eprint=

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities , author=. 2025 , eprint=

  4. [4]

    MLVU: Benchmarking Multi-task Long Video Understanding , year=

    Zhou, Junjie and Shu, Yan and Zhao, Bo and Wu, Boya and Liang, Zhengyang and Xiao, Shitao and Qin, Minghao and Yang, Xi and Xiong, Yongping and Zhang, Bo and Huang, Tiejun and Liu, Zheng , booktitle=. MLVU: Benchmarking Multi-task Long Video Understanding , year=

  5. [5]

    Proceedings of the AAAI conference on artificial intelligence , volume=

    Activitynet-qa: A dataset for understanding complex web videos via question answering , author=. Proceedings of the AAAI conference on artificial intelligence , volume=

  6. [6]

    QEVA : A Reference-Free Evaluation Metric for Narrative Video Summarization with Multimodal Question Answering

    Jung, Woojun and Kim, Junyeong. QEVA : A Reference-Free Evaluation Metric for Narrative Video Summarization with Multimodal Question Answering. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025. doi:10.18653/v1/2025.findings-emnlp.1340

  7. [7]

    Countering the Effects of Lead Bias in News Summarization via Multi-Stage Training and Auxiliary Losses

    Grenander, Matt and Dong, Yue and Cheung, Jackie Chi Kit and Louis, Annie. Countering the Effects of Lead Bias in News Summarization via Multi-Stage Training and Auxiliary Losses. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). ...

  8. [8]

    URLhttps://aclanthology.org/2024.tacl-1.9/

    Liu, Nelson F. and Lin, Kevin and Hewitt, John and Paranjape, Ashwin and Bevilacqua, Michele and Petroni, Fabio and Liang, Percy. Lost in the Middle: How Language Models Use Long Contexts. Transactions of the Association for Computational Linguistics. 2024. doi:10.1162/tacl_a_00638

  9. [9]

    Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

    Identifying and mitigating position bias of multi-image vision-language models , author=. Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

  10. [10]

    2025 , eprint=

    Video-LevelGauge: Investigating Contextual Positional Bias in Large Video Language Models , author=. 2025 , eprint=

  11. [11]

    2015 , volume=

    Heilbron, Fabian Caba and Escorcia, Victor and Ghanem, Bernard and Niebles, Juan Carlos , booktitle=. 2015 , volume=

  12. [12]

    Incorporating Background Knowledge into Video Description Generation

    Whitehead, Spencer and Ji, Heng and Bansal, Mohit and Chang, Shih-Fu and Voss, Clare. Incorporating Background Knowledge into Video Description Generation. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 2018. doi:10.18653/v1/D18-1433

  13. [13]

    Characterizing Positional Bias in Large Language Models: A Multi-Model Evaluation of Prompt Order Effects

    Schilcher, Patrick and Karasin, Dominik and Sch. Characterizing Positional Bias in Large Language Models: A Multi-Model Evaluation of Prompt Order Effects. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025. doi:10.18653/v1/2025.findings-emnlp.1124

  14. [14]

    P o S um-Bench: Benchmarking Position Bias in LLM -based Conversational Summarization

    Sun, Xu and Delphin-Poulat, Lionel and Tarnec, Christ \`e le and Shimorina, Anastasia. P o S um-Bench: Benchmarking Position Bias in LLM -based Conversational Summarization. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025. doi:10.18653/v1/2025.emnlp-main.404

  15. [15]

    European Conference on Computer Vision , year=

    LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models , author=. European Conference on Computer Vision , year=

  16. [16]

    Zico and Morency, Louis-Philippe and Salakhutdinov, Ruslan

    Tsai, Yao-Hung Hubert and Bai, Shaojie and Liang, Paul Pu and Kolter, J. Zico and Morency, Louis-Philippe and Salakhutdinov, Ruslan. Multimodal Transformer for Unaligned Multimodal Language Sequences. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019. doi:10.18653/v1/P19-1656

  17. [17]

    Large Language Models are not Fair Evaluators

    Wang, Peiyi and Li, Lei and Chen, Liang and Cai, Zefan and Zhu, Dawei and Lin, Binghuai and Cao, Yunbo and Kong, Lingpeng and Liu, Qi and Liu, Tianyu and Sui, Zhifang. Large Language Models are not Fair Evaluators. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.ac...

  18. [18]

    ROUGE : A Package for Automatic Evaluation of Summaries

    Lin, Chin-Yew. ROUGE : A Package for Automatic Evaluation of Summaries. Text Summarization Branches Out. 2004

  19. [19]

    METEOR : An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments

    Banerjee, Satanjeev and Lavie, Alon. METEOR : An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. 2005

  20. [20]

    2020 , eprint=

    BERTScore: Evaluating Text Generation with BERT , author=. 2020 , eprint=

  21. [21]

    Proceedings of the 37th International Conference on Neural Information Processing Systems , articleno =

    Dai, Wenliang and Li, Junnan and Li, Dongxu and Tiong, Anthony Meng Huat and Zhao, Junqi and Wang, Weisheng and Li, Boyang and Fung, Pascale and Hoi, Steven , title =. Proceedings of the 37th International Conference on Neural Information Processing Systems , articleno =. 2023 , publisher =

  22. [22]

    Judging the Judges: A Systematic Study of Position Bias in LLM -as-a-Judge

    Shi, Lin and Ma, Chiyu and Liang, Wenhua and Diao, Xingjian and Ma, Weicheng and Vosoughi, Soroush. Judging the Judges: A Systematic Study of Position Bias in LLM -as-a-Judge. Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguis...

  23. [23]

    2024 , eprint=

    Large Language Models are Zero-Shot Rankers for Recommender Systems , author=. 2024 , eprint=

  24. [24]

    Found in the Middle: How Language Models Use Long Contexts Better via Plug-and-Play Positional Encoding , url =

    Zhang, Zhenyu and Chen, Runjin and Liu, Shiwei and Yao, Zhewei and Ruwase, Olatunji and Chen, Beidi and Wu, Xiaoxia and Wang, Zhangyang , booktitle =. Found in the Middle: How Language Models Use Long Contexts Better via Plug-and-Play Positional Encoding , url =. doi:10.52202/079017-1943 , editor =

  25. [25]

    Integrating Video and Text: A Balanced Approach to Multimodal Summary Generation and Evaluation

    Pennec, Galann and Liu, Zhengyuan and Asher, Nicholas and Muller, Philippe and Chen, Nancy F. Integrating Video and Text: A Balanced Approach to Multimodal Summary Generation and Evaluation. Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Comput...

  26. [26]

    On Positional Bias of Faithfulness for Long-form Summarization

    Wan, David and Vig, Jesse and Bansal, Mohit and Joty, Shafiq. On Positional Bias of Faithfulness for Long-form Summarization. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.naacl-long.442

  27. [27]

    Split and Merge: Aligning Position Biases in LLM -based Evaluators

    Li, Zongjie and Wang, Chaozheng and Ma, Pingchuan and Wu, Daoyuan and Wang, Shuai and Gao, Cuiyun and Liu, Yang. Split and Merge: Aligning Position Biases in LLM -based Evaluators. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.621

  28. [28]

    Findings of the Association for Computational Linguistics: ACL 2024 , month = aug, year =

    Koo, Ryan and Lee, Minhwa and Raheja, Vipul and Park, Jong Inn and Kim, Zae Myung and Kang, Dongyeop. Benchmarking Cognitive Biases in Large Language Models as Evaluators. Findings of the Association for Computational Linguistics: ACL 2024. 2024. doi:10.18653/v1/2024.findings-acl.29

  29. [29]

    2023 , address =

    Liu, Yang and Iter, Dan and Xu, Yichong and Wang, Shuohang and Xu, Ruochen and Zhu, Chenguang. G -Eval: NLG Evaluation using Gpt-4 with Better Human Alignment. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. doi:10.18653/v1/2023.emnlp-main.153

  30. [30]

    GPTS core: Evaluate as You Desire

    Fu, Jinlan and Ng, See-Kiong and Jiang, Zhengbao and Liu, Pengfei. GPTS core: Evaluate as You Desire. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.naacl-long.365

  31. [31]

    B leu: a Method for Automatic Evaluation of Machine Translation

    Papineni, Kishore and Roukos, Salim and Ward, Todd and Zhu, Wei-Jing. B leu: a Method for Automatic Evaluation of Machine Translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. 2002. doi:10.3115/1073083.1073135

  32. [32]

    Evaluating the Factual Consistency of Abstractive Text Summarization

    Kryscinski, Wojciech and McCann, Bryan and Xiong, Caiming and Socher, Richard. Evaluating the Factual Consistency of Abstractive Text Summarization. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020. doi:10.18653/v1/2020.emnlp-main.750

  33. [33]

    2025 , eprint=

    InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling , author=. 2025 , eprint=

  34. [34]

    Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

    Identifying and Mitigating Position Bias of Multi-image Vision-Language Models , author=. Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

  35. [35]

    2023 , eprint=

    VideoLLM: Modeling Video Sequence with Large Language Models , author=. 2023 , eprint=

  36. [36]

    Multimodal Abstractive Summarization for How2 Videos

    Palaskar, Shruti and Libovick \'y , Jind r ich and Gella, Spandana and Metze, Florian. Multimodal Abstractive Summarization for How2 Videos. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019. doi:10.18653/v1/P19-1659

  37. [37]

    2023 , eprint=

    Token Merging: Your ViT But Faster , author=. 2023 , eprint=

  38. [38]

    2025 , eprint=

    Qwen3-VL Technical Report , author=. 2025 , eprint=

  39. [39]

    2025 , eprint=

    Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling , author=. 2025 , eprint=

  40. [40]

    VCSUM : A Versatile C hinese Meeting Summarization Dataset

    Wu, Han and Zhan, Mingjie and Tan, Haochen and Hou, Zhaohui and Liang, Ding and Song, Linqi. VCSUM : A Versatile C hinese Meeting Summarization Dataset. Findings of the Association for Computational Linguistics: ACL 2023. 2023. doi:10.18653/v1/2023.findings-acl.377

  41. [41]

    Not Lost After All: How Cross-Encoder Attribution Challenges Position Bias Assumptions in LLM Summarization

    Rahimi, Elahe and Sajjad, Hassan and Rosati, Domenic and Badawi, Abeer and Dolatabadi, Elham and Rudzicz, Frank. Not Lost After All: How Cross-Encoder Attribution Challenges Position Bias Assumptions in LLM Summarization. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025. doi:10.18653/v1/2025.findings-emnlp.846

  42. [42]

    2024 , eprint=

    CIDER: Counterfactual-Invariant Diffusion-based GNN Explainer for Causal Subgraph Inference , author=. 2024 , eprint=

  43. [43]

    Reducing Position Bias in Simultaneous Machine Translation with Length-Aware Framework

    Zhang, Shaolei and Feng, Yang. Reducing Position Bias in Simultaneous Machine Translation with Length-Aware Framework. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2022. doi:10.18653/v1/2022.acl-long.467

  44. [44]

    Where is the answer? An empirical study of positional bias for parametric knowledge extraction in language model

    Saito, Kuniaki and Lee, Chen-Yu and Sohn, Kihyuk and Ushiku, Yoshitaka. Where is the answer? An empirical study of positional bias for parametric knowledge extraction in language model. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Pap...

  45. [45]

    N ewsroom: A Dataset of 1.3 Million Summaries with Diverse Extractive Strategies

    Grusky, Max and Naaman, Mor and Artzi, Yoav. N ewsroom: A Dataset of 1.3 Million Summaries with Diverse Extractive Strategies. Proceedings of the 2018 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). 2018. doi:10.18653/v1/N18-1065

  46. [46]

    2025 , howpublished =

    GPT-5.1 Instant and GPT-5.1 Thinking System Card Addendum , author =. 2025 , howpublished =

  47. [47]

    2018 , month = nov, url =

    Many Turn to YouTube for Children's Content, News, How-To Lessons , author =. 2018 , month = nov, url =

  48. [48]

    2016 , isbn =

    Covington, Paul and Adams, Jay and Sargin, Emre , title =. 2016 , isbn =. doi:10.1145/2959100.2959190 , booktitle =

  49. [49]

    2025 , url=

    Tianhao Peng and Haochen Wang and Yuanxing Zhang and Zekun Moore Wang and Zili Wang and Ge Zhang and Jian Yang and Shihao Li and Yanghai Wang and Xintao Wang and Houyi Li and Wei Ji and Pengfei Wan and Wenhao Huang and Zhaoxiang Zhang and Jiaheng Liu , booktitle=. 2025 , url=

  50. [50]

    2026 , eprint=

    MVPBench: A Multi-Video Perception Evaluation Benchmark for Multi-Modal Video Understanding , author=. 2026 , eprint=

  51. [51]

    Data- Q uest E val: A Referenceless Metric for Data-to-Text Semantic Evaluation

    Rebuffel, Clement and Scialom, Thomas and Soulier, Laure and Piwowarski, Benjamin and Lamprier, Sylvain and Staiano, Jacopo and Scoutheeten, Geoffrey and Gallinari, Patrick. Data- Q uest E val: A Referenceless Metric for Data-to-Text Semantic Evaluation. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 2021. doi:10.1...

  52. [52]

    2025 , eprint=

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency , author=. 2025 , eprint=

  53. [53]

    2026 , eprint=

    MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction , author=. 2026 , eprint=

  54. [54]

    2026 , eprint=

    GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning , author=. 2026 , eprint=

  55. [55]

    2025 , eprint=

    MiMo-VL Technical Report , author=. 2025 , eprint=

  56. [56]

    Gemini 3 Pro Model Card , year =

  57. [57]

    2026 , eprint=

    OpenAI GPT-5 System Card , author=. 2026 , eprint=

  58. [58]

    2026 , month = feb, url =

    Gemini 3.1 Pro Model Card , author =. 2026 , month = feb, url =

  59. [59]

    2026 , month = mar, day =

  60. [60]

    and Yang, Qiang and Xie, Xing , title =

    Chang, Yupeng and Wang, Xu and Wang, Jindong and Wu, Yuan and Yang, Linyi and Zhu, Kaijie and Chen, Hao and Yi, Xiaoyuan and Wang, Cunxiang and Wang, Yidong and Ye, Wei and Zhang, Yue and Chang, Yi and Yu, Philip S. and Yang, Qiang and Xie, Xing , title =. 2024 , issue_date =. doi:10.1145/3641289 , journal =