A Systematic Evaluation of Positional Bias in Multi-Video Summarization with MLLMs

Huangchen Xu; Yi Chang; Yuan Wu

arxiv: 2606.04596 · v1 · pith:27CQYAWUnew · submitted 2026-06-03 · 💻 cs.CL

A Systematic Evaluation of Positional Bias in Multi-Video Summarization with MLLMs

Huangchen Xu , Yuan Wu , Yi Chang This is my paper

Pith reviewed 2026-06-28 05:55 UTC · model grok-4.3

classification 💻 cs.CL

keywords positional biasmulti-video summarizationmultimodal large language modelsvideo understandingbenchmark evaluationinput order effects

0 comments

The pith

The position of a video in a multi-video input affects summary quality in MLLMs, with effects that vary by domain and model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates how the order of videos in multi-video inputs influences the quality of per-video summaries produced by MLLMs. It constructs a benchmark using ActivityNet and News videos across Cooking, Domestic, Leisure, and News settings with two- and four-video inputs. Nine open-source and proprietary models are tested using three metrics: Coverage, Directional Positional Bias, and Middle-Edge Gap. Results indicate that positional effects depend on the domain and model, that signed directional bias can remain small even when middle positions underperform, and that increasing visual or generation budget does not uniformly remove the imbalance. Prompt-level mitigation methods are also analyzed.

Core claim

Positional effects in multi-video summarization are domain- and model-dependent: signed directional bias can be small even when middle positions underperform, and increasing visual or generation budget does not uniformly remove the imbalance.

What carries the argument

Three complementary metrics—Coverage, Directional Positional Bias (DPB), and Middle-Edge Gap (MEG)—applied to a benchmark of two- and four-video inputs from ActivityNet and News videos to measure how input slot changes summary quality for unchanged content.

Load-bearing premise

The chosen ActivityNet and News video clips remain representative of real multi-video inputs and that the three metrics isolate positional bias without confounding effects from video length or content complexity.

What would settle it

An experiment that swaps video positions across identical content sets and checks whether measured differences in summary quality reverse or disappear for the tested models and domains.

Figures

Figures reproduced from arXiv: 2606.04596 by Huangchen Xu, Yi Chang, Yuan Wu.

**Figure 1.** Figure 1: Overview of our evaluation pipeline. We construct cyclic orderings for pairwise and listwise multi-video [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗

**Figure 2.** Figure 2: Visual-budget robustness check for Qwen3- [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: Boundary-format comparison for InternVL3.5-8B. sition 2 improves while position 4 drops substantially, suggesting that text-only boundaries flatten the curve rather than consistently improving coverage. For Qwen3-VL-8B, removing black frames does not remove the middle-position weakness. As a slot-boundary sanity check, we also compare each generated slot summary with its own reference and with the other… view at source ↗

**Figure 5.** Figure 5: Baseline Coverage–position curves for model group A. Each panel corresponds to one domain, duration, [PITH_FULL_IMAGE:figures/full_fig_p015_5.png] view at source ↗

**Figure 6.** Figure 6: Baseline Coverage–position curves for model group B. The panel order matches Figure [PITH_FULL_IMAGE:figures/full_fig_p016_6.png] view at source ↗

**Figure 7.** Figure 7: Equal-attention prompt Coverage curves across all short four-video domains. Solid lines are baseline prompt-first black-frame runs; dashed lines add the explicit equal-attention instruction. Domain Dur. P Joint Single Gain Cooking S 2 0.3635 0.3113 −5.22 Cooking S 4 0.3092 0.3181 +0.88 Domestic L 2 0.3551 0.3193 −3.59 Domestic L 4 0.3139 0.3148 +0.09 Domestic S 2 0.4591 0.4289 −3.02 Domestic S 4 0.3508 0.3… view at source ↗

**Figure 8.** Figure 8: Qwen3-VL-8B cyclic versus full-permutation Coverage curves. The full-permutation design averages over all order contexts for each target position. 18 [PITH_FULL_IMAGE:figures/full_fig_p018_8.png] view at source ↗

**Figure 9.** Figure 9: Example 1. The four responses consistently preserve the core ignition-related details (e.g., fire starter and catches fire), while the contextual detail about children watching appears only in Position 1. 19 [PITH_FULL_IMAGE:figures/full_fig_p019_9.png] view at source ↗

**Figure 10.** Figure 10: Manually verified prompt-placement leakage case. The automatic miner selected the highlighted concrete anchors because they are unsupported by the claimed target reference but supported by another video in the same input group. The human audit labels this case as confirmed cross-video leakage. 20 [PITH_FULL_IMAGE:figures/full_fig_p020_10.png] view at source ↗

read the original abstract

Multimodal Large Language Models (MLLMs) are increasingly used for video understanding, yet their reliability under multi-video inputs remains poorly understood. We study positional bias in multi-video summarization, where the quality of a per-video summary can change with the video's input slot even when the underlying content is unchanged. We construct a benchmark from ActivityNet and News videos, covering Cooking, Domestic, Leisure, and News settings with two- and four-video inputs. We evaluate nine open-source and proprietary MLLMs and measure position effects with three complementary metrics: Coverage, Directional Positional Bias (DPB), and Middle-Edge Gap (MEG). Our results show that positional effects are domain- and model-dependent: signed directional bias can be small even when middle positions underperform, and increasing visual or generation budget does not uniformly remove the imbalance. We further analyze prompt-level mitigation methods. Together, the results show that multi-video summarization remains sensitive to input protocol and position, motivating more robust order-invariant multimodal systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a practical benchmark and three metrics for measuring positional bias in multi-video MLLM summarization, with results that vary by model and domain.

read the letter

The main thing here is an empirical check on how input order affects per-video summary quality when MLLMs handle multiple videos at once. The authors build a benchmark from ActivityNet and News clips across cooking, domestic, leisure, and news domains, run nine models on two- and four-video setups, and track effects with Coverage, Directional Positional Bias, and Middle-Edge Gap.

What is new is the multi-video benchmark itself plus the three metrics applied together to this setting. Prior work on position bias in text LLMs exists, but this extends it to video inputs with systematic comparisons and some prompt mitigation tests. The paper does a reasonable job showing that signed bias can stay small while middle positions still lag, and that extra visual or generation budget does not remove the imbalance across the board.

The soft spots are the lack of any statistical tests, error bars, or sample-size details in the abstract, which makes it hard to judge how stable the domain and model differences really are. The stress-test concern about video length or complexity confounding the position metrics also looks live on the abstract alone; nothing there indicates explicit length matching or content controls beyond domain labels. If the full methods section does not fix that, the metrics may not isolate position cleanly.

This is for people building or deploying MLLMs on multi-video tasks who need to know about order sensitivity. Readers who care about reliability benchmarks will find the setup and the model-dependence findings useful. It is worth sending to peer review because the core measurement question is concrete and the benchmark could be reused, even if the current evidence needs tightening on controls and stats.

Referee Report

2 major / 1 minor

Summary. The manuscript presents a systematic empirical evaluation of positional bias in multi-video summarization tasks performed by nine open-source and proprietary MLLMs. Using a benchmark constructed from ActivityNet and News video clips across Cooking, Domestic, Leisure, and News domains, the authors examine two- and four-video inputs and quantify position effects via three metrics (Coverage, Directional Positional Bias (DPB), and Middle-Edge Gap (MEG)). The central claim is that positional effects are domain- and model-dependent, that signed directional bias can remain small even when middle positions underperform, and that increasing visual or generation budgets does not uniformly eliminate the imbalance; prompt-level mitigation strategies are also analyzed.

Significance. If the results hold after addressing methodological gaps, the work is significant as a timely measurement study that documents concrete limitations of current MLLMs when processing multiple videos—an increasingly common input regime. The domain- and model-specific patterns, together with the complementary metrics, supply practitioners with actionable diagnostics and motivate the development of order-invariant multimodal architectures. The purely empirical design avoids circularity and offers falsifiable observations that can be directly replicated or extended.

major comments (2)

[Abstract] Abstract and Methods: The abstract states results on domain- and model-dependence but supplies no statistical tests, error bars, sample sizes, or exclusion criteria. Without these, it is impossible to judge whether the reported effects are robust or whether observed middle-position underperformance reflects sampling variability.
[Evaluation Metrics] Evaluation setup and metric definitions: The central claim requires that Coverage, DPB, and MEG isolate positional bias after permuting identical video sets. If ActivityNet/News clips vary in duration or scene density and position assignment correlates with these properties, the observed imbalances could be driven by content rather than slot. The manuscript must explicitly describe length normalization, content-matched controls, or permutation protocols to rule out this confound.

minor comments (1)

[Abstract] Abstract: Adding a parenthetical note on the total number of video clips or input instances used would help readers gauge the scale of the benchmark.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to improve clarity on statistical reporting and methodological details.

read point-by-point responses

Referee: [Abstract] Abstract and Methods: The abstract states results on domain- and model-dependence but supplies no statistical tests, error bars, sample sizes, or exclusion criteria. Without these, it is impossible to judge whether the reported effects are robust or whether observed middle-position underperformance reflects sampling variability.

Authors: The abstract serves as a concise overview; full details on sample sizes (50 video sets per domain and input size, each evaluated over 5 random seeds), error bars (standard error in all figures/tables), and exclusion criteria (videos <10s or with transcription failures removed) appear in Section 3 and the appendix. We agree the abstract should signal robustness and will revise it to note 'results averaged over permutations with standard errors'. We will also add Wilcoxon signed-rank tests for position effects to the Methods section. revision: yes
Referee: [Evaluation Metrics] Evaluation setup and metric definitions: The central claim requires that Coverage, DPB, and MEG isolate positional bias after permuting identical video sets. If ActivityNet/News clips vary in duration or scene density and position assignment correlates with these properties, the observed imbalances could be driven by content rather than slot. The manuscript must explicitly describe length normalization, content-matched controls, or permutation protocols to rule out this confound.

Authors: The benchmark explicitly constructs sets from the same domain with comparable durations (clips selected within 20% length variance) and applies random permutations of video order for every trial, ensuring each video appears equally often in every position across the experiment. This isolates positional effects. We will add an expanded 'Benchmark Construction' paragraph detailing the exact permutation protocol, set-matching criteria, and confirmation that no post-hoc length normalization is applied beyond initial selection. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical measurement study

full rationale

The paper constructs a benchmark from ActivityNet and News videos, evaluates nine MLLMs on two- and four-video inputs, and reports position effects via three metrics (Coverage, DPB, MEG). No equations, fitted parameters, uniqueness theorems, or self-citations appear in the derivation chain; all claims rest on direct experimental measurements rather than any reduction of outputs to inputs by construction. The central results (domain- and model-dependent positional effects) are therefore independent of the inputs and receive no circularity penalty.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Only abstract available; no explicit free parameters, axioms, or invented entities are stated. The implicit assumption that position is the sole manipulated variable while content is held constant is treated as a domain assumption.

axioms (1)

domain assumption Video content remains identical when its input position changes.
Stated in the problem definition in the abstract.

pith-pipeline@v0.9.1-grok · 5709 in / 1139 out tokens · 22894 ms · 2026-06-28T05:55:58.083306+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

60 extracted references · 27 canonical work pages

[1]

Proceedings of the 2nd Workshop on User-Centric Narrative Summarization of Long Videos , pages =

Kansal, Kajal and Kansal, Nikita and Bavana, Sreevaatsav and Vamshi, Bodla Krishna and Goyal, Nidhi , title =. Proceedings of the 2nd Workshop on User-Centric Narrative Summarization of Long Videos , pages =. 2023 , isbn =. doi:10.1145/3607540.3617139 , abstract =

work page doi:10.1145/3607540.3617139 2023
[2]

Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

Zhang, Hang and Li, Xin and Bing, Lidong. Video- LL a MA : An Instruction-tuned Audio-Visual Language Model for Video Understanding. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. 2023. doi:10.18653/v1/2023.emnlp-demo.49

work page doi:10.18653/v1/2023.emnlp-demo.49 2023
[3]

2025 , eprint=

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities , author=. 2025 , eprint=

2025
[4]

MLVU: Benchmarking Multi-task Long Video Understanding , year=

Zhou, Junjie and Shu, Yan and Zhao, Bo and Wu, Boya and Liang, Zhengyang and Xiao, Shitao and Qin, Minghao and Yang, Xi and Xiong, Yongping and Zhang, Bo and Huang, Tiejun and Liu, Zheng , booktitle=. MLVU: Benchmarking Multi-task Long Video Understanding , year=
[5]

Proceedings of the AAAI conference on artificial intelligence , volume=

Activitynet-qa: A dataset for understanding complex web videos via question answering , author=. Proceedings of the AAAI conference on artificial intelligence , volume=
[6]

QEVA : A Reference-Free Evaluation Metric for Narrative Video Summarization with Multimodal Question Answering

Jung, Woojun and Kim, Junyeong. QEVA : A Reference-Free Evaluation Metric for Narrative Video Summarization with Multimodal Question Answering. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025. doi:10.18653/v1/2025.findings-emnlp.1340

work page doi:10.18653/v1/2025.findings-emnlp.1340 2025
[7]

Countering the Effects of Lead Bias in News Summarization via Multi-Stage Training and Auxiliary Losses

Grenander, Matt and Dong, Yue and Cheung, Jackie Chi Kit and Louis, Annie. Countering the Effects of Lead Bias in News Summarization via Multi-Stage Training and Auxiliary Losses. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). ...

work page doi:10.18653/v1/d19-1620 2019
[8]

URLhttps://aclanthology.org/2024.tacl-1.9/

Liu, Nelson F. and Lin, Kevin and Hewitt, John and Paranjape, Ashwin and Bevilacqua, Michele and Petroni, Fabio and Liang, Percy. Lost in the Middle: How Language Models Use Long Contexts. Transactions of the Association for Computational Linguistics. 2024. doi:10.1162/tacl_a_00638

work page doi:10.1162/tacl_a_00638 2024
[9]

Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

Identifying and mitigating position bias of multi-image vision-language models , author=. Proceedings of the Computer Vision and Pattern Recognition Conference , pages=
[10]

2025 , eprint=

Video-LevelGauge: Investigating Contextual Positional Bias in Large Video Language Models , author=. 2025 , eprint=

2025
[11]

2015 , volume=

Heilbron, Fabian Caba and Escorcia, Victor and Ghanem, Bernard and Niebles, Juan Carlos , booktitle=. 2015 , volume=

2015
[12]

Incorporating Background Knowledge into Video Description Generation

Whitehead, Spencer and Ji, Heng and Bansal, Mohit and Chang, Shih-Fu and Voss, Clare. Incorporating Background Knowledge into Video Description Generation. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 2018. doi:10.18653/v1/D18-1433

work page doi:10.18653/v1/d18-1433 2018
[13]

Characterizing Positional Bias in Large Language Models: A Multi-Model Evaluation of Prompt Order Effects

Schilcher, Patrick and Karasin, Dominik and Sch. Characterizing Positional Bias in Large Language Models: A Multi-Model Evaluation of Prompt Order Effects. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025. doi:10.18653/v1/2025.findings-emnlp.1124

work page doi:10.18653/v1/2025.findings-emnlp.1124 2025
[14]

P o S um-Bench: Benchmarking Position Bias in LLM -based Conversational Summarization

Sun, Xu and Delphin-Poulat, Lionel and Tarnec, Christ \`e le and Shimorina, Anastasia. P o S um-Bench: Benchmarking Position Bias in LLM -based Conversational Summarization. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025. doi:10.18653/v1/2025.emnlp-main.404

work page doi:10.18653/v1/2025.emnlp-main.404 2025
[15]

European Conference on Computer Vision , year=

LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models , author=. European Conference on Computer Vision , year=
[16]

Zico and Morency, Louis-Philippe and Salakhutdinov, Ruslan

Tsai, Yao-Hung Hubert and Bai, Shaojie and Liang, Paul Pu and Kolter, J. Zico and Morency, Louis-Philippe and Salakhutdinov, Ruslan. Multimodal Transformer for Unaligned Multimodal Language Sequences. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019. doi:10.18653/v1/P19-1656

work page doi:10.18653/v1/p19-1656 2019
[17]

Large Language Models are not Fair Evaluators

Wang, Peiyi and Li, Lei and Chen, Liang and Cai, Zefan and Zhu, Dawei and Lin, Binghuai and Cao, Yunbo and Kong, Lingpeng and Liu, Qi and Liu, Tianyu and Sui, Zhifang. Large Language Models are not Fair Evaluators. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.ac...

work page doi:10.18653/v1/2024.acl-long.511 2024
[18]

ROUGE : A Package for Automatic Evaluation of Summaries

Lin, Chin-Yew. ROUGE : A Package for Automatic Evaluation of Summaries. Text Summarization Branches Out. 2004

2004
[19]

METEOR : An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments

Banerjee, Satanjeev and Lavie, Alon. METEOR : An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. 2005

2005
[20]

2020 , eprint=

BERTScore: Evaluating Text Generation with BERT , author=. 2020 , eprint=

2020
[21]

Proceedings of the 37th International Conference on Neural Information Processing Systems , articleno =

Dai, Wenliang and Li, Junnan and Li, Dongxu and Tiong, Anthony Meng Huat and Zhao, Junqi and Wang, Weisheng and Li, Boyang and Fung, Pascale and Hoi, Steven , title =. Proceedings of the 37th International Conference on Neural Information Processing Systems , articleno =. 2023 , publisher =

2023
[22]

Judging the Judges: A Systematic Study of Position Bias in LLM -as-a-Judge

Shi, Lin and Ma, Chiyu and Liang, Wenhua and Diao, Xingjian and Ma, Weicheng and Vosoughi, Soroush. Judging the Judges: A Systematic Study of Position Bias in LLM -as-a-Judge. Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguis...

2025
[23]

2024 , eprint=

Large Language Models are Zero-Shot Rankers for Recommender Systems , author=. 2024 , eprint=

2024
[24]

Found in the Middle: How Language Models Use Long Contexts Better via Plug-and-Play Positional Encoding , url =

Zhang, Zhenyu and Chen, Runjin and Liu, Shiwei and Yao, Zhewei and Ruwase, Olatunji and Chen, Beidi and Wu, Xiaoxia and Wang, Zhangyang , booktitle =. Found in the Middle: How Language Models Use Long Contexts Better via Plug-and-Play Positional Encoding , url =. doi:10.52202/079017-1943 , editor =

work page doi:10.52202/079017-1943 1943
[25]

Integrating Video and Text: A Balanced Approach to Multimodal Summary Generation and Evaluation

Pennec, Galann and Liu, Zhengyuan and Asher, Nicholas and Muller, Philippe and Chen, Nancy F. Integrating Video and Text: A Balanced Approach to Multimodal Summary Generation and Evaluation. Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Comput...

2025
[26]

On Positional Bias of Faithfulness for Long-form Summarization

Wan, David and Vig, Jesse and Bansal, Mohit and Joty, Shafiq. On Positional Bias of Faithfulness for Long-form Summarization. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.naacl-long.442

work page doi:10.18653/v1/2025.naacl-long.442 2025
[27]

Split and Merge: Aligning Position Biases in LLM -based Evaluators

Li, Zongjie and Wang, Chaozheng and Ma, Pingchuan and Wu, Daoyuan and Wang, Shuai and Gao, Cuiyun and Liu, Yang. Split and Merge: Aligning Position Biases in LLM -based Evaluators. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.621

work page doi:10.18653/v1/2024.emnlp-main.621 2024
[28]

Findings of the Association for Computational Linguistics: ACL 2024 , month = aug, year =

Koo, Ryan and Lee, Minhwa and Raheja, Vipul and Park, Jong Inn and Kim, Zae Myung and Kang, Dongyeop. Benchmarking Cognitive Biases in Large Language Models as Evaluators. Findings of the Association for Computational Linguistics: ACL 2024. 2024. doi:10.18653/v1/2024.findings-acl.29

work page doi:10.18653/v1/2024.findings-acl.29 2024
[29]

G -Eval: NLG Evaluation using Gpt-4 with Better Human Alignment

Liu, Yang and Iter, Dan and Xu, Yichong and Wang, Shuohang and Xu, Ruochen and Zhu, Chenguang. G -Eval: NLG Evaluation using Gpt-4 with Better Human Alignment. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. doi:10.18653/v1/2023.emnlp-main.153

work page doi:10.18653/v1/2023.emnlp-main.153 2023
[30]

GPTS core: Evaluate as You Desire

Fu, Jinlan and Ng, See-Kiong and Jiang, Zhengbao and Liu, Pengfei. GPTS core: Evaluate as You Desire. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.naacl-long.365

work page doi:10.18653/v1/2024.naacl-long.365 2024
[31]

B leu: a Method for Automatic Evaluation of Machine Translation

Papineni, Kishore and Roukos, Salim and Ward, Todd and Zhu, Wei-Jing. B leu: a Method for Automatic Evaluation of Machine Translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. 2002. doi:10.3115/1073083.1073135

work page doi:10.3115/1073083.1073135 2002
[32]

Evaluating the Factual Consistency of Abstractive Text Summarization

Kryscinski, Wojciech and McCann, Bryan and Xiong, Caiming and Socher, Richard. Evaluating the Factual Consistency of Abstractive Text Summarization. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020. doi:10.18653/v1/2020.emnlp-main.750

work page doi:10.18653/v1/2020.emnlp-main.750 2020
[33]

2025 , eprint=

InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling , author=. 2025 , eprint=

2025
[34]

Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

Identifying and Mitigating Position Bias of Multi-image Vision-Language Models , author=. Proceedings of the Computer Vision and Pattern Recognition Conference , pages=
[35]

2023 , eprint=

VideoLLM: Modeling Video Sequence with Large Language Models , author=. 2023 , eprint=

2023
[36]

Multimodal Abstractive Summarization for How2 Videos

Palaskar, Shruti and Libovick \'y , Jind r ich and Gella, Spandana and Metze, Florian. Multimodal Abstractive Summarization for How2 Videos. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019. doi:10.18653/v1/P19-1659

work page doi:10.18653/v1/p19-1659 2019
[37]

2023 , eprint=

Token Merging: Your ViT But Faster , author=. 2023 , eprint=

2023
[38]

2025 , eprint=

Qwen3-VL Technical Report , author=. 2025 , eprint=

2025
[39]

2025 , eprint=

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling , author=. 2025 , eprint=

2025
[40]

VCSUM : A Versatile C hinese Meeting Summarization Dataset

Wu, Han and Zhan, Mingjie and Tan, Haochen and Hou, Zhaohui and Liang, Ding and Song, Linqi. VCSUM : A Versatile C hinese Meeting Summarization Dataset. Findings of the Association for Computational Linguistics: ACL 2023. 2023. doi:10.18653/v1/2023.findings-acl.377

work page doi:10.18653/v1/2023.findings-acl.377 2023
[41]

Not Lost After All: How Cross-Encoder Attribution Challenges Position Bias Assumptions in LLM Summarization

Rahimi, Elahe and Sajjad, Hassan and Rosati, Domenic and Badawi, Abeer and Dolatabadi, Elham and Rudzicz, Frank. Not Lost After All: How Cross-Encoder Attribution Challenges Position Bias Assumptions in LLM Summarization. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025. doi:10.18653/v1/2025.findings-emnlp.846

work page doi:10.18653/v1/2025.findings-emnlp.846 2025
[42]

2024 , eprint=

CIDER: Counterfactual-Invariant Diffusion-based GNN Explainer for Causal Subgraph Inference , author=. 2024 , eprint=

2024
[43]

Reducing Position Bias in Simultaneous Machine Translation with Length-Aware Framework

Zhang, Shaolei and Feng, Yang. Reducing Position Bias in Simultaneous Machine Translation with Length-Aware Framework. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2022. doi:10.18653/v1/2022.acl-long.467

work page doi:10.18653/v1/2022.acl-long.467 2022
[44]

Where is the answer? An empirical study of positional bias for parametric knowledge extraction in language model

Saito, Kuniaki and Lee, Chen-Yu and Sohn, Kihyuk and Ushiku, Yoshitaka. Where is the answer? An empirical study of positional bias for parametric knowledge extraction in language model. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Pap...

work page doi:10.18653/v1/2025.naacl-long.58 2025
[45]

N ewsroom: A Dataset of 1.3 Million Summaries with Diverse Extractive Strategies

Grusky, Max and Naaman, Mor and Artzi, Yoav. N ewsroom: A Dataset of 1.3 Million Summaries with Diverse Extractive Strategies. Proceedings of the 2018 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). 2018. doi:10.18653/v1/N18-1065

work page doi:10.18653/v1/n18-1065 2018
[46]

2025 , howpublished =

GPT-5.1 Instant and GPT-5.1 Thinking System Card Addendum , author =. 2025 , howpublished =

2025
[47]

2018 , month = nov, url =

Many Turn to YouTube for Children's Content, News, How-To Lessons , author =. 2018 , month = nov, url =

2018
[48]

2016 , isbn =

Covington, Paul and Adams, Jay and Sargin, Emre , title =. 2016 , isbn =. doi:10.1145/2959100.2959190 , booktitle =

work page doi:10.1145/2959100.2959190 2016
[49]

2025 , url=

Tianhao Peng and Haochen Wang and Yuanxing Zhang and Zekun Moore Wang and Zili Wang and Ge Zhang and Jian Yang and Shihao Li and Yanghai Wang and Xintao Wang and Houyi Li and Wei Ji and Pengfei Wan and Wenhao Huang and Zhaoxiang Zhang and Jiaheng Liu , booktitle=. 2025 , url=

2025
[50]

2026 , eprint=

MVPBench: A Multi-Video Perception Evaluation Benchmark for Multi-Modal Video Understanding , author=. 2026 , eprint=

2026
[51]

Data- Q uest E val: A Referenceless Metric for Data-to-Text Semantic Evaluation

Rebuffel, Clement and Scialom, Thomas and Soulier, Laure and Piwowarski, Benjamin and Lamprier, Sylvain and Staiano, Jacopo and Scoutheeten, Geoffrey and Gallinari, Patrick. Data- Q uest E val: A Referenceless Metric for Data-to-Text Semantic Evaluation. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 2021. doi:10.1...

work page doi:10.18653/v1/2021.emnlp-main.633 2021
[52]

2025 , eprint=

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency , author=. 2025 , eprint=

2025
[53]

2026 , eprint=

MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction , author=. 2026 , eprint=

2026
[54]

2026 , eprint=

GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning , author=. 2026 , eprint=

2026
[55]

2025 , eprint=

MiMo-VL Technical Report , author=. 2025 , eprint=

2025
[56]

Gemini 3 Pro Model Card , year =
[57]

2026 , eprint=

OpenAI GPT-5 System Card , author=. 2026 , eprint=

2026
[58]

2026 , month = feb, url =

Gemini 3.1 Pro Model Card , author =. 2026 , month = feb, url =

2026
[59]

2026 , month = mar, day =

2026
[60]

Yu, Qiang Yang, and Xing Xie

Chang, Yupeng and Wang, Xu and Wang, Jindong and Wu, Yuan and Yang, Linyi and Zhu, Kaijie and Chen, Hao and Yi, Xiaoyuan and Wang, Cunxiang and Wang, Yidong and Ye, Wei and Zhang, Yue and Chang, Yi and Yu, Philip S. and Yang, Qiang and Xie, Xing , title =. 2024 , issue_date =. doi:10.1145/3641289 , journal =

work page doi:10.1145/3641289 2024

[1] [1]

Proceedings of the 2nd Workshop on User-Centric Narrative Summarization of Long Videos , pages =

Kansal, Kajal and Kansal, Nikita and Bavana, Sreevaatsav and Vamshi, Bodla Krishna and Goyal, Nidhi , title =. Proceedings of the 2nd Workshop on User-Centric Narrative Summarization of Long Videos , pages =. 2023 , isbn =. doi:10.1145/3607540.3617139 , abstract =

work page doi:10.1145/3607540.3617139 2023

[2] [2]

Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

Zhang, Hang and Li, Xin and Bing, Lidong. Video- LL a MA : An Instruction-tuned Audio-Visual Language Model for Video Understanding. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. 2023. doi:10.18653/v1/2023.emnlp-demo.49

work page doi:10.18653/v1/2023.emnlp-demo.49 2023

[3] [3]

2025 , eprint=

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities , author=. 2025 , eprint=

2025

[4] [4]

MLVU: Benchmarking Multi-task Long Video Understanding , year=

Zhou, Junjie and Shu, Yan and Zhao, Bo and Wu, Boya and Liang, Zhengyang and Xiao, Shitao and Qin, Minghao and Yang, Xi and Xiong, Yongping and Zhang, Bo and Huang, Tiejun and Liu, Zheng , booktitle=. MLVU: Benchmarking Multi-task Long Video Understanding , year=

[5] [5]

Proceedings of the AAAI conference on artificial intelligence , volume=

Activitynet-qa: A dataset for understanding complex web videos via question answering , author=. Proceedings of the AAAI conference on artificial intelligence , volume=

[6] [6]

QEVA : A Reference-Free Evaluation Metric for Narrative Video Summarization with Multimodal Question Answering

Jung, Woojun and Kim, Junyeong. QEVA : A Reference-Free Evaluation Metric for Narrative Video Summarization with Multimodal Question Answering. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025. doi:10.18653/v1/2025.findings-emnlp.1340

work page doi:10.18653/v1/2025.findings-emnlp.1340 2025

[7] [7]

Countering the Effects of Lead Bias in News Summarization via Multi-Stage Training and Auxiliary Losses

Grenander, Matt and Dong, Yue and Cheung, Jackie Chi Kit and Louis, Annie. Countering the Effects of Lead Bias in News Summarization via Multi-Stage Training and Auxiliary Losses. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). ...

work page doi:10.18653/v1/d19-1620 2019

[8] [8]

URLhttps://aclanthology.org/2024.tacl-1.9/

Liu, Nelson F. and Lin, Kevin and Hewitt, John and Paranjape, Ashwin and Bevilacqua, Michele and Petroni, Fabio and Liang, Percy. Lost in the Middle: How Language Models Use Long Contexts. Transactions of the Association for Computational Linguistics. 2024. doi:10.1162/tacl_a_00638

work page doi:10.1162/tacl_a_00638 2024

[9] [9]

Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

Identifying and mitigating position bias of multi-image vision-language models , author=. Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

[10] [10]

2025 , eprint=

Video-LevelGauge: Investigating Contextual Positional Bias in Large Video Language Models , author=. 2025 , eprint=

2025

[11] [11]

2015 , volume=

Heilbron, Fabian Caba and Escorcia, Victor and Ghanem, Bernard and Niebles, Juan Carlos , booktitle=. 2015 , volume=

2015

[12] [12]

Incorporating Background Knowledge into Video Description Generation

Whitehead, Spencer and Ji, Heng and Bansal, Mohit and Chang, Shih-Fu and Voss, Clare. Incorporating Background Knowledge into Video Description Generation. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 2018. doi:10.18653/v1/D18-1433

work page doi:10.18653/v1/d18-1433 2018

[13] [13]

Characterizing Positional Bias in Large Language Models: A Multi-Model Evaluation of Prompt Order Effects

Schilcher, Patrick and Karasin, Dominik and Sch. Characterizing Positional Bias in Large Language Models: A Multi-Model Evaluation of Prompt Order Effects. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025. doi:10.18653/v1/2025.findings-emnlp.1124

work page doi:10.18653/v1/2025.findings-emnlp.1124 2025

[14] [14]

P o S um-Bench: Benchmarking Position Bias in LLM -based Conversational Summarization

Sun, Xu and Delphin-Poulat, Lionel and Tarnec, Christ \`e le and Shimorina, Anastasia. P o S um-Bench: Benchmarking Position Bias in LLM -based Conversational Summarization. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025. doi:10.18653/v1/2025.emnlp-main.404

work page doi:10.18653/v1/2025.emnlp-main.404 2025

[15] [15]

European Conference on Computer Vision , year=

LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models , author=. European Conference on Computer Vision , year=

[16] [16]

Zico and Morency, Louis-Philippe and Salakhutdinov, Ruslan

Tsai, Yao-Hung Hubert and Bai, Shaojie and Liang, Paul Pu and Kolter, J. Zico and Morency, Louis-Philippe and Salakhutdinov, Ruslan. Multimodal Transformer for Unaligned Multimodal Language Sequences. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019. doi:10.18653/v1/P19-1656

work page doi:10.18653/v1/p19-1656 2019

[17] [17]

Large Language Models are not Fair Evaluators

Wang, Peiyi and Li, Lei and Chen, Liang and Cai, Zefan and Zhu, Dawei and Lin, Binghuai and Cao, Yunbo and Kong, Lingpeng and Liu, Qi and Liu, Tianyu and Sui, Zhifang. Large Language Models are not Fair Evaluators. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.ac...

work page doi:10.18653/v1/2024.acl-long.511 2024

[18] [18]

ROUGE : A Package for Automatic Evaluation of Summaries

Lin, Chin-Yew. ROUGE : A Package for Automatic Evaluation of Summaries. Text Summarization Branches Out. 2004

2004

[19] [19]

METEOR : An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments

Banerjee, Satanjeev and Lavie, Alon. METEOR : An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. 2005

2005

[20] [20]

2020 , eprint=

BERTScore: Evaluating Text Generation with BERT , author=. 2020 , eprint=

2020

[21] [21]

Proceedings of the 37th International Conference on Neural Information Processing Systems , articleno =

Dai, Wenliang and Li, Junnan and Li, Dongxu and Tiong, Anthony Meng Huat and Zhao, Junqi and Wang, Weisheng and Li, Boyang and Fung, Pascale and Hoi, Steven , title =. Proceedings of the 37th International Conference on Neural Information Processing Systems , articleno =. 2023 , publisher =

2023

[22] [22]

Judging the Judges: A Systematic Study of Position Bias in LLM -as-a-Judge

Shi, Lin and Ma, Chiyu and Liang, Wenhua and Diao, Xingjian and Ma, Weicheng and Vosoughi, Soroush. Judging the Judges: A Systematic Study of Position Bias in LLM -as-a-Judge. Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguis...

2025

[23] [23]

2024 , eprint=

Large Language Models are Zero-Shot Rankers for Recommender Systems , author=. 2024 , eprint=

2024

[24] [24]

Found in the Middle: How Language Models Use Long Contexts Better via Plug-and-Play Positional Encoding , url =

Zhang, Zhenyu and Chen, Runjin and Liu, Shiwei and Yao, Zhewei and Ruwase, Olatunji and Chen, Beidi and Wu, Xiaoxia and Wang, Zhangyang , booktitle =. Found in the Middle: How Language Models Use Long Contexts Better via Plug-and-Play Positional Encoding , url =. doi:10.52202/079017-1943 , editor =

work page doi:10.52202/079017-1943 1943

[25] [25]

Integrating Video and Text: A Balanced Approach to Multimodal Summary Generation and Evaluation

Pennec, Galann and Liu, Zhengyuan and Asher, Nicholas and Muller, Philippe and Chen, Nancy F. Integrating Video and Text: A Balanced Approach to Multimodal Summary Generation and Evaluation. Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Comput...

2025

[26] [26]

On Positional Bias of Faithfulness for Long-form Summarization

Wan, David and Vig, Jesse and Bansal, Mohit and Joty, Shafiq. On Positional Bias of Faithfulness for Long-form Summarization. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.naacl-long.442

work page doi:10.18653/v1/2025.naacl-long.442 2025

[27] [27]

Split and Merge: Aligning Position Biases in LLM -based Evaluators

Li, Zongjie and Wang, Chaozheng and Ma, Pingchuan and Wu, Daoyuan and Wang, Shuai and Gao, Cuiyun and Liu, Yang. Split and Merge: Aligning Position Biases in LLM -based Evaluators. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.621

work page doi:10.18653/v1/2024.emnlp-main.621 2024

[28] [28]

Findings of the Association for Computational Linguistics: ACL 2024 , month = aug, year =

Koo, Ryan and Lee, Minhwa and Raheja, Vipul and Park, Jong Inn and Kim, Zae Myung and Kang, Dongyeop. Benchmarking Cognitive Biases in Large Language Models as Evaluators. Findings of the Association for Computational Linguistics: ACL 2024. 2024. doi:10.18653/v1/2024.findings-acl.29

work page doi:10.18653/v1/2024.findings-acl.29 2024

[29] [29]

G -Eval: NLG Evaluation using Gpt-4 with Better Human Alignment

Liu, Yang and Iter, Dan and Xu, Yichong and Wang, Shuohang and Xu, Ruochen and Zhu, Chenguang. G -Eval: NLG Evaluation using Gpt-4 with Better Human Alignment. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. doi:10.18653/v1/2023.emnlp-main.153

work page doi:10.18653/v1/2023.emnlp-main.153 2023

[30] [30]

GPTS core: Evaluate as You Desire

Fu, Jinlan and Ng, See-Kiong and Jiang, Zhengbao and Liu, Pengfei. GPTS core: Evaluate as You Desire. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.naacl-long.365

work page doi:10.18653/v1/2024.naacl-long.365 2024

[31] [31]

B leu: a Method for Automatic Evaluation of Machine Translation

Papineni, Kishore and Roukos, Salim and Ward, Todd and Zhu, Wei-Jing. B leu: a Method for Automatic Evaluation of Machine Translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. 2002. doi:10.3115/1073083.1073135

work page doi:10.3115/1073083.1073135 2002

[32] [32]

Evaluating the Factual Consistency of Abstractive Text Summarization

Kryscinski, Wojciech and McCann, Bryan and Xiong, Caiming and Socher, Richard. Evaluating the Factual Consistency of Abstractive Text Summarization. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020. doi:10.18653/v1/2020.emnlp-main.750

work page doi:10.18653/v1/2020.emnlp-main.750 2020

[33] [33]

2025 , eprint=

InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling , author=. 2025 , eprint=

2025

[34] [34]

Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

Identifying and Mitigating Position Bias of Multi-image Vision-Language Models , author=. Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

[35] [35]

2023 , eprint=

VideoLLM: Modeling Video Sequence with Large Language Models , author=. 2023 , eprint=

2023

[36] [36]

Multimodal Abstractive Summarization for How2 Videos

Palaskar, Shruti and Libovick \'y , Jind r ich and Gella, Spandana and Metze, Florian. Multimodal Abstractive Summarization for How2 Videos. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019. doi:10.18653/v1/P19-1659

work page doi:10.18653/v1/p19-1659 2019

[37] [37]

2023 , eprint=

Token Merging: Your ViT But Faster , author=. 2023 , eprint=

2023

[38] [38]

2025 , eprint=

Qwen3-VL Technical Report , author=. 2025 , eprint=

2025

[39] [39]

2025 , eprint=

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling , author=. 2025 , eprint=

2025

[40] [40]

VCSUM : A Versatile C hinese Meeting Summarization Dataset

Wu, Han and Zhan, Mingjie and Tan, Haochen and Hou, Zhaohui and Liang, Ding and Song, Linqi. VCSUM : A Versatile C hinese Meeting Summarization Dataset. Findings of the Association for Computational Linguistics: ACL 2023. 2023. doi:10.18653/v1/2023.findings-acl.377

work page doi:10.18653/v1/2023.findings-acl.377 2023

[41] [41]

Not Lost After All: How Cross-Encoder Attribution Challenges Position Bias Assumptions in LLM Summarization

Rahimi, Elahe and Sajjad, Hassan and Rosati, Domenic and Badawi, Abeer and Dolatabadi, Elham and Rudzicz, Frank. Not Lost After All: How Cross-Encoder Attribution Challenges Position Bias Assumptions in LLM Summarization. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025. doi:10.18653/v1/2025.findings-emnlp.846

work page doi:10.18653/v1/2025.findings-emnlp.846 2025

[42] [42]

2024 , eprint=

CIDER: Counterfactual-Invariant Diffusion-based GNN Explainer for Causal Subgraph Inference , author=. 2024 , eprint=

2024

[43] [43]

Reducing Position Bias in Simultaneous Machine Translation with Length-Aware Framework

Zhang, Shaolei and Feng, Yang. Reducing Position Bias in Simultaneous Machine Translation with Length-Aware Framework. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2022. doi:10.18653/v1/2022.acl-long.467

work page doi:10.18653/v1/2022.acl-long.467 2022

[44] [44]

Where is the answer? An empirical study of positional bias for parametric knowledge extraction in language model

Saito, Kuniaki and Lee, Chen-Yu and Sohn, Kihyuk and Ushiku, Yoshitaka. Where is the answer? An empirical study of positional bias for parametric knowledge extraction in language model. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Pap...

work page doi:10.18653/v1/2025.naacl-long.58 2025

[45] [45]

N ewsroom: A Dataset of 1.3 Million Summaries with Diverse Extractive Strategies

Grusky, Max and Naaman, Mor and Artzi, Yoav. N ewsroom: A Dataset of 1.3 Million Summaries with Diverse Extractive Strategies. Proceedings of the 2018 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). 2018. doi:10.18653/v1/N18-1065

work page doi:10.18653/v1/n18-1065 2018

[46] [46]

2025 , howpublished =

GPT-5.1 Instant and GPT-5.1 Thinking System Card Addendum , author =. 2025 , howpublished =

2025

[47] [47]

2018 , month = nov, url =

Many Turn to YouTube for Children's Content, News, How-To Lessons , author =. 2018 , month = nov, url =

2018

[48] [48]

2016 , isbn =

Covington, Paul and Adams, Jay and Sargin, Emre , title =. 2016 , isbn =. doi:10.1145/2959100.2959190 , booktitle =

work page doi:10.1145/2959100.2959190 2016

[49] [49]

2025 , url=

Tianhao Peng and Haochen Wang and Yuanxing Zhang and Zekun Moore Wang and Zili Wang and Ge Zhang and Jian Yang and Shihao Li and Yanghai Wang and Xintao Wang and Houyi Li and Wei Ji and Pengfei Wan and Wenhao Huang and Zhaoxiang Zhang and Jiaheng Liu , booktitle=. 2025 , url=

2025

[50] [50]

2026 , eprint=

MVPBench: A Multi-Video Perception Evaluation Benchmark for Multi-Modal Video Understanding , author=. 2026 , eprint=

2026

[51] [51]

Data- Q uest E val: A Referenceless Metric for Data-to-Text Semantic Evaluation

Rebuffel, Clement and Scialom, Thomas and Soulier, Laure and Piwowarski, Benjamin and Lamprier, Sylvain and Staiano, Jacopo and Scoutheeten, Geoffrey and Gallinari, Patrick. Data- Q uest E val: A Referenceless Metric for Data-to-Text Semantic Evaluation. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 2021. doi:10.1...

work page doi:10.18653/v1/2021.emnlp-main.633 2021

[52] [52]

2025 , eprint=

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency , author=. 2025 , eprint=

2025

[53] [53]

2026 , eprint=

MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction , author=. 2026 , eprint=

2026

[54] [54]

2026 , eprint=

GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning , author=. 2026 , eprint=

2026

[55] [55]

2025 , eprint=

MiMo-VL Technical Report , author=. 2025 , eprint=

2025

[56] [56]

Gemini 3 Pro Model Card , year =

[57] [57]

2026 , eprint=

OpenAI GPT-5 System Card , author=. 2026 , eprint=

2026

[58] [58]

2026 , month = feb, url =

Gemini 3.1 Pro Model Card , author =. 2026 , month = feb, url =

2026

[59] [59]

2026 , month = mar, day =

2026

[60] [60]

Yu, Qiang Yang, and Xing Xie

Chang, Yupeng and Wang, Xu and Wang, Jindong and Wu, Yuan and Yang, Linyi and Zhu, Kaijie and Chen, Hao and Yi, Xiaoyuan and Wang, Cunxiang and Wang, Yidong and Ye, Wei and Zhang, Yue and Chang, Yi and Yu, Philip S. and Yang, Qiang and Xie, Xing , title =. 2024 , issue_date =. doi:10.1145/3641289 , journal =

work page doi:10.1145/3641289 2024