
arxiv: 2606.19338 · v1 · pith:VBDEKFGInew · submitted 2026-06-17 · 💻 cs.CV

Beyond the Current Observation: Evaluating Multimodal Large Language Models in Controllable Non-Markov Games

Pith reviewed 2026-06-26 21:15 UTC · model grok-4.3

classification 💻 cs.CV
keywords RNG-Bench · multimodal large language models · non-Markov games · memory reconstruction · Memory Gap metric · fine-tuning transfer · 3D Maze · agent evaluation

The pith

Multimodal models fail mostly by forgetting past observations in non-Markov games rather than making poor decisions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RNG-Bench to test how well multimodal large language models reconstruct and act on observations that are no longer visible during an ongoing interaction. It uses two games, Matching Pairs and 3D Maze, under three controlled difficulty axes (grid size, visual pattern, and observation modality) to isolate hidden-state reconstruction from other skills. A head-to-head duel protocol controls for instance-level variance, and a Memory Gap metric separates forgetting from suboptimal choices; the analysis attributes most residual errors to forgotten earlier observations. The hardest settings require roughly 128K-token contexts and about 350 images per episode, and frontier models remain far from saturating them. Fine-tuning Qwen3.5-9B on optimal-policy rollouts and filtered model demonstrations raises scores on this benchmark and transfers to existing benchmarks without degrading general multimodal capability.

Core claim

RNG-Bench isolates a base model's ability to reconstruct past observations and act on them during multi-step interaction through Matching Pairs and 3D Maze games evaluated under three controlled difficulty axes, with a head-to-head duel protocol and a Memory Gap metric that disentangles forgetting from poor action selection. The hardest configurations require contexts of roughly 128K tokens and 350 image inputs per episode and remain far from saturated by frontier MLLMs, while Memory Gap analysis shows that most residual errors stem from forgetting earlier observations rather than from suboptimal decision making. Fine-tuning Qwen3.5-9B on optimal-policy rollouts and filtered model demonstrations improves performance on RNG-Bench and transfers to existing benchmarks without degrading general multimodal capability.

What carries the argument

RNG-Bench benchmark suite with its Memory Gap metric that disentangles forgetting from poor action selection, evaluated via head-to-head duel protocol across two games.

If this is right

  • Hardest configurations require contexts of roughly 128K tokens and 350 image inputs per episode.
  • Most residual errors stem from forgetting earlier observations rather than suboptimal decision making.
  • Frontier MLLMs remain far from saturated on RNG-Bench.
  • Fine-tuning on optimal-policy rollouts and filtered model demonstrations improves performance on RNG-Bench.
  • Fine-tuning transfers to existing benchmarks without degrading general multimodal capability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Architectures that retain visual history across hundreds of steps could close the observed performance gap more directly than further context scaling.
  • The duel protocol and Memory Gap separation could be adapted to isolate memory effects in other agent benchmarks that currently mix multiple skills.
  • Transfer after fine-tuning suggests the benchmark measures a generalizable reconstructive skill rather than game-specific memorization.

Load-bearing premise

The two games and Memory Gap metric successfully isolate reconstruction of past observations from other agent skills.

What would settle it

A model achieving near-zero Memory Gap scores on the hardest configurations while still producing high error rates traceable to decision errors in separate non-memory tasks would indicate the metric does not isolate forgetting as claimed.

Original abstract

Deploying multimodal foundation models as closed-loop policies increasingly requires conditioning actions on observations that are no longer visible. However, existing benchmarks either expose the full state, conflate hidden-state reconstruction with other agent skills, or test recall only after an episode has ended. We introduce RNG-Bench (Reconstructive Non-Markov Games), a benchmark suite designed to isolate a base model's ability to reconstruct past observations and act on them during multi-step interaction. RNG-Bench includes two complementary games: Matching Pairs, where card identities briefly revealed at specific locations must later be recalled, and 3D Maze, where egocentric views must be integrated into a spatial map. Both games are evaluated under a unified harness with three controlled difficulty axes: grid size, visual pattern, and observation modality. The benchmark further introduces a head-to-head duel protocol to control for instance-level variance and a Memory Gap metric that disentangles forgetting from poor action selection. The hardest configurations require contexts of roughly 128K tokens and 350 image inputs per episode, and remain far from saturated by frontier MLLMs. Memory Gap analysis shows that most residual errors stem from forgetting earlier observations rather than from suboptimal decision making. Finally, fine-tuning Qwen3.5-9B on optimal-policy rollouts and filtered model demonstrations improves performance on RNG-Bench and transfers to existing benchmarks without degrading general multimodal capability.
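
To make the non-Markov structure concrete, the following toy sketch simulates Matching Pairs with one agent that remembers every revealed card and one that remembers nothing. It is an illustration only, not the RNG-Bench harness: the function names, the board encoding, and the turn-counting rule are assumptions made for this example.

```python
import random


def make_board(n_pairs, seed):
    """Return a shuffled list of card identities; each identity appears twice."""
    cards = list(range(n_pairs)) * 2
    random.Random(seed).shuffle(cards)
    return cards


def turns_with_memory(cards):
    """Turns used by a hypothetical agent that recalls every card it has seen."""
    known = {}                        # identity -> position seen once, still unmatched
    unseen = list(range(len(cards)))  # positions never flipped
    pending = 0                       # fully located pairs not yet collected
    turns = 0
    while unseen or pending:
        turns += 1
        if pending:                   # both positions remembered: certain match
            pending -= 1
            continue
        a = unseen.pop()
        if cards[a] in known:         # partner was revealed earlier: recall it
            del known[cards[a]]
            continue
        b = unseen.pop()              # partner of a is still face down somewhere
        if cards[b] == cards[a]:      # lucky match on two fresh cards
            continue
        if cards[b] in known:         # b completes a pair seen earlier
            del known[cards[b]]
            pending += 1
        else:
            known[cards[b]] = b
        known[cards[a]] = a
    return turns


def turns_without_memory(cards, seed):
    """Turns used by an agent that flips two random face-down cards each turn."""
    rng = random.Random(seed)
    remaining = list(range(len(cards)))
    turns = 0
    while remaining:
        turns += 1
        i, j = rng.sample(remaining, 2)
        if cards[i] == cards[j]:
            remaining.remove(i)
            remaining.remove(j)
    return turns


board = make_board(n_pairs=8, seed=0)
print("with memory:   ", turns_with_memory(board))
print("without memory:", turns_without_memory(board, seed=1))
```

In this toy setting the remembering agent never needs more than about two turns per pair, whereas the random agent needs n² turns in expectation for n pairs. The current observation alone does not determine the best move, which is the property the benchmark is built around.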

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces RNG-Bench, a benchmark suite with Matching Pairs and 3D Maze games, to evaluate MLLMs on reconstructing and acting upon past observations in controllable non-Markov environments. It defines three difficulty axes (grid size, visual pattern, observation modality), a head-to-head duel protocol, and a Memory Gap metric intended to separate forgetting from suboptimal action selection. Experiments show frontier models remain far from saturation at scales of ~128K tokens and 350 images per episode, attribute most residual errors to forgetting, and report that fine-tuning Qwen3.5-9B on optimal-policy rollouts improves RNG-Bench performance while transferring to other benchmarks without degrading general multimodal capability.

Significance. If the Memory Gap metric validly isolates reconstruction failure, the benchmark would provide a useful diagnostic tool for memory limitations in closed-loop multimodal agents and a concrete path for improvement via targeted fine-tuning. The controlled axes and duel protocol are methodologically positive for reducing instance variance. The transfer results, if robust, would strengthen the practical value of the work.
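
The variance-reduction idea behind a head-to-head protocol can be illustrated with a small sketch in which both agents face identical seeded instances and are compared per instance. The `duel` function, the lower-cost-wins rule, and the stand-in agents are assumptions for illustration; this page does not specify how the paper scores a duel.

```python
def duel(agent_a, agent_b, seeds):
    """Compare two agents instance by instance on the same seeds.

    Each agent is a callable mapping a seed to a cost (e.g. turns used);
    lower cost wins the instance. Returns (wins_a, wins_b, ties).
    """
    wins_a = wins_b = ties = 0
    for seed in seeds:
        cost_a, cost_b = agent_a(seed), agent_b(seed)
        if cost_a < cost_b:
            wins_a += 1
        elif cost_b < cost_a:
            wins_b += 1
        else:
            ties += 1
    return wins_a, wins_b, ties


# Stand-in agents: cost depends on the instance, so pairing by seed matters.
steady = lambda seed: 10 + seed % 3
erratic = lambda seed: 9 + seed % 5
print(duel(steady, erratic, range(20)))
```

Because each comparison is made on the same instance, an unusually easy or hard board affects both agents alike and does not bias the win counts.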

major comments (2)
  1. [Abstract and §3] Abstract and §3 (Memory Gap metric): The central claim that 'most residual errors stem from forgetting earlier observations rather than from suboptimal decision making' depends on the metric successfully disentangling the two factors. The manuscript states the benchmark is 'designed to isolate' reconstruction 'without conflating hidden-state reconstruction with other agent skills' but supplies no derivation, game-theoretic argument, or ablation showing that optimal action sequences exist independently of accurate reconstruction of hidden observations in Matching Pairs or 3D Maze. If every optimal action requires exact recall of card identities or spatial layout, the attribution is circular rather than diagnostic.
  2. [§4] §4 (evaluation results): The statement that the hardest configurations 'remain far from saturated by frontier MLLMs' and the Memory Gap analysis are presented without reported error bars, number of independent runs, or statistical tests on the residual-error breakdown. This makes it impossible to assess whether the reported dominance of forgetting is robust or sensitive to sampling variance.
minor comments (2)
  1. [Abstract] The abstract lists three controlled difficulty axes but does not enumerate the concrete parameter values used for grid size, visual pattern, and modality; these should be tabulated in §3 for reproducibility.
  2. [§3] Notation for the Memory Gap metric (e.g., how the 'gap' is computed from reconstruction-conditioned vs. unconditional policy error) should be formalized with an equation rather than described only in prose.
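
Minor comment 2 asks for a formal definition. As a hedged illustration of what such a definition could look like, the sketch below splits a model's shortfall against an optimal policy into a forgetting part and a decision part. The three evaluation conditions, the function name `decompose_shortfall`, and the example scores are assumptions for this example; this page does not give the paper's actual formula.

```python
from statistics import mean


def decompose_shortfall(optimal, with_history, from_memory):
    """Split the mean shortfall against an optimal policy into two parts.

    Assumed conditions (hypothetical, one score per episode, higher is better):
      optimal      - optimal policy with the complete observation history
      with_history - the model, with past observations restated in its prompt
      from_memory  - the model, relying only on its own context
    """
    if not (len(optimal) == len(with_history) == len(from_memory)) or not optimal:
        raise ValueError("need equal-length, non-empty score lists")
    # Shortfall that disappears once the history is handed back to the model.
    memory_gap = mean(h - m for h, m in zip(with_history, from_memory))
    # Shortfall that remains even with the full history available.
    decision_gap = mean(o - h for o, h in zip(optimal, with_history))
    return memory_gap, decision_gap


# Made-up scores for three episodes, for illustration only.
mem, dec = decompose_shortfall([10, 10, 10], [9, 8, 10], [4, 5, 3])
print(f"memory gap={mem}, decision gap={dec}")
```

Under this reading, a memory gap that is large relative to the decision gap would correspond to the claim that forgetting dominates the residual errors.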

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on RNG-Bench and the Memory Gap metric. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core claims.

  1. Referee: [Abstract and §3] Abstract and §3 (Memory Gap metric): The central claim that 'most residual errors stem from forgetting earlier observations rather than from suboptimal decision making' depends on the metric successfully disentangling the two factors. The manuscript states the benchmark is 'designed to isolate' reconstruction 'without conflating hidden-state reconstruction with other agent skills' but supplies no derivation, game-theoretic argument, or ablation showing that optimal action sequences exist independently of accurate reconstruction of hidden observations in Matching Pairs or 3D Maze. If every optimal action requires exact recall of card identities or spatial layout, the attribution is circular rather than diagnostic.

    Authors: We agree that the current manuscript lacks an explicit derivation or ablation demonstrating independence between optimal action selection and full hidden-state reconstruction. The Memory Gap is computed by comparing model performance against an oracle policy that receives the complete observation history, but we did not include a formal argument showing that partial information suffices for some optimal moves in either game. We will add an appendix with (i) a game-theoretic definition of the metric, (ii) a proof sketch that optimal policies in Matching Pairs can exploit partial matches without full recall, and (iii) an ablation that perturbs reconstruction accuracy while holding action selection fixed. This addresses the circularity concern directly. revision: yes

  2. Referee: [§4] §4 (evaluation results): The statement that the hardest configurations 'remain far from saturated by frontier MLLMs' and the Memory Gap analysis are presented without reported error bars, number of independent runs, or statistical tests on the residual-error breakdown. This makes it impossible to assess whether the reported dominance of forgetting is robust or sensitive to sampling variance.

    Authors: The referee is correct that the submitted version omits error bars, run counts, and statistical tests for the Memory Gap breakdown. We will revise §4 to report results over five independent random seeds per configuration, include standard-error bars on all bar plots, and add paired t-tests (with p-values) comparing the forgetting versus decision-making error components to establish that the dominance of forgetting is statistically significant rather than an artifact of sampling variance. revision: yes
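
The analysis promised in the second response can be sketched as a standard error across seeds plus a paired t statistic on per-seed differences. This is a standard-library sketch; the helper names and the numbers are hypothetical, and a p-value would still have to be read from a t distribution with the returned degrees of freedom (for instance via `scipy.stats.ttest_rel`).

```python
from math import sqrt
from statistics import mean, stdev


def standard_error(values):
    """Sample standard deviation divided by sqrt(n)."""
    return stdev(values) / sqrt(len(values))


def paired_t(a, b):
    """Paired t statistic and degrees of freedom for matched samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 items")
    diffs = [x - y for x, y in zip(a, b)]
    se = standard_error(diffs)
    if se == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean(diffs) / se, len(diffs) - 1


# Hypothetical per-seed error shares over five seeds (not results from the paper).
forgetting = [0.62, 0.58, 0.65, 0.60, 0.61]
decision = [0.21, 0.25, 0.19, 0.22, 0.24]
t_stat, dof = paired_t(forgetting, decision)
print(f"SE(forgetting)={standard_error(forgetting):.4f}, t={t_stat:.2f}, dof={dof}")
```

Pairing by seed matches the duel protocol's motivation: both error components are measured on the same instances, so instance difficulty cancels in the differences.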

Circularity Check

0 steps flagged

No circularity: new benchmark and metric are independently defined


The paper defines RNG-Bench, its two games, the duel protocol, and Memory Gap metric from first principles in the abstract and introduction to isolate reconstruction from other skills. No equations, fitted parameters, or self-citations are shown that reduce the Memory Gap attribution or performance claims to quantities defined by the authors' own prior work or by construction. The central empirical finding (most errors from forgetting) is presented as an outcome of running the benchmark on frontier models rather than a definitional tautology. The derivation chain is therefore self-contained against external model evaluations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The paper introduces a benchmark and metric without additional fitted parameters or unstated mathematical axioms beyond the design goal of isolating reconstruction.

axioms (1)
  • domain assumption: The proposed games isolate reconstruction ability from other agent skills.
    Stated as the design intent in the abstract.
invented entities (2)
  • RNG-Bench (no independent evidence)
    purpose: Benchmark suite for non-Markov reconstruction in MLLMs
    Newly introduced in the paper.
  • Memory Gap metric (no independent evidence)
    purpose: Disentangle forgetting from suboptimal action selection
    Newly introduced in the paper.



Reference graph

Works this paper leans on

100 extracted references · 20 linked inside Pith

  1. [1]

    Memorybench: A benchmark for memory and continual learning in llm systems.arXiv preprint arXiv:2510.17281, 2025

    Qingyao Ai, Yichen Tang, Changyue Wang, Jianming Long, Weihang Su, and Yiqun Liu. Memorybench: A benchmark for memory and continual learning in llm systems.arXiv preprint arXiv:2510.17281, 2025

  2. [2]

    Flamingo: a visual language model for few-shot learning.Advances in Neural Information Processing Systems, 35:23716–23736, 2022

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning.Advances in Neural Information Processing Systems, 35:23716–23736, 2022

  3. [3]

    L-eval: Instituting standardized evaluation for long context language models

    Chenxin An, Shansan Gong, Ming Zhong, Xingjian Zhao, Mukai Li, Jun Zhang, Lingpeng Kong, and Xipeng Qiu. L-eval: Instituting standardized evaluation for long context language models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14388–14411, 2024

  4. [4]

    Longbench: A bilingual, multitask benchmark for long context understanding

    Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. Longbench: A bilingual, multitask benchmark for long context understanding. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages ...

  5. [5]

    Longbench v2: Towards deeper understanding and reasoning on realistic long-context multitasks

    Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. Longbench v2: Towards deeper understanding and reasoning on realistic long-context multitasks. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), page...

  6. [6]

    Ms marco: A human generated machine reading comprehension dataset.arXiv preprint arXiv:1611.09268, 2016

    Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, et al. Ms marco: A human generated machine reading comprehension dataset.arXiv preprint arXiv:1611.09268, 2016

  7. [7]

    Seed-2.0-Lite-260428: Omni-modal understanding across video, image, audio, and text, April 2026

    ByteDance Seed. Seed-2.0-Lite-260428: Omni-modal understanding across video, image, audio, and text, April 2026. URLhttps://seed.bytedance.com/en/seed2

  8. [8]

    clembench: Using game play to evaluate chat-optimized language models as conversational agents

    Kranti Chalamalasetti, Jana Götze, Sherzod Hakimov, Brielen Madureira, Philipp Sadler, and David Schlangen. clembench: Using game play to evaluate chat-optimized language models as conversational agents. InProceedings of the 2023 conference on empirical methods in natural language processing, pages 11174–11219, 2023

  9. [9]

    SpatialVLM: Endowing vision-language models with spatial reasoning capabilities

    Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. SpatialVLM: Endowing vision-language models with spatial reasoning capabilities. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  10. [10]

    ShareGPT4Video: Improving video understanding and generation with better captions.Advances in Neural Information Processing Systems, 37:19472–19495, 2024

    Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Bin Lin, Zhenyu Tang, Li Yuan, Yu Qiao, Dahua Lin, Feng Zhao, and Jiaqi Wang. ShareGPT4Video: Improving video understanding and generation with better captions.Advances in Neural Information Processing Systems, 37:19472–19495, 2024. 10 Beyond the Current Observ...

  11. [11]

    Iwr-bench: Can lvlms reconstruct interactive webpage from a user interaction video?arXiv preprint arXiv:2509.24709, 2025

    Yang Chen, Minghao Liu, Yufan Shen, Yunwen Li, Tianyuan Huang, Xinyu Fang, Tianyu Zheng, Wenxuan Huang, Cheng Yang, Daocheng Fu, et al. Iwr-bench: Can lvlms reconstruct interactive webpage from a user interaction video?arXiv preprint arXiv:2509.24709, 2025

  12. [12]

    Gamebench: Evaluating strategic reasoning abilities of llm agents.arXiv preprint arXiv:2406.06613, 2024

    Anthony Costarelli, Mat Allen, Roman Hauksson, Grace Sodunke, Suhas Hariharan, Carlson Cheng, Wenjie Li, and Arjun Yadav. Gamebench: Evaluating strategic reasoning abilities of llm agents.arXiv preprint arXiv:2406.06613, 2024

  13. [13]

    EndoCoT: Scaling endogenous chain-of-thought reasoning in diffusion models.arXiv preprint arXiv:2603.12252, 2026

    Xuanlang Dai, Yujie Zhou, Long Xing, Jiazi Bu, Xilin Wei, Yuhong Liu, Beichen Zhang, Kai Chen, and Yuhang Zang. EndoCoT: Scaling endogenous chain-of-thought reasoning in diffusion models.arXiv preprint arXiv:2603.12252, 2026

  14. [14]

    Mind2web: Towards a generalist agent for the web.Advances in Neural Information Processing Systems, 36, 2023

    Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web.Advances in Neural Information Processing Systems, 36, 2023

  15. [15]

    MM-IFEngine: Towards multimodal instruction following

    Shengyuan Ding, Shenxi Wu, Xiangyu Zhao, Yuhang Zang, Haodong Duan, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Dahua Lin, and Jiaqi Wang. MM-IFEngine: Towards multimodal instruction following. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025

  16. [16]

    ARM-Thinker: Reinforcing multimodal generative reward models with agentic tool use and visual reasoning

    Shengyuan Ding, Xinyu Fang, Ziyu Liu, Yuhang Zang, Yuhang Cao, Xiangyu Zhao, Haodong Duan, Xiaoyi Dong, Jiaqi Liang, et al. ARM-Thinker: Reinforcing multimodal generative reward models with agentic tool use and visual reasoning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026

  17. [17]

    WildClawBench: A benchmark for real-world, long-horizon agent evaluation.arXiv preprint arXiv:2605.10912, 2026

    Shuangrui Ding, Xuanlang Dai, Long Xing, Shengyuan Ding, Ziyu Liu, Yang JingYi, Penghui Yang, Zhixiong Zhang, Xilin Wei, Xinyu Fang, et al. WildClawBench: A benchmark for real-world, long-horizon agent evaluation.arXiv preprint arXiv:2605.10912, 2026

  18. [18]

    InternLM- XComposer2: Mastering free-form text-image composition and comprehension in vision-language large model.arXiv preprint arXiv:2401.16420, 2024

    Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, HaodongDuan, MaosongCao, WenweiZhang, YiningLi, HangYan, YangGao, XinyueZhang, Wei Li, Jingwen Li, Kai Chen, Conghui He, Xingcheng Zhang, Yu Qiao, Dahua Lin, and Jiaqi Wang. InternLM- XComposer2: Mastering free-form text-image composition and comprehension ...

  19. [19]

    VLMEvalKit: An open-source toolkit for evaluating large multi-modality models

    HaodongDuan, XinyuFang, JunmingYang, XiangyuZhao, YuxuanQiao, MoLi, AmitAgarwal, ZheChen, Lin Chen, Yuan Liu, Yubo Ma, Hailong Sun, Yifan Zhang, Shiyin Lu, Tack Hwa Wong, Weiyun Wang, Peiheng Zhou, Xiaozhe Li, Chaoyou Fu, Junbo Cui, Jixuan Chen, Enxin Song, Song Mao, Shengyuan Ding, Tianhao Liang, Zicheng Zhang, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi ...

  20. [20]

    GTBench: Uncovering the strategic reasoning limitations of LLMs via game-theoretic evaluations.arXiv preprint arXiv:2402.12348, 2024

    Jinhao Duan, Renming Zhang, James Diffenderfer, Bhavya Kailkhura, Lichao Sun, Elias Stengel-Eskin, Mohit Bansal, Tianlong Chen, and Kaidi Xu. GTBench: Uncovering the strategic reasoning limitations of LLMs via game-theoretic evaluations.arXiv preprint arXiv:2402.12348, 2024

  21. [21]

    Mmbench- video: Along-formmulti-shotbenchmarkforholisticvideounderstanding.AdvancesinNeuralInformation Processing Systems, 37:89098–89124, 2024

    Xinyu Fang, Kangrui Mao, Haodong Duan, Xiangyu Zhao, Yining Li, Dahua Lin, and Kai Chen. Mmbench- video: Along-formmulti-shotbenchmarkforholisticvideounderstanding.AdvancesinNeuralInformation Processing Systems, 37:89098–89124, 2024

  22. [22]

    Creation-mmbench: Assessing context-aware creative intelligence in mllms

    Xinyu Fang, Zhijian Chen, Kai Lan, Lixin Ma, Shengyuan Ding, Yingji Liang, Xiangyu Zhao, Farong Wen, Zicheng Zhang, Guofeng Zhang, Haodong Duan, Kai Chen, and Dahua Lin. Creation-mmbench: Assessing context-aware creative intelligence in mllms. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 447–456, October 2025

  23. [23]

    Gemini 3.1 Pro model card, February 2026

    Google DeepMind. Gemini 3.1 Pro model card, February 2026. URLhttps://deepmind.google/ models/model-cards/gemini-3-1-pro/

  24. [24]

    TextArena.arXiv preprint arXiv:2504.11442, 2025

    Leon Guertler, Bobby Cheng, Simon Yu, Bo Liu, Leshem Choshen, and Cheston Tan. TextArena.arXiv preprint arXiv:2504.11442, 2025. 11 Beyond the Current Observation: Evaluating Multimodal Large Language Models in Controllable Non-Markov Games

  25. [25]

    DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

    Daya Guo, Dejian Yang, Haowei Zhang, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

  26. [26]

    Ruler: What’s the real context size of your long-context language models?arXiv preprint arXiv:2404.06654, 2024

    Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. Ruler: What’s the real context size of your long-context language models?arXiv preprint arXiv:2404.06654, 2024

  27. [27]

    lmgame-bench: How good are llms at playing games?arXiv preprint arXiv:2505.15146, 2025

    Lanxiang Hu, Mingjia Huo, Yuxuan Zhang, Haoyang Yu, Eric P Xing, Ion Stoica, Tajana Rosing, Hao- jian Jin, and Hao Zhang. lmgame-bench: How good are llms at playing games?arXiv preprint arXiv:2505.15146, 2025

  28. [28]

    Evaluating memory in llm agents via incremental multi-turn interactions.arXiv preprint arXiv:2507.05257, 2025

    Yuanzhe Hu, Yu Wang, and Julian McAuley. Evaluating memory in llm agents via incremental multi-turn interactions.arXiv preprint arXiv:2507.05257, 2025

  29. [29]

    Smith, and Ranjay Krishna

    Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A. Smith, and Ranjay Krishna. Visual sketchpad: Sketching as a visual chain of thought for multimodal language models. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

  30. [30]

    Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension

    Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. InProceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, 2017

  31. [31]

    Littman, and Anthony R

    Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra. Planning and acting in partially observable stochastic domains.Artificial Intelligence, 101(1–2):99–134, 1998

  32. [32]

    Large language models are zero-shot reasoners

    Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. InAdvances in Neural Information Processing Systems, volume 35, pages 22199–22213, 2022

  33. [33]

    M4le: A multi-ability multi-range multi-task multi-domain long-context evaluation benchmark for large language models

    Wai-Chung Kwan, Xingshan Zeng, Yufei Wang, Yusen Sun, Liangyou Li, Yuxin Jiang, Lifeng Shang, Qun Liu, and Kam-Fai Wong. M4le: A multi-ability multi-range multi-task multi-domain long-context evaluation benchmark for large language models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pa...

  34. [34]

    Natural questions: A benchmark for question answering research.Transactions of the Association for Computational Linguistics, 7:453–466, 2019

    Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, DanielleEpstein, IlliaPolosukhin, MatthewKelcey, JacobDevlin, KentonLee, KristinaN.Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural questions: A benchmark for question answering research.Transactions of...

  35. [35]

    Beyond fixed: Training- free variable-length denoising for diffusion large language models.arXiv preprint arXiv:2508.00819, 2025

    Jinsong Li, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Jiaqi Wang, and Dahua Lin. Beyond fixed: Training- free variable-length denoising for diffusion large language models.arXiv preprint arXiv:2508.00819, 2025

  36. [36]

    Visual self-refine: A pixel-guided paradigm for accurate chart parsing.arXiv preprint arXiv:2602.16455, 2026

    Jinsong Li, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Jiaqi Wang, and Dahua Lin. Visual self-refine: A pixel-guided paradigm for accurate chart parsing.arXiv preprint arXiv:2602.16455, 2026

  37. [37]

    VideoChat: Chat-centric video understanding.arXiv preprint arXiv:2305.06355, 2023

    KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. VideoChat: Chat-centric video understanding.arXiv preprint arXiv:2305.06355, 2023

  38. [38]

    OPT-Bench: Evaluating LLM agent on large-scale search spaces optimization problems.arXiv preprint arXiv:2506.10764, 2025

    Xiaozhe Li, Jixuan Chen, Xinyu Fang, Shengyuan Ding, Haodong Duan, Qingwen Liu, and Kai Chen. OPT-Bench: Evaluating LLM agent on large-scale search spaces optimization problems.arXiv preprint arXiv:2506.10764, 2025

  39. [39]

    NP-Engine: Empowering optimization reasoning in large language models with verifiable synthetic NP problems.arXiv preprint arXiv:2510.16476, 2025

    Xiaozhe Li, Xinyu Fang, Shengyuan Ding, Linyang Li, Haodong Duan, Qingwen Liu, and Kai Chen. NP-Engine: Empowering optimization reasoning in large language models with verifiable synthetic NP problems.arXiv preprint arXiv:2510.16476, 2025

  40. [40]

    Emembench: Interactive benchmarking of episodic memory for vlm agents.arXiv preprint arXiv:2601.16690, 2026

    Xinze Li, Ziyue Zhu, Siyuan Liu, Yubo Ma, Yuhang Zang, Yixin Cao, and Aixin Sun. Emembench: Interactive benchmarking of episodic memory for vlm agents.arXiv preprint arXiv:2601.16690, 2026. 12 Beyond the Current Observation: Evaluating Multimodal Large Language Models in Controllable Non-Markov Games

  41. [41]

    AvalonBench: Evaluating LLMs playing the game of Avalon.arXiv preprint arXiv:2310.05036, 2023

    Jonathan Light, Min Cai, Sheng Shen, and Ziniu Hu. AvalonBench: Evaluating LLMs playing the game of Avalon.arXiv preprint arXiv:2310.05036, 2023

  42. [42]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. InAdvances in Neural Information Processing Systems, volume 36, 2023

  43. [43]

    Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang

    Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts.Transactions of the Association for Computational Linguistics, 12:157–173, 2024

  44. [44]

    Agentbench: Evaluating llms as agents

    Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. Agentbench: Evaluating llms as agents. InInternational Conference on Learning Representations, 2024

  45. [45]

    Spatial-SSRL: Enhancing spatial understanding via self-supervised reinforcement learning

    Yuhong Liu, Beichen Zhang, Yuhang Zang, Yuhang Cao, Long Xing, Xiaoyi Dong, Haodong Duan, Dahua Lin, and Jiaqi Wang. Spatial-SSRL: Enhancing spatial understanding via self-supervised reinforcement learning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026

  46. [46]

    STAR-Bench: Probing deep spatio- temporal reasoning as audio 4d intelligence

    Zihan Liu, Zhikang Niu, Qiuyang Xiao, Zhisheng Zheng, Ruoqi Yuan, Yuhang Zang, Yuhang Cao, Xiaoyi Dong, Jianze Liang, Xie Chen, Leilei Sun, Dahua Lin, and Jiaqi Wang. STAR-Bench: Probing deep spatio- temporal reasoning as audio 4d intelligence. InInternational Conference on Learning Representations (ICLR), 2026

  47. [47]

    MMDU: A multi-turn multi-image dialog understanding benchmark and instruction-tuning dataset for LVLMs.Advances in Neural Information Processing Systems, 37: 8698–8733, 2024

    Ziyu Liu, Tao Chu, Yuhang Zang, Xilin Wei, Xiaoyi Dong, Pan Zhang, Zijian Liang, Yuanjun Xiong, Yu Qiao, Dahua Lin, and Jiaqi Wang. MMDU: A multi-turn multi-image dialog understanding benchmark and instruction-tuning dataset for LVLMs.Advances in Neural Information Processing Systems, 37: 8698–8733, 2024

  48. [48]

    Visual-RFT: Visual reinforcement fine-tuning

    Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-RFT: Visual reinforcement fine-tuning. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2034–2044, 2025

  49. [49]

    SPARK: Synergistic policy and reward co-evolving framework.arXiv preprint arXiv:2509.22624, 2025

    ZiyuLiu, YuhangZang, ShengyuanDing, YuhangCao, XiaoyiDong, HaodongDuan, DahuaLin, andJiaqi Wang. SPARK: Synergistic policy and reward co-evolving framework.arXiv preprint arXiv:2509.22624, 2025

  50. [50]

    MIA-DPO: Multi-image augmented direct preference optimization for large vision-language models

    Ziyu Liu, Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Haodong Duan, Conghui He, Yuanjun Xiong, Dahua Lin, and Jiaqi Wang. MIA-DPO: Multi-image augmented direct preference optimization for large vision-language models. InInternational Conference on Learning Representations (ICLR), 2025

  51. [51]

    Visual-ERM: Reward modeling for visual equivalence.arXiv preprint arXiv:2603.13224, 2026

    Ziyu Liu, Shengyuan Ding, Xinyu Fang, Xuanlang Dai, Penghui Yang, Jianze Liang, Jiaqi Wang, Kai Chen, Dahua Lin, and Yuhang Zang. Visual-ERM: Reward modeling for visual equivalence.arXiv preprint arXiv:2603.13224, 2026

  52. [52]

    Mmlongbench-doc: Benchmarking long-context document understanding with visualizations

    Yubo Ma, Yuhang Zang, Liangyu Chen, Meiqi Chen, Yizhu Jiao, Xinze Li, Xinyuan Lu, Ziyu Liu, Yan Ma, Xiaoyi Dong, et al. Mmlongbench-doc: Benchmarking long-context document understanding with visualizations. Advances in Neural Information Processing Systems, 37:95963–96010, 2024

  53. [53]

    Evaluating very long-term conversational memory of llm agents

    Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. Evaluating very long-term conversational memory of llm agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13851–13870, 2024

  54. [54]

    Nolima: Long-context evaluation beyond literal matching

    Ali Modarressi, Hanieh Deilamsalehy, Franck Dernoncourt, Trung Bui, Ryan A. Rossi, Seunghyun Yoon, and Hinrich Schütze. Nolima: Long-context evaluation beyond literal matching. In International Conference on Machine Learning, 2025

  55. [55]

    Large language diffusion models

    Shen Nie, Fengqi Zhu, Zebin You, et al. Large language diffusion models. arXiv preprint arXiv:2502.09992, 2025

  56. [56]

    Show your work: Scratchpads for intermediate computation with language models

    Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, Charles Sutton, and Augustus Odena. Show your work: Scratchpads for intermediate computation with language models. arXiv preprint arXiv:2112.00114, 2021

  57. [57]

    Introducing gpt-5.4, March 2026

    OpenAI. Introducing gpt-5.4, March 2026. URL https://openai.com/index/introducing-gpt-5-4/

  58. [58]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems (NeurIPS), 2022

  59. [59]

    Game-world: Towards standardized and verifiable evaluation of multimodal game agents

    Mingyu Ouyang, Siyuan Hu, Kevin Qinghong Lin, Hwee Tou Ng, and Mike Zheng Shou. Game-world: Towards standardized and verifiable evaluation of multimodal game agents. arXiv preprint arXiv:2604.07429, 2026

  60. [60]

    Balrog: Benchmarking agentic llm and vlm reasoning on games

    Davide Paglieri, Bartlomiej Cupial, Samuel Coward, Ulyana Piterbarg, Maciej Wolczyk, Akbir Khan, Eduardo Pignatelli, Lukasz Kucinski, Lerrel Pinto, Rob Fergus, Jakob Nicolaus Foerster, Jack Parker-Holder, and Tim Rocktaschel. Balrog: Benchmarking agentic llm and vlm reasoning on games. arXiv preprint arXiv:2411.13543, 2024

  61. [61]

    Do the rewards justify the means? Measuring trade-offs between rewards and ethical behavior in the MACHIAVELLI benchmark

    Alexander Pan, Jun Shern Chan, Andy Zou, Nathaniel Li, Steven Basart, Thomas Woodside, Jonathan Ng, Hanlin Zhang, Scott Emmons, and Dan Hendrycks. Do the rewards justify the means? Measuring trade-offs between rewards and ethical behavior in the MACHIAVELLI benchmark. In International Conference on Machine Learning, 2023

  62. [62]

    Kilt: a benchmark for knowledge intensive language tasks

    Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick Lewis, Majid Yazdani, Nicola De Cao, James Thorne, Yacine Jernite, Vladimir Karpukhin, Jean Maillard, Vassilis Plachouras, Tim Rocktäschel, and Sebastian Riedel. Kilt: a benchmark for knowledge intensive language tasks. In Proceedings of the 2021 Conference of the North American Chapter of the Associati...

  63. [63]

    Qwen3.5: Towards native multimodal agents, February 2026

    Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https://qwen.ai/blog?id=qwen3.5

  64. [64]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Advances in Neural Information Processing Systems (NeurIPS), 2023

  65. [65]

    Vgrp-bench: Visual grid reasoning puzzle benchmark for large vision-language models

    Yufan Ren, Konstantinos Tertikas, Shalini Maiti, Junlin Han, Tong Zhang, Sabine Süsstrunk, and Filippos Kokkinos. Vgrp-bench: Visual grid reasoning puzzle benchmark for large vision-language models. arXiv preprint arXiv:2503.23064, 2025

  66. [66]

    Toolformer: Language models can teach themselves to use tools

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessi, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36, 2023

  67. [67]

    Evaluating memory structure in llm agents

    Alina Shutova, Alexandra Olenina, Ivan Vinogradov, and Anton Sinitsin. Evaluating memory structure in llm agents. arXiv preprint arXiv:2602.11243, 2026

  68. [68]

    A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play

    David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and Demis Hassabis. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419):1140–1144, 2018

  69. [69]

    SEAgent: Self-evolving computer use agent with autonomous learning from experience

    Zeyi Sun, Ziyu Liu, Yuhang Zang, Yuhang Cao, Xiaoyi Dong, Tong Wu, Dahua Lin, and Jiaqi Wang. SEAgent: Self-evolving computer use agent with autonomous learning from experience. arXiv preprint arXiv:2508.04700, 2025

  70. [70]

    Membench: Towards more comprehensive evaluation on the memory of llm-based agents

    Haoran Tan, Zeyu Zhang, Chen Ma, Xu Chen, Quanyu Dai, and Zhenhua Dong. Membench: Towards more comprehensive evaluation on the memory of llm-based agents. In Findings of the Association for Computational Linguistics: ACL 2025, pages 19336–19352, 2025

  71. [71]

    Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, S. H. Cai, Yuan Cao, Y. Charles, H. S. Che, Cheng Chen, Guanduo Chen, Huarong Chen, Jia Chen, Jiahao Chen, Jianlong Chen, Jun Chen, Kefan Chen, Liang Chen, Ruijue Chen, Xinhao Chen, Yanru Chen, Yanxu Chen, Yicun Chen, Yimin Chen, Yingjiang Chen, Yuankun Chen, Yujie Chen, Yutian Chen, Zhirong Chen, Ziwei Chen...

  72. [72]

    Beir: A heterogeneous benchmark for zero-shot evaluation of information retrieval models

    Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. Beir: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, 2021

  73. [73]

    Ada-leval: Evaluating long-context llms with length-adaptable benchmarks

    Chonghua Wang, Haodong Duan, Songyang Zhang, Dahua Lin, and Kai Chen. Ada-leval: Evaluating long-context llms with length-adaptable benchmarks. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 3712–3724, 2024

  74. [74]

    Multimodal needle in a haystack: Benchmarking long-context capability of multimodal large language models

    Hengyi Wang, Haizhou Shi, Shiwei Tan, Weiyi Qin, Wenyuan Wang, Tunyu Zhang, Akshay Nambi, Tanuja Ganu, and Hao Wang. Multimodal needle in a haystack: Benchmarking long-context capability of multimodal large language models. In Proceedings of the 2025 Conference of the Nations of the Americas...

  75. [75]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, volume 35, pages 24824–24837, 2022

  76. [76]

    VideoRoPE: What makes for good video rotary position embedding?

    Xilin Wei, Xiaoran Liu, Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Jian Tong, Haodong Duan, Qipeng Guo, Jiaqi Wang, Xipeng Qiu, and Dahua Lin. VideoRoPE: What makes for good video rotary position embedding? In Proceedings of the 42nd International Conference on Machine Learning, pages 66118–66136, 2025

  77. [77]

    SIM-CoT: Supervised implicit chain-of-thought

    Xilin Wei, Xiaoran Liu, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Jiaqi Wang, Xipeng Qiu, and Dahua Lin. SIM-CoT: Supervised implicit chain-of-thought. In International Conference on Learning Representations, 2026

  78. [78]

    Longmemeval: Benchmarking chat assistants on long-term interactive memory

    Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu. Longmemeval: Benchmarking chat assistants on long-term interactive memory. arXiv preprint arXiv:2410.10813, 2024

  79. [79]

    LongVideoBench: A benchmark for long-context interleaved video-language understanding

    Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. LongVideoBench: A benchmark for long-context interleaved video-language understanding. Advances in Neural Information Processing Systems, 37:28828–28857, 2024

  80. [80]

    SmartPlay: A benchmark for LLMs as intelligent agents

    Yue Wu, Xuan Tang, Tom M. Mitchell, and Yuanzhi Li. SmartPlay: A benchmark for LLMs as intelligent agents. In International Conference on Learning Representations, 2024

Showing first 80 references.