Beyond the Current Observation: Evaluating Multimodal Large Language Models in Controllable Non-Markov Games
Pith reviewed 2026-06-26 21:15 UTC · model grok-4.3
The pith
Multimodal models fail mostly by forgetting past observations in non-Markov games rather than making poor decisions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RNG-Bench isolates a base model's ability to reconstruct past observations and act on them during multi-step interaction through Matching Pairs and 3D Maze games evaluated under three controlled difficulty axes, with a head-to-head duel protocol and Memory Gap metric that disentangles forgetting from poor action selection. The hardest configurations require contexts of roughly 128K tokens and 350 image inputs per episode and remain far from saturated by frontier MLLMs, while Memory Gap analysis shows that most residual errors stem from forgetting earlier observations rather than from suboptimal decision making. Fine-tuning Qwen3.5-9B on optimal-policy rollouts and filtered model demonstrations improves performance on RNG-Bench and transfers to existing benchmarks without degrading general multimodal capability.
What carries the argument
RNG-Bench benchmark suite with its Memory Gap metric that disentangles forgetting from poor action selection, evaluated via head-to-head duel protocol across two games.
If this is right
- Hardest configurations require contexts of roughly 128K tokens and 350 image inputs per episode.
- Most residual errors stem from forgetting earlier observations rather than suboptimal decision making.
- Frontier MLLMs remain far from saturated on RNG-Bench.
- Fine-tuning on optimal-policy rollouts and filtered model demonstrations improves performance on RNG-Bench.
- Fine-tuning transfers to existing benchmarks without degrading general multimodal capability.
Where Pith is reading between the lines
- Architectures that retain visual history across hundreds of steps could close the observed performance gap more directly than further context scaling.
- The duel protocol and Memory Gap separation could be adapted to isolate memory effects in other agent benchmarks that currently mix multiple skills.
- Transfer after fine-tuning suggests the benchmark measures a generalizable reconstructive skill rather than game-specific memorization.
Load-bearing premise
The two games and Memory Gap metric successfully isolate reconstruction of past observations from other agent skills.
What would settle it
A model achieving near-zero Memory Gap scores on the hardest configurations while still producing high error rates traceable to decision errors in separate non-memory tasks would indicate the metric does not isolate forgetting as claimed.
Original abstract
Deploying multimodal foundation models as closed-loop policies increasingly requires conditioning actions on observations that are no longer visible. However, existing benchmarks either expose the full state, conflate hidden-state reconstruction with other agent skills, or test recall only after an episode has ended. We introduce RNG-Bench (Reconstructive Non-Markov Games), a benchmark suite designed to isolate a base model's ability to reconstruct past observations and act on them during multi-step interaction. RNG-Bench includes two complementary games: Matching Pairs, where card identities briefly revealed at specific locations must later be recalled, and 3D Maze, where egocentric views must be integrated into a spatial map. Both games are evaluated under a unified harness with three controlled difficulty axes: grid size, visual pattern, and observation modality. The benchmark further introduces a head-to-head duel protocol to control for instance-level variance and a Memory Gap metric that disentangles forgetting from poor action selection. The hardest configurations require contexts of roughly 128K tokens and 350 image inputs per episode, and remain far from saturated by frontier MLLMs. Memory Gap analysis shows that most residual errors stem from forgetting earlier observations rather than from suboptimal decision making. Finally, fine-tuning Qwen3.5-9B on optimal-policy rollouts and filtered model demonstrations improves performance on RNG-Bench and transfers to existing benchmarks without degrading general multimodal capability.
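To make the non-Markov property concrete, here is a minimal sketch of a Matching Pairs-style game. It is not the RNG-Bench harness: the class `MatchingPairs`, the integer card identities, the two-cards-per-step rule, and both agents are assumptions introduced for illustration. The point is that the current observation only shows which cards are still face down, so an agent must carry earlier reveals forward itself.

```python
import random


class MatchingPairs:
    """Toy Matching Pairs game (hypothetical; not the RNG-Bench implementation)."""

    def __init__(self, num_pairs, seed=0):
        self.cards = list(range(num_pairs)) * 2  # two copies of each identity
        random.Random(seed).shuffle(self.cards)
        self.matched = [False] * len(self.cards)
        self.steps = 0

    def observe(self):
        # Only the matched/unmatched status is visible; identities revealed
        # on earlier steps are hidden again.
        return tuple(self.matched)

    def flip(self, i, j):
        """Flip two distinct unmatched cards and return their identities."""
        assert i != j and not self.matched[i] and not self.matched[j]
        self.steps += 1
        if self.cards[i] == self.cards[j]:
            self.matched[i] = self.matched[j] = True
        return self.cards[i], self.cards[j]

    def done(self):
        return all(self.matched)


def play_with_memory(env):
    seen = {}  # position -> identity: the agent's reconstruction of hidden state
    while not env.done():
        unmatched = [p for p, m in enumerate(env.observe()) if not m]
        by_identity, pair = {}, None
        for p in unmatched:
            if p in seen:
                if seen[p] in by_identity:  # both copies already seen
                    pair = (by_identity[seen[p]], p)
                    break
                by_identity[seen[p]] = p
        if pair is None:  # nothing known to match: explore two unseen cards
            unseen = [p for p in unmatched if p not in seen]
            pair = (unseen[0], unseen[1])
        seen[pair[0]], seen[pair[1]] = env.flip(*pair)
    return env.steps


def play_without_memory(env, seed=0):
    rng = random.Random(seed)
    while not env.done():
        unmatched = [p for p, m in enumerate(env.observe()) if not m]
        env.flip(*rng.sample(unmatched, 2))  # acts on the current observation only
    return env.steps
```

In this toy version the memory agent needs at most two steps per pair, one to explore and one to match, while the memoryless agent can only guess among the face-down cards.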
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces RNG-Bench, a benchmark suite with Matching Pairs and 3D Maze games, to evaluate MLLMs on reconstructing and acting upon past observations in controllable non-Markov environments. It defines three difficulty axes (grid size, visual pattern, observation modality), a head-to-head duel protocol, and a Memory Gap metric intended to separate forgetting from suboptimal action selection. Experiments show frontier models remain far from saturation at scales of ~128K tokens and 350 images per episode, attribute most residual errors to forgetting, and report that fine-tuning Qwen3.5-9B on optimal-policy rollouts improves RNG-Bench performance while transferring to other benchmarks without degrading general multimodal capability.
Significance. If the Memory Gap metric validly isolates reconstruction failure, the benchmark would provide a useful diagnostic tool for memory limitations in closed-loop multimodal agents and a concrete path for improvement via targeted fine-tuning. The controlled axes and duel protocol are methodologically positive for reducing instance variance. The transfer results, if robust, would strengthen the practical value of the work.
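The variance-control idea behind a head-to-head duel can be sketched as follows. This is an illustration under assumptions, not the paper's protocol: the function `duel`, the lower-cost-wins scoring rule, and the toy search task are all hypothetical. What it shows is that both sides are scored on instances rebuilt from the same seed, so a lucky draw cannot favor one of them.

```python
import random


def duel(policy_a, policy_b, make_instance, seeds):
    """Run both policies on identical instances; the lower cost wins each duel."""
    tally = {"a": 0, "b": 0, "tie": 0}
    for seed in seeds:
        # Rebuild the instance from the same seed so neither side gets an easier draw.
        cost_a = policy_a(make_instance(seed))
        cost_b = policy_b(make_instance(seed))
        if cost_a < cost_b:
            tally["a"] += 1
        elif cost_b < cost_a:
            tally["b"] += 1
        else:
            tally["tie"] += 1
    return tally


# Toy stand-ins (hypothetical): find the position of value 0 in a shuffled list.
def make_instance(seed):
    cells = list(range(8))
    random.Random(seed).shuffle(cells)
    return cells


def scan_left(cells):
    return cells.index(0) + 1  # probes used when scanning from the left


def scan_right(cells):
    return len(cells) - cells.index(0)  # probes used when scanning from the right


result = duel(scan_left, scan_right, make_instance, seeds=range(20))
```

Pairing by seed means the comparison is made within each instance rather than across two independently sampled sets of instances.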
Major comments (2)
- [Abstract and §3] Abstract and §3 (Memory Gap metric): The central claim that 'most residual errors stem from forgetting earlier observations rather than from suboptimal decision making' depends on the metric successfully disentangling the two factors. The manuscript states the benchmark is 'designed to isolate' reconstruction 'without conflating hidden-state reconstruction with other agent skills' but supplies no derivation, game-theoretic argument, or ablation showing that optimal action sequences exist independently of accurate reconstruction of hidden observations in Matching Pairs or 3D Maze. If every optimal action requires exact recall of card identities or spatial layout, the attribution is circular rather than diagnostic.
- [§4] §4 (evaluation results): The statement that the hardest configurations 'remain far from saturated by frontier MLLMs' and the Memory Gap analysis are presented without reported error bars, number of independent runs, or statistical tests on the residual-error breakdown. This makes it impossible to assess whether the reported dominance of forgetting is robust or sensitive to sampling variance.
Minor comments (2)
- [Abstract] The abstract lists three controlled difficulty axes but does not enumerate the concrete parameter values used for grid size, visual pattern, and modality; these should be tabulated in §3 for reproducibility.
- [§3] Notation for the Memory Gap metric (e.g., how the 'gap' is computed from reconstruction-conditioned vs. unconditional policy error) should be formalized with an equation rather than described only in prose.
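One way the requested formalization could look is sketched below. The paper's actual definition is not given in this review, so this is only one possible reading: the function `memory_gap`, its argument names, and the "forgetting share" ratio are assumptions. It treats the gap as the difference between error when the model must recall history itself and error on the same instances when the hidden state is supplied.

```python
def memory_gap(unaided_errors, aided_errors):
    """Split unaided error into a forgetting part and a decision part.

    unaided_errors: per-episode error in [0, 1] when the model must recall
                    earlier observations on its own.
    aided_errors:   per-episode error on the same instances when the hidden
                    state is supplied, so what remains reflects action selection.
    This decomposition is a hypothetical reading, not the paper's definition.
    """
    if len(unaided_errors) != len(aided_errors) or not unaided_errors:
        raise ValueError("need equal-length, non-empty paired episodes")
    unaided = sum(unaided_errors) / len(unaided_errors)
    aided = sum(aided_errors) / len(aided_errors)
    gap = unaided - aided  # error attributed to forgetting
    share = gap / unaided if unaided > 0 else 0.0
    return {"unaided": unaided, "aided": aided, "gap": gap, "forgetting_share": share}


# Placeholder inputs for illustration only; these are not results from the paper.
example = memory_gap([0.5, 0.5], [0.25, 0.25])
```

Under this reading, a forgetting share above one half would correspond to the claim that most residual error comes from forgetting. The referee's circularity concern still applies, because the aided condition must leave action selection unchanged for the split to be meaningful.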
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on RNG-Bench and the Memory Gap metric. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core claims.
Point-by-point responses
- Referee: [Abstract and §3] Abstract and §3 (Memory Gap metric): The central claim that 'most residual errors stem from forgetting earlier observations rather than from suboptimal decision making' depends on the metric successfully disentangling the two factors. The manuscript states the benchmark is 'designed to isolate' reconstruction 'without conflating hidden-state reconstruction with other agent skills' but supplies no derivation, game-theoretic argument, or ablation showing that optimal action sequences exist independently of accurate reconstruction of hidden observations in Matching Pairs or 3D Maze. If every optimal action requires exact recall of card identities or spatial layout, the attribution is circular rather than diagnostic.
Authors: We agree that the current manuscript lacks an explicit derivation or ablation demonstrating independence between optimal action selection and full hidden-state reconstruction. The Memory Gap is computed by comparing model performance against an oracle policy that receives the complete observation history, but we did not include a formal argument showing that partial information suffices for some optimal moves in either game. We will add an appendix with (i) a game-theoretic definition of the metric, (ii) a proof sketch that optimal policies in Matching Pairs can exploit partial matches without full recall, and (iii) an ablation that perturbs reconstruction accuracy while holding action selection fixed. This addresses the circularity concern directly.
Revision: yes
- Referee: [§4] §4 (evaluation results): The statement that the hardest configurations 'remain far from saturated by frontier MLLMs' and the Memory Gap analysis are presented without reported error bars, number of independent runs, or statistical tests on the residual-error breakdown. This makes it impossible to assess whether the reported dominance of forgetting is robust or sensitive to sampling variance.
Authors: The referee is correct that the submitted version omits error bars, run counts, and statistical tests for the Memory Gap breakdown. We will revise §4 to report results over five independent random seeds per configuration, include standard-error bars on all bar plots, and add paired t-tests (with p-values) comparing the forgetting versus decision-making error components to establish that the dominance of forgetting is statistically significant rather than an artifact of sampling variance.
Revision: yes
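The analysis promised in this response can be sketched with the Python standard library alone. The helper names `standard_error` and `paired_t` and the per-seed input layout are assumptions, not the authors' code. The sketch computes a standard error for error bars and the paired t statistic over per-seed differences between the forgetting and decision-making error components.

```python
import math
import statistics


def standard_error(values):
    """Sample standard deviation divided by sqrt(n), as used for error bars."""
    return statistics.stdev(values) / math.sqrt(len(values))


def paired_t(a, b):
    """Return (t statistic, degrees of freedom) for paired samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need at least two paired observations")
    diffs = [x - y for x, y in zip(a, b)]  # one difference per seed
    se = standard_error(diffs)
    if se == 0:
        raise ValueError("differences have zero variance; t is undefined")
    return statistics.mean(diffs) / se, len(diffs) - 1
```

With five seeds the test has 4 degrees of freedom, so a two-sided test at the 5% level requires |t| above roughly 2.776. If SciPy is available, `scipy.stats.ttest_rel` performs the same paired test and also returns the p-value.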
Circularity Check
No circularity: new benchmark and metric are independently defined
Full rationale
The paper defines RNG-Bench, its two games, the duel protocol, and Memory Gap metric from first principles in the abstract and introduction to isolate reconstruction from other skills. No equations, fitted parameters, or self-citations are shown that reduce the Memory Gap attribution or performance claims to quantities defined by the authors' own prior work or by construction. The central empirical finding (most errors from forgetting) is presented as an outcome of running the benchmark on frontier models rather than a definitional tautology. The derivation chain is therefore self-contained against external model evaluations.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: The proposed games isolate reconstruction ability from other agent skills.
Invented entities (2)
- RNG-Bench: no independent evidence
- Memory Gap metric: no independent evidence
Reference graph
Works this paper leans on
-
[1]
Qingyao Ai, Yichen Tang, Changyue Wang, Jianming Long, Weihang Su, and Yiqun Liu. Memorybench: A benchmark for memory and continual learning in llm systems.arXiv preprint arXiv:2510.17281, 2025
Pith/arXiv arXiv 2025
-
[2]
Flamingo: a visual language model for few-shot learning.Advances in Neural Information Processing Systems, 35:23716–23736, 2022
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning.Advances in Neural Information Processing Systems, 35:23716–23736, 2022
2022
-
[3]
L-eval: Instituting standardized evaluation for long context language models
Chenxin An, Shansan Gong, Ming Zhong, Xingjian Zhao, Mukai Li, Jun Zhang, Lingpeng Kong, and Xipeng Qiu. L-eval: Instituting standardized evaluation for long context language models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14388–14411, 2024
2024
-
[4]
Longbench: A bilingual, multitask benchmark for long context understanding
Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. Longbench: A bilingual, multitask benchmark for long context understanding. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages ...
2024
-
[5]
Longbench v2: Towards deeper understanding and reasoning on realistic long-context multitasks
Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. Longbench v2: Towards deeper understanding and reasoning on realistic long-context multitasks. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), page...
2025
-
[6]
Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, et al. Ms marco: A human generated machine reading comprehension dataset.arXiv preprint arXiv:1611.09268, 2016
Pith/arXiv arXiv 2016
-
[7]
Seed-2.0-Lite-260428: Omni-modal understanding across video, image, audio, and text, April 2026
ByteDance Seed. Seed-2.0-Lite-260428: Omni-modal understanding across video, image, audio, and text, April 2026. URLhttps://seed.bytedance.com/en/seed2
2026
-
[8]
clembench: Using game play to evaluate chat-optimized language models as conversational agents
Kranti Chalamalasetti, Jana Götze, Sherzod Hakimov, Brielen Madureira, Philipp Sadler, and David Schlangen. clembench: Using game play to evaluate chat-optimized language models as conversational agents. InProceedings of the 2023 conference on empirical methods in natural language processing, pages 11174–11219, 2023
2023
-
[9]
SpatialVLM: Endowing vision-language models with spatial reasoning capabilities
Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. SpatialVLM: Endowing vision-language models with spatial reasoning capabilities. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024
2024
-
[10]
ShareGPT4Video: Improving video understanding and generation with better captions.Advances in Neural Information Processing Systems, 37:19472–19495, 2024
Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Bin Lin, Zhenyu Tang, Li Yuan, Yu Qiao, Dahua Lin, Feng Zhao, and Jiaqi Wang. ShareGPT4Video: Improving video understanding and generation with better captions.Advances in Neural Information Processing Systems, 37:19472–19495, 2024. 10 Beyond the Current Observ...
2024
-
[11]
Yang Chen, Minghao Liu, Yufan Shen, Yunwen Li, Tianyuan Huang, Xinyu Fang, Tianyu Zheng, Wenxuan Huang, Cheng Yang, Daocheng Fu, et al. Iwr-bench: Can lvlms reconstruct interactive webpage from a user interaction video?arXiv preprint arXiv:2509.24709, 2025
arXiv 2025
-
[12]
Anthony Costarelli, Mat Allen, Roman Hauksson, Grace Sodunke, Suhas Hariharan, Carlson Cheng, Wenjie Li, and Arjun Yadav. Gamebench: Evaluating strategic reasoning abilities of llm agents.arXiv preprint arXiv:2406.06613, 2024
arXiv 2024
-
[13]
Xuanlang Dai, Yujie Zhou, Long Xing, Jiazi Bu, Xilin Wei, Yuhong Liu, Beichen Zhang, Kai Chen, and Yuhang Zang. EndoCoT: Scaling endogenous chain-of-thought reasoning in diffusion models.arXiv preprint arXiv:2603.12252, 2026
Pith/arXiv arXiv 2026
-
[14]
Mind2web: Towards a generalist agent for the web.Advances in Neural Information Processing Systems, 36, 2023
Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web.Advances in Neural Information Processing Systems, 36, 2023
2023
-
[15]
MM-IFEngine: Towards multimodal instruction following
Shengyuan Ding, Shenxi Wu, Xiangyu Zhao, Yuhang Zang, Haodong Duan, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Dahua Lin, and Jiaqi Wang. MM-IFEngine: Towards multimodal instruction following. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025
2025
-
[16]
ARM-Thinker: Reinforcing multimodal generative reward models with agentic tool use and visual reasoning
Shengyuan Ding, Xinyu Fang, Ziyu Liu, Yuhang Zang, Yuhang Cao, Xiangyu Zhao, Haodong Duan, Xiaoyi Dong, Jiaqi Liang, et al. ARM-Thinker: Reinforcing multimodal generative reward models with agentic tool use and visual reasoning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026
2026
-
[17]
Shuangrui Ding, Xuanlang Dai, Long Xing, Shengyuan Ding, Ziyu Liu, Yang JingYi, Penghui Yang, Zhixiong Zhang, Xilin Wei, Xinyu Fang, et al. WildClawBench: A benchmark for real-world, long-horizon agent evaluation.arXiv preprint arXiv:2605.10912, 2026
Pith/arXiv arXiv 2026
-
[18]
Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, HaodongDuan, MaosongCao, WenweiZhang, YiningLi, HangYan, YangGao, XinyueZhang, Wei Li, Jingwen Li, Kai Chen, Conghui He, Xingcheng Zhang, Yu Qiao, Dahua Lin, and Jiaqi Wang. InternLM- XComposer2: Mastering free-form text-image composition and comprehension ...
Pith/arXiv arXiv 2024
-
[19]
VLMEvalKit: An open-source toolkit for evaluating large multi-modality models
HaodongDuan, XinyuFang, JunmingYang, XiangyuZhao, YuxuanQiao, MoLi, AmitAgarwal, ZheChen, Lin Chen, Yuan Liu, Yubo Ma, Hailong Sun, Yifan Zhang, Shiyin Lu, Tack Hwa Wong, Weiyun Wang, Peiheng Zhou, Xiaozhe Li, Chaoyou Fu, Junbo Cui, Jixuan Chen, Enxin Song, Song Mao, Shengyuan Ding, Tianhao Liang, Zicheng Zhang, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi ...
2024
-
[20]
Jinhao Duan, Renming Zhang, James Diffenderfer, Bhavya Kailkhura, Lichao Sun, Elias Stengel-Eskin, Mohit Bansal, Tianlong Chen, and Kaidi Xu. GTBench: Uncovering the strategic reasoning limitations of LLMs via game-theoretic evaluations.arXiv preprint arXiv:2402.12348, 2024
arXiv 2024
-
[21]
Mmbench- video: Along-formmulti-shotbenchmarkforholisticvideounderstanding.AdvancesinNeuralInformation Processing Systems, 37:89098–89124, 2024
Xinyu Fang, Kangrui Mao, Haodong Duan, Xiangyu Zhao, Yining Li, Dahua Lin, and Kai Chen. Mmbench- video: Along-formmulti-shotbenchmarkforholisticvideounderstanding.AdvancesinNeuralInformation Processing Systems, 37:89098–89124, 2024
2024
-
[22]
Creation-mmbench: Assessing context-aware creative intelligence in mllms
Xinyu Fang, Zhijian Chen, Kai Lan, Lixin Ma, Shengyuan Ding, Yingji Liang, Xiangyu Zhao, Farong Wen, Zicheng Zhang, Guofeng Zhang, Haodong Duan, Kai Chen, and Dahua Lin. Creation-mmbench: Assessing context-aware creative intelligence in mllms. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 447–456, October 2025
2025
-
[23]
Gemini 3.1 Pro model card, February 2026
Google DeepMind. Gemini 3.1 Pro model card, February 2026. URLhttps://deepmind.google/ models/model-cards/gemini-3-1-pro/
2026
-
[24]
TextArena.arXiv preprint arXiv:2504.11442, 2025
Leon Guertler, Bobby Cheng, Simon Yu, Bo Liu, Leshem Choshen, and Cheston Tan. TextArena.arXiv preprint arXiv:2504.11442, 2025. 11 Beyond the Current Observation: Evaluating Multimodal Large Language Models in Controllable Non-Markov Games
arXiv 2025
-
[25]
Daya Guo, Dejian Yang, Haowei Zhang, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025
Pith/arXiv arXiv 2025
-
[26]
Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. Ruler: What’s the real context size of your long-context language models?arXiv preprint arXiv:2404.06654, 2024
Pith/arXiv arXiv 2024
-
[27]
lmgame-bench: How good are llms at playing games?arXiv preprint arXiv:2505.15146, 2025
Lanxiang Hu, Mingjia Huo, Yuxuan Zhang, Haoyang Yu, Eric P Xing, Ion Stoica, Tajana Rosing, Hao- jian Jin, and Hao Zhang. lmgame-bench: How good are llms at playing games?arXiv preprint arXiv:2505.15146, 2025
arXiv 2025
-
[28]
Yuanzhe Hu, Yu Wang, and Julian McAuley. Evaluating memory in llm agents via incremental multi-turn interactions.arXiv preprint arXiv:2507.05257, 2025
Pith/arXiv arXiv 2025
-
[29]
Smith, and Ranjay Krishna
Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A. Smith, and Ranjay Krishna. Visual sketchpad: Sketching as a visual chain of thought for multimodal language models. InAdvances in Neural Information Processing Systems (NeurIPS), 2024
2024
-
[30]
Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension
Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. InProceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, 2017
2017
-
[31]
Littman, and Anthony R
Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra. Planning and acting in partially observable stochastic domains.Artificial Intelligence, 101(1–2):99–134, 1998
1998
-
[32]
Large language models are zero-shot reasoners
Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. InAdvances in Neural Information Processing Systems, volume 35, pages 22199–22213, 2022
2022
-
[33]
M4le: A multi-ability multi-range multi-task multi-domain long-context evaluation benchmark for large language models
Wai-Chung Kwan, Xingshan Zeng, Yufei Wang, Yusen Sun, Liangyou Li, Yuxin Jiang, Lifeng Shang, Qun Liu, and Kam-Fai Wong. M4le: A multi-ability multi-range multi-task multi-domain long-context evaluation benchmark for large language models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pa...
2024
-
[34]
Natural questions: A benchmark for question answering research.Transactions of the Association for Computational Linguistics, 7:453–466, 2019
Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, DanielleEpstein, IlliaPolosukhin, MatthewKelcey, JacobDevlin, KentonLee, KristinaN.Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural questions: A benchmark for question answering research.Transactions of...
2019
-
[35]
Jinsong Li, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Jiaqi Wang, and Dahua Lin. Beyond fixed: Training- free variable-length denoising for diffusion large language models.arXiv preprint arXiv:2508.00819, 2025
arXiv 2025
-
[36]
Jinsong Li, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Jiaqi Wang, and Dahua Lin. Visual self-refine: A pixel-guided paradigm for accurate chart parsing.arXiv preprint arXiv:2602.16455, 2026
arXiv 2026
-
[37]
VideoChat: Chat-centric video understanding.arXiv preprint arXiv:2305.06355, 2023
KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. VideoChat: Chat-centric video understanding.arXiv preprint arXiv:2305.06355, 2023
Pith/arXiv arXiv 2023
-
[38]
Xiaozhe Li, Jixuan Chen, Xinyu Fang, Shengyuan Ding, Haodong Duan, Qingwen Liu, and Kai Chen. OPT-Bench: Evaluating LLM agent on large-scale search spaces optimization problems.arXiv preprint arXiv:2506.10764, 2025
arXiv 2025
-
[39]
Xiaozhe Li, Xinyu Fang, Shengyuan Ding, Linyang Li, Haodong Duan, Qingwen Liu, and Kai Chen. NP-Engine: Empowering optimization reasoning in large language models with verifiable synthetic NP problems.arXiv preprint arXiv:2510.16476, 2025
arXiv 2025
-
[40]
Xinze Li, Ziyue Zhu, Siyuan Liu, Yubo Ma, Yuhang Zang, Yixin Cao, and Aixin Sun. Emembench: Interactive benchmarking of episodic memory for vlm agents.arXiv preprint arXiv:2601.16690, 2026. 12 Beyond the Current Observation: Evaluating Multimodal Large Language Models in Controllable Non-Markov Games
arXiv 2026
-
[41]
AvalonBench: Evaluating LLMs playing the game of Avalon.arXiv preprint arXiv:2310.05036, 2023
Jonathan Light, Min Cai, Sheng Shen, and Ziniu Hu. AvalonBench: Evaluating LLMs playing the game of Avalon.arXiv preprint arXiv:2310.05036, 2023
arXiv 2023
-
[42]
Visual instruction tuning
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. InAdvances in Neural Information Processing Systems, volume 36, 2023
2023
-
[43]
Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang
Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts.Transactions of the Association for Computational Linguistics, 12:157–173, 2024
2024
-
[44]
Agentbench: Evaluating llms as agents
Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. Agentbench: Evaluating llms as agents. InInternational Conference on Learning Representations, 2024
2024
-
[45]
Spatial-SSRL: Enhancing spatial understanding via self-supervised reinforcement learning
Yuhong Liu, Beichen Zhang, Yuhang Zang, Yuhang Cao, Long Xing, Xiaoyi Dong, Haodong Duan, Dahua Lin, and Jiaqi Wang. Spatial-SSRL: Enhancing spatial understanding via self-supervised reinforcement learning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026
2026
-
[46]
STAR-Bench: Probing deep spatio- temporal reasoning as audio 4d intelligence
Zihan Liu, Zhikang Niu, Qiuyang Xiao, Zhisheng Zheng, Ruoqi Yuan, Yuhang Zang, Yuhang Cao, Xiaoyi Dong, Jianze Liang, Xie Chen, Leilei Sun, Dahua Lin, and Jiaqi Wang. STAR-Bench: Probing deep spatio- temporal reasoning as audio 4d intelligence. InInternational Conference on Learning Representations (ICLR), 2026
2026
-
[47]
MMDU: A multi-turn multi-image dialog understanding benchmark and instruction-tuning dataset for LVLMs.Advances in Neural Information Processing Systems, 37: 8698–8733, 2024
Ziyu Liu, Tao Chu, Yuhang Zang, Xilin Wei, Xiaoyi Dong, Pan Zhang, Zijian Liang, Yuanjun Xiong, Yu Qiao, Dahua Lin, and Jiaqi Wang. MMDU: A multi-turn multi-image dialog understanding benchmark and instruction-tuning dataset for LVLMs.Advances in Neural Information Processing Systems, 37: 8698–8733, 2024
2024
-
[48]
Visual-RFT: Visual reinforcement fine-tuning
Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-RFT: Visual reinforcement fine-tuning. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2034–2044, 2025
2034
-
[49]
SPARK: Synergistic policy and reward co-evolving framework.arXiv preprint arXiv:2509.22624, 2025
ZiyuLiu, YuhangZang, ShengyuanDing, YuhangCao, XiaoyiDong, HaodongDuan, DahuaLin, andJiaqi Wang. SPARK: Synergistic policy and reward co-evolving framework.arXiv preprint arXiv:2509.22624, 2025
arXiv 2025
-
[50]
MIA-DPO: Multi-image augmented direct preference optimization for large vision-language models
Ziyu Liu, Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Haodong Duan, Conghui He, Yuanjun Xiong, Dahua Lin, and Jiaqi Wang. MIA-DPO: Multi-image augmented direct preference optimization for large vision-language models. InInternational Conference on Learning Representations (ICLR), 2025
2025
-
[51]
Visual-ERM: Reward modeling for visual equivalence.arXiv preprint arXiv:2603.13224, 2026
Ziyu Liu, Shengyuan Ding, Xinyu Fang, Xuanlang Dai, Penghui Yang, Jianze Liang, Jiaqi Wang, Kai Chen, Dahua Lin, and Yuhang Zang. Visual-ERM: Reward modeling for visual equivalence.arXiv preprint arXiv:2603.13224, 2026
Pith/arXiv arXiv 2026
-
[52]
Mmlongbench-doc: Benchmarking long-context document understanding with visualizations.Advances in Neural Information Processing Systems, 37:95963–96010, 2024
Yubo Ma, Yuhang Zang, Liangyu Chen, Meiqi Chen, Yizhu Jiao, Xinze Li, Xinyuan Lu, Ziyu Liu, Yan Ma, Xiaoyi Dong, et al. Mmlongbench-doc: Benchmarking long-context document understanding with visualizations.Advances in Neural Information Processing Systems, 37:95963–96010, 2024
2024
-
[53]
Evaluating very long-term conversational memory of llm agents
Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. Evaluating very long-term conversational memory of llm agents. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13851–13870, 2024
2024
-
[54]
Rossi, Seunghyun Yoon, and Hinrich Schütze
Ali Modarressi, Hanieh Deilamsalehy, Franck Dernoncourt, Trung Bui, Ryan A. Rossi, Seunghyun Yoon, and Hinrich Schütze. Nolima: Long-context evaluation beyond literal matching. InInternational Conference on Machine Learning, 2025
2025
-
[55]
Largelanguagediffusionmodels.arXivpreprintarXiv:2502.09992, 2025
ShenNie,FengqiZhu,ZebinYou,etal. Largelanguagediffusionmodels.arXivpreprintarXiv:2502.09992, 2025. 13 Beyond the Current Observation: Evaluating Multimodal Large Language Models in Controllable Non-Markov Games
Pith/arXiv arXiv 2025
-
[56]
Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, Charles Sutton, and Augustus Odena. Show your work: Scratchpads for intermediate computation with language models.arXiv preprint arXiv:2112.00114, 2021
Pith/arXiv arXiv 2021
-
[57]
Introducing gpt-5.4, March 2026
OpenAI. Introducing gpt-5.4, March 2026. URL https://openai.com/index/ introducing-gpt-5-4/
2026
-
[58]
Training language models to follow instructions with human feedback
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. InAdvances in Neural Information Processing Systems (NeurIPS), 2022
2022
-
[59]
Mingyu Ouyang, Siyuan Hu, Kevin Qinghong Lin, Hwee Tou Ng, and Mike Zheng Shou. Game- world: Towards standardized and verifiable evaluation of multimodal game agents.arXiv preprint arXiv:2604.07429, 2026
Pith/arXiv arXiv 2026
-
[60]
Balrog: Benchmarking agentic llm and vlm reasoning on games.arXiv preprint arXiv:2411.13543, 2024
Davide Paglieri, Bartlomiej Cupial, Samuel Coward, Ulyana Piterbarg, Maciej Wolczyk, Akbir Khan, Eduardo Pignatelli, Lukasz Kucinski, Lerrel Pinto, Rob Fergus, Jakob Nicolaus Foerster, Jack Parker- Holder, and Tim Rocktaschel. Balrog: Benchmarking agentic llm and vlm reasoning on games.arXiv preprint arXiv:2411.13543, 2024
arXiv 2024
-
[61]
Do the rewards justify the means? Measuring trade-offs between rewards and ethical behavior in the MACHIAVELLI benchmark
Alexander Pan, Jun Shern Chan, Andy Zou, Nathaniel Li, Steven Basart, Thomas Woodside, Jonathan Ng, Hanlin Zhang, Scott Emmons, and Dan Hendrycks. Do the rewards justify the means? Measuring trade-offs between rewards and ethical behavior in the MACHIAVELLI benchmark. InInternational Conference on Machine Learning, 2023
2023
-
[62]
Kilt: a benchmark for knowledge intensive language tasks
Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick Lewis, Majid Yazdani, Nicola De Cao, James Thorne, Yacine Jernite, Vladimir Karpukhin, Jean Maillard, Vassilis Plachouras, Tim Rocktäschel, and Sebastian Riedel. Kilt: a benchmark for knowledge intensive language tasks. InProceedings of the 2021 Conference of the North American Chapter of the Associati...
2021
-
[63]
Qwen3.5: Towards native multimodal agents, February 2026
Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URLhttps://qwen.ai/ blog?id=qwen3.5
2026
-
[64]
Manning, Stefano Ermon, and Chelsea Finn
Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. InAdvances in Neural Information Processing Systems (NeurIPS), 2023
2023
-
[65]
Yufan Ren, Konstantinos Tertikas, Shalini Maiti, Junlin Han, Tong Zhang, Sabine Süsstrunk, and Filippos Kokkinos. Vgrp-bench: Visual grid reasoning puzzle benchmark for large vision-language models. arXiv preprint arXiv:2503.23064, 2025
arXiv 2025
-
[66]
Timo Schick, Jane Dwivedi-Yu, Roberto Dessi, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36, 2023
2023
-
[67]
Alina Shutova, Alexandra Olenina, Ivan Vinogradov, and Anton Sinitsin. Evaluating memory structure in llm agents. arXiv preprint arXiv:2602.11243, 2026
Pith/arXiv arXiv 2026
-
[68]
David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and Demis Hassabis. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419):1140–1144, 2018
2018
-
[69]
Zeyi Sun, Ziyu Liu, Yuhang Zang, Yuhang Cao, Xiaoyi Dong, Tong Wu, Dahua Lin, and Jiaqi Wang. SEAgent: Self-evolving computer use agent with autonomous learning from experience. arXiv preprint arXiv:2508.04700, 2025
arXiv 2025
-
[70]
Haoran Tan, Zeyu Zhang, Chen Ma, Xu Chen, Quanyu Dai, and Zhenhua Dong. Membench: Towards more comprehensive evaluation on the memory of llm-based agents. In Findings of the Association for Computational Linguistics: ACL 2025, pages 19336–19352, 2025.
2025
-
[71]
Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, S. H. Cai, Yuan Cao, Y. Charles, H. S. Che, Cheng Chen, Guanduo Chen, Huarong Chen, Jia Chen, Jiahao Chen, Jianlong Chen, Jun Chen, Kefan Chen, Liang Chen, Ruijue Chen, Xinhao Chen, Yanru Chen, Yanxu Chen, Yicun Chen, Yimin Chen, Yingjiang Chen, Yuankun Chen, Yujie Chen, Yutian Chen, Zhirong Chen, Ziwei Chen...
Pith/arXiv arXiv 2026
-
[72]
Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. Beir: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, 2021
2021
-
[73]
Chonghua Wang, Haodong Duan, Songyang Zhang, Dahua Lin, and Kai Chen. Ada-leval: Evaluating long-context llms with length-adaptable benchmarks. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 3712–3724, 2024
2024
-
[74]
Hengyi Wang, Haizhou Shi, Shiwei Tan, Weiyi Qin, Wenyuan Wang, Tunyu Zhang, Akshay Nambi, Tanuja Ganu, and Hao Wang. Multimodal needle in a haystack: Benchmarking long-context capability of multimodal large language models. In Proceedings of the 2025 Conference of the Nations of the Americas...
2025
-
[75]
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, volume 35, pages 24824–24837, 2022
2022
-
[76]
Xilin Wei, Xiaoran Liu, Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Jian Tong, Haodong Duan, Qipeng Guo, Jiaqi Wang, Xipeng Qiu, and Dahua Lin. VideoRoPE: What makes for good video rotary position embedding? In Proceedings of the 42nd International Conference on Machine Learning, pages 66118–66136, 2025
2025
-
[77]
Xilin Wei, Xiaoran Liu, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Jiaqi Wang, Xipeng Qiu, and Dahua Lin. SIM-CoT: Supervised implicit chain-of-thought. In International Conference on Learning Representations, 2026
2026
-
[78]
Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu. Longmemeval: Benchmarking chat assistants on long-term interactive memory. arXiv preprint arXiv:2410.10813, 2024
Pith/arXiv arXiv 2024
-
[79]
Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. LongVideoBench: A benchmark for long-context interleaved video-language understanding. Advances in Neural Information Processing Systems, 37:28828–28857, 2024
2024
-
[80]
Yue Wu, Xuan Tang, Tom M. Mitchell, and Yuanzhi Li. SmartPlay: A benchmark for LLMs as intelligent agents. In International Conference on Learning Representations, 2024
2024