Beyond the Current Observation: Evaluating Multimodal Large Language Models in Controllable Non-Markov Games

Dahua Lin; Haodong Duan; Jiaqi Wang; Shengyuan Ding; Xilin Wei; Xinyu Fang; Yuhang Zang

arxiv: 2606.19338 · v1 · pith:VBDEKFGInew · submitted 2026-06-17 · 💻 cs.CV

Shengyuan Ding, Xilin Wei, Xinyu Fang, Haodong Duan, Dahua Lin, Jiaqi Wang, Yuhang Zang

Pith reviewed 2026-06-26 21:15 UTC · model grok-4.3

classification 💻 cs.CV

keywords: RNG-Bench, multimodal large language models, non-Markov games, memory reconstruction, Memory Gap metric, fine-tuning transfer, 3D Maze, agent evaluation

The pith

Multimodal models fail mostly by forgetting past observations in non-Markov games rather than making poor decisions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RNG-Bench to test how well multimodal large language models reconstruct and act on observations no longer visible during ongoing interactions. It uses two games, Matching Pairs and 3D Maze, under controlled axes of difficulty to isolate hidden-state reconstruction from other skills. A Memory Gap metric and duel protocol separate forgetting from suboptimal choices, revealing that most errors trace to lost earlier inputs. The hardest settings demand 128K-token contexts and hundreds of images, where frontier models still fall short. Fine-tuning on optimal demonstrations raises scores on this benchmark and transfers to others without harming general capabilities.

Core claim

RNG-Bench isolates a base model's ability to reconstruct past observations and act on them during multi-step interaction through Matching Pairs and 3D Maze games evaluated under three controlled difficulty axes, with a head-to-head duel protocol and a Memory Gap metric that disentangles forgetting from poor action selection. The hardest configurations require contexts of roughly 128K tokens and 350 image inputs per episode and remain far from saturated by frontier MLLMs, while Memory Gap analysis shows that most residual errors stem from forgetting earlier observations rather than from suboptimal decision making. Fine-tuning Qwen3.5-9B on optimal-policy rollouts and filtered model demonstrations improves performance on RNG-Bench and transfers to existing benchmarks without degrading general multimodal capability.

What carries the argument

RNG-Bench benchmark suite with its Memory Gap metric that disentangles forgetting from poor action selection, evaluated via head-to-head duel protocol across two games.

If this is right

Hardest configurations require contexts of roughly 128K tokens and 350 image inputs per episode.
Most residual errors stem from forgetting earlier observations rather than suboptimal decision making.
Frontier MLLMs remain far from saturated on RNG-Bench.
Fine-tuning on optimal-policy rollouts and filtered model demonstrations improves performance on RNG-Bench.
Fine-tuning transfers to existing benchmarks without degrading general multimodal capability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Architectures that retain visual history across hundreds of steps could close the observed performance gap more directly than further context scaling.
The duel protocol and Memory Gap separation could be adapted to isolate memory effects in other agent benchmarks that currently mix multiple skills.
Transfer after fine-tuning suggests the benchmark measures a generalizable reconstructive skill rather than game-specific memorization.

Load-bearing premise

The two games and Memory Gap metric successfully isolate reconstruction of past observations from other agent skills.

What would settle it

A model achieving near-zero Memory Gap scores on the hardest configurations while still producing high error rates traceable to decision errors in separate non-memory tasks would indicate the metric does not isolate forgetting as claimed.

Original abstract

Deploying multimodal foundation models as closed-loop policies increasingly requires conditioning actions on observations that are no longer visible. However, existing benchmarks either expose the full state, conflate hidden-state reconstruction with other agent skills, or test recall only after an episode has ended. We introduce RNG-Bench (Reconstructive Non-Markov Games), a benchmark suite designed to isolate a base model's ability to reconstruct past observations and act on them during multi-step interaction. RNG-Bench includes two complementary games: Matching Pairs, where card identities briefly revealed at specific locations must later be recalled, and 3D Maze, where egocentric views must be integrated into a spatial map. Both games are evaluated under a unified harness with three controlled difficulty axes: grid size, visual pattern, and observation modality. The benchmark further introduces a head-to-head duel protocol to control for instance-level variance and a Memory Gap metric that disentangles forgetting from poor action selection. The hardest configurations require contexts of roughly 128K tokens and 350 image inputs per episode, and remain far from saturated by frontier MLLMs. Memory Gap analysis shows that most residual errors stem from forgetting earlier observations rather than from suboptimal decision making. Finally, fine-tuning Qwen3.5-9B on optimal-policy rollouts and filtered model demonstrations improves performance on RNG-Bench and transfers to existing benchmarks without degrading general multimodal capability.
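To make the non-Markov property concrete, here is a minimal text-only sketch of a Matching Pairs board. It is an illustration under stated assumptions, not the benchmark's harness: the class `MatchingPairs`, the two-flips-per-turn rule, and both agents are hypothetical names and choices, and the real benchmark renders images rather than integer identities.

```python
import random


class MatchingPairs:
    """Toy board (hypothetical): 2 * n_pairs face-down cards, two flips per turn."""

    def __init__(self, n_pairs, seed=0):
        self.cards = [i // 2 for i in range(2 * n_pairs)]  # hidden identities
        random.Random(seed).shuffle(self.cards)
        self.matched = [False] * len(self.cards)
        self.turns = 0

    def flip(self, a, b):
        """Reveal two distinct unmatched cards. Identities are visible only in
        the return value; afterwards the cards are face-down again."""
        assert a != b and not self.matched[a] and not self.matched[b]
        self.turns += 1
        if self.cards[a] == self.cards[b]:
            self.matched[a] = self.matched[b] = True
        return self.cards[a], self.cards[b]

    def done(self):
        return all(self.matched)


def play_with_memory(env):
    """Agent that keeps every revealed identity (position -> identity)."""
    seen = {}
    while not env.done():
        live = [p for p in sorted(seen) if not env.matched[p]]
        first_pos, pair = {}, None
        for p in live:  # look for two remembered cards with the same identity
            if seen[p] in first_pos:
                pair = (first_pos[seen[p]], p)
                break
            first_pos[seen[p]] = p
        if pair is None:  # nothing known to match: explore unseen positions
            unseen = [p for p in range(len(env.cards)) if p not in seen]
            second = unseen[1] if len(unseen) > 1 else live[0]
            pair = (unseen[0], second)
        seen[pair[0]], seen[pair[1]] = env.flip(*pair)
    return env.turns


def play_memoryless(env, seed=0):
    """Agent that acts on the current observation only (which cards are matched)."""
    rng = random.Random(seed)
    while not env.done():
        face_down = [p for p, m in enumerate(env.matched) if not m]
        env.flip(*rng.sample(face_down, 2))
    return env.turns
```

The memory agent finishes in at most `2 * n_pairs` turns because every card needs to be revealed at most once before it can be paired. The memoryless agent sees the same current observation at every step and has no way to do better than guessing, which is the gap the benchmark is built to expose.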

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

RNG-Bench is a reasonable new testbed for memory in closed-loop MLLM agents, but the Memory Gap metric's claim to isolate forgetting from decision errors rests on an assumption that may not hold in the chosen games.

The paper's core offering is RNG-Bench, built around Matching Pairs and 3D Maze, with a duel protocol and a Memory Gap metric meant to attribute errors to forgetting rather than bad policy. The authors also run a fine-tuning experiment on Qwen3.5-9B that shows gains on the new suite plus transfer to other benchmarks.

The setup addresses a real gap: most existing tests either give the full state or evaluate recall only at the end. Adding controlled axes for grid size, pattern, and modality, plus the long-context regimes (128k tokens, 350 images), gives a practical way to stress memory during interaction. The transfer result without general capability loss is the kind of concrete check that matters.

The soft spot is the separation the metric relies on. If every optimal action in these games depends on correctly reconstructing the hidden card or layout, then reconstruction failure and suboptimal action are the same thing, and the residual attribution to "forgetting rather than decision making" stops being diagnostic. The abstract presents the benchmark as designed to isolate reconstruction from other agent skills, but without seeing the explicit validation or counter-example sequences in the methods, it is hard to know whether that separation was actually demonstrated or just assumed. That is the load-bearing claim, so it needs to be shown, not stated.

The work is aimed at people building or evaluating multimodal agents that must act over hidden history. It is worth sending to peer review because the benchmark direction is useful and the questions are well-posed, even if the metric will need tighter justification.

Referee Report

2 major / 2 minor

Summary. The paper introduces RNG-Bench, a benchmark suite with Matching Pairs and 3D Maze games, to evaluate MLLMs on reconstructing and acting upon past observations in controllable non-Markov environments. It defines three difficulty axes (grid size, visual pattern, observation modality), a head-to-head duel protocol, and a Memory Gap metric intended to separate forgetting from suboptimal action selection. Experiments show frontier models remain far from saturation at scales of ~128K tokens and 350 images per episode, attribute most residual errors to forgetting, and report that fine-tuning Qwen3.5-9B on optimal-policy rollouts improves RNG-Bench performance while transferring to other benchmarks without degrading general multimodal capability.

Significance. If the Memory Gap metric validly isolates reconstruction failure, the benchmark would provide a useful diagnostic tool for memory limitations in closed-loop multimodal agents and a concrete path for improvement via targeted fine-tuning. The controlled axes and duel protocol are methodologically positive for reducing instance variance. The transfer results, if robust, would strengthen the practical value of the work.
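The variance argument for a head-to-head duel can be sketched in a few lines. This is a toy under stated assumptions, not the paper's protocol: `instance_difficulty`, `strong_agent`, `weak_agent`, and `duel` are hypothetical names, and the "agents" are stand-ins whose cost is the instance difficulty plus a fixed offset.

```python
import random


def instance_difficulty(seed):
    """Hypothetical stand-in for a generated game instance: some boards are harder."""
    return random.Random(seed).randint(10, 100)


def strong_agent(seed):
    # Cost in steps on the instance generated from `seed` (lower is better).
    return instance_difficulty(seed)


def weak_agent(seed):
    # Always 5 steps worse than the strong agent on the same instance.
    return instance_difficulty(seed) + 5


def duel(agent_a, agent_b, seeds):
    """Run both agents on identical instances and tally per-instance outcomes."""
    tally = {"a": 0, "b": 0, "tie": 0}
    for seed in seeds:
        cost_a, cost_b = agent_a(seed), agent_b(seed)
        if cost_a < cost_b:
            tally["a"] += 1
        elif cost_b < cost_a:
            tally["b"] += 1
        else:
            tally["tie"] += 1
    return tally
```

In this toy, instance difficulty ranges over 10 to 100 steps while the true gap between agents is 5 steps, so averages taken over different instance sets could easily rank the agents the wrong way. Comparing on identical seeds removes the instance term from each comparison, and `duel(strong_agent, weak_agent, range(50))` attributes every instance to the stronger agent.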

major comments (2)

[Abstract and §3] (Memory Gap metric): The central claim that 'most residual errors stem from forgetting earlier observations rather than from suboptimal decision making' depends on the metric successfully disentangling the two factors. The manuscript states the benchmark is 'designed to isolate' reconstruction, in contrast to benchmarks that 'conflate hidden-state reconstruction with other agent skills', but supplies no derivation, game-theoretic argument, or ablation showing that optimal action sequences exist independently of accurate reconstruction of hidden observations in Matching Pairs or 3D Maze. If every optimal action requires exact recall of card identities or spatial layout, the attribution is circular rather than diagnostic.
[§4] (evaluation results): The statement that the hardest configurations 'remain far from saturated by frontier MLLMs' and the Memory Gap analysis are presented without reported error bars, number of independent runs, or statistical tests on the residual-error breakdown. This makes it impossible to assess whether the reported dominance of forgetting is robust or sensitive to sampling variance.

minor comments (2)

[Abstract] The abstract lists three controlled difficulty axes but does not enumerate the concrete parameter values used for grid size, visual pattern, and modality; these should be tabulated in §3 for reproducibility.
[§3] Notation for the Memory Gap metric (e.g., how the 'gap' is computed from reconstruction-conditioned vs. unconditional policy error) should be formalized with an equation rather than described only in prose.
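One possible shape for such a formalization is sketched below. It is explicitly not the paper's definition, which is not reproduced on this page: the function `memory_gap_breakdown` and its three inputs are hypothetical, and it assumes that each episode can be scored three ways (an optimal policy, the model with the full observation history made visible, and the model relying on its own memory), with higher scores being better.

```python
def memory_gap_breakdown(optimal, with_full_history, own_memory):
    """Split the shortfall from the optimum into two parts (hypothetical scheme).

    optimal           -- score of an optimal policy on the episode
    with_full_history -- model score when past observations stay visible
    own_memory        -- model score when past observations are hidden
    """
    decision = optimal - with_full_history       # lost even with perfect recall
    forgetting = with_full_history - own_memory  # lost only when recall is required
    total = optimal - own_memory
    share = forgetting / total if total else 0.0
    return {
        "decision": decision,
        "forgetting": forgetting,
        "forgetting_share": share,
    }
```

Under this scheme, a claim that most residual errors stem from forgetting would correspond to `forgetting_share` above 0.5 on average. The circularity worry raised in the major comments shows up here as the question of whether `with_full_history` can be measured independently of the model's recall.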

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on RNG-Bench and the Memory Gap metric. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core claims.

Referee: [Abstract and §3] (Memory Gap metric): The central claim that 'most residual errors stem from forgetting earlier observations rather than from suboptimal decision making' depends on the metric successfully disentangling the two factors. The manuscript states the benchmark is 'designed to isolate' reconstruction, in contrast to benchmarks that 'conflate hidden-state reconstruction with other agent skills', but supplies no derivation, game-theoretic argument, or ablation showing that optimal action sequences exist independently of accurate reconstruction of hidden observations in Matching Pairs or 3D Maze. If every optimal action requires exact recall of card identities or spatial layout, the attribution is circular rather than diagnostic.

Authors: We agree that the current manuscript lacks an explicit derivation or ablation demonstrating independence between optimal action selection and full hidden-state reconstruction. The Memory Gap is computed by comparing model performance against an oracle policy that receives the complete observation history, but we did not include a formal argument showing that partial information suffices for some optimal moves in either game. We will add an appendix with (i) a game-theoretic definition of the metric, (ii) a proof sketch that optimal policies in Matching Pairs can exploit partial matches without full recall, and (iii) an ablation that perturbs reconstruction accuracy while holding action selection fixed. This addresses the circularity concern directly. revision: yes
Referee: [§4] (evaluation results): The statement that the hardest configurations 'remain far from saturated by frontier MLLMs' and the Memory Gap analysis are presented without reported error bars, number of independent runs, or statistical tests on the residual-error breakdown. This makes it impossible to assess whether the reported dominance of forgetting is robust or sensitive to sampling variance.

Authors: The referee is correct that the submitted version omits error bars, run counts, and statistical tests for the Memory Gap breakdown. We will revise §4 to report results over five independent random seeds per configuration, include standard-error bars on all bar plots, and add paired t-tests (with p-values) comparing the forgetting versus decision-making error components to establish that the dominance of forgetting is statistically significant rather than an artifact of sampling variance. revision: yes
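The analysis promised in this response (standard errors over seeds and a paired t-test between the two error components) could look roughly like the sketch below. It uses only the Python standard library, the function name `paired_t` is hypothetical, and the numbers in the usage note are placeholders for illustration, not results from the paper.

```python
import math
import statistics


def paired_t(forgetting, decision):
    """Paired t statistic for per-seed error components.

    Both lists hold one value per random seed, in the same seed order.
    Returns the mean difference, its standard error, the t statistic,
    and the degrees of freedom.
    """
    if len(forgetting) != len(decision) or len(forgetting) < 2:
        raise ValueError("need two equal-length lists with at least 2 seeds")
    diffs = [f - d for f, d in zip(forgetting, decision)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(n)  # sample std dev, n - 1
    return mean_diff, std_err, mean_diff / std_err, n - 1
```

With five seeds there are 4 degrees of freedom, so a two-sided test at the 5% level needs |t| above roughly 2.776. For example, `paired_t([0.42, 0.38, 0.45, 0.40, 0.41], [0.10, 0.12, 0.09, 0.11, 0.10])` on those placeholder values gives a mean difference of 0.308. Turning t into a p-value needs the t-distribution CDF, which the standard library does not provide; `scipy.stats.ttest_rel` computes both in one call if SciPy is available.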

Circularity Check

0 steps flagged

No circularity: new benchmark and metric are independently defined

The paper defines RNG-Bench, its two games, the duel protocol, and Memory Gap metric from first principles in the abstract and introduction to isolate reconstruction from other skills. No equations, fitted parameters, or self-citations are shown that reduce the Memory Gap attribution or performance claims to quantities defined by the authors' own prior work or by construction. The central empirical finding (most errors from forgetting) is presented as an outcome of running the benchmark on frontier models rather than a definitional tautology. The derivation chain is therefore self-contained against external model evaluations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The paper introduces a benchmark and metric without additional fitted parameters or unstated mathematical axioms beyond the design goal of isolating reconstruction.

axioms (1)

domain assumption: The proposed games isolate reconstruction ability from other agent skills.
Stated as the design intent in the abstract.

invented entities (2)

RNG-Bench (no independent evidence)
purpose: Benchmark suite for non-Markov reconstruction in MLLMs
Newly introduced in the paper.
Memory Gap metric (no independent evidence)
purpose: Disentangle forgetting from suboptimal action selection
Newly introduced in the paper.

Reference graph

Works this paper leans on

100 extracted references · 20 linked inside Pith

[1]

Memorybench: A benchmark for memory and continual learning in llm systems.arXiv preprint arXiv:2510.17281, 2025

Qingyao Ai, Yichen Tang, Changyue Wang, Jianming Long, Weihang Su, and Yiqun Liu. Memorybench: A benchmark for memory and continual learning in llm systems.arXiv preprint arXiv:2510.17281, 2025

Pith/arXiv arXiv 2025
[2]

Flamingo: a visual language model for few-shot learning.Advances in Neural Information Processing Systems, 35:23716–23736, 2022

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning.Advances in Neural Information Processing Systems, 35:23716–23736, 2022

2022
[3]

L-eval: Instituting standardized evaluation for long context language models

Chenxin An, Shansan Gong, Ming Zhong, Xingjian Zhao, Mukai Li, Jun Zhang, Lingpeng Kong, and Xipeng Qiu. L-eval: Instituting standardized evaluation for long context language models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14388–14411, 2024

2024
[4]

Longbench: A bilingual, multitask benchmark for long context understanding

Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. Longbench: A bilingual, multitask benchmark for long context understanding. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages ...

2024
[5]

Longbench v2: Towards deeper understanding and reasoning on realistic long-context multitasks

Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. Longbench v2: Towards deeper understanding and reasoning on realistic long-context multitasks. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), page...

2025
[6]

Ms marco: A human generated machine reading comprehension dataset.arXiv preprint arXiv:1611.09268, 2016

Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, et al. Ms marco: A human generated machine reading comprehension dataset.arXiv preprint arXiv:1611.09268, 2016

Pith/arXiv arXiv 2016
[7]

Seed-2.0-Lite-260428: Omni-modal understanding across video, image, audio, and text, April 2026

ByteDance Seed. Seed-2.0-Lite-260428: Omni-modal understanding across video, image, audio, and text, April 2026. URLhttps://seed.bytedance.com/en/seed2

2026
[8]

clembench: Using game play to evaluate chat-optimized language models as conversational agents

Kranti Chalamalasetti, Jana Götze, Sherzod Hakimov, Brielen Madureira, Philipp Sadler, and David Schlangen. clembench: Using game play to evaluate chat-optimized language models as conversational agents. InProceedings of the 2023 conference on empirical methods in natural language processing, pages 11174–11219, 2023

2023
[9]

SpatialVLM: Endowing vision-language models with spatial reasoning capabilities

Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. SpatialVLM: Endowing vision-language models with spatial reasoning capabilities. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

2024
[10]

ShareGPT4Video: Improving video understanding and generation with better captions.Advances in Neural Information Processing Systems, 37:19472–19495, 2024

Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Bin Lin, Zhenyu Tang, Li Yuan, Yu Qiao, Dahua Lin, Feng Zhao, and Jiaqi Wang. ShareGPT4Video: Improving video understanding and generation with better captions.Advances in Neural Information Processing Systems, 37:19472–19495, 2024. 10 Beyond the Current Observ...

2024
[11]

Iwr-bench: Can lvlms reconstruct interactive webpage from a user interaction video?arXiv preprint arXiv:2509.24709, 2025

Yang Chen, Minghao Liu, Yufan Shen, Yunwen Li, Tianyuan Huang, Xinyu Fang, Tianyu Zheng, Wenxuan Huang, Cheng Yang, Daocheng Fu, et al. Iwr-bench: Can lvlms reconstruct interactive webpage from a user interaction video?arXiv preprint arXiv:2509.24709, 2025

arXiv 2025
[12]

Gamebench: Evaluating strategic reasoning abilities of llm agents.arXiv preprint arXiv:2406.06613, 2024

Anthony Costarelli, Mat Allen, Roman Hauksson, Grace Sodunke, Suhas Hariharan, Carlson Cheng, Wenjie Li, and Arjun Yadav. Gamebench: Evaluating strategic reasoning abilities of llm agents.arXiv preprint arXiv:2406.06613, 2024

arXiv 2024
[13]

EndoCoT: Scaling endogenous chain-of-thought reasoning in diffusion models.arXiv preprint arXiv:2603.12252, 2026

Xuanlang Dai, Yujie Zhou, Long Xing, Jiazi Bu, Xilin Wei, Yuhong Liu, Beichen Zhang, Kai Chen, and Yuhang Zang. EndoCoT: Scaling endogenous chain-of-thought reasoning in diffusion models.arXiv preprint arXiv:2603.12252, 2026

Pith/arXiv arXiv 2026
[14]

Mind2web: Towards a generalist agent for the web.Advances in Neural Information Processing Systems, 36, 2023

Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web.Advances in Neural Information Processing Systems, 36, 2023

2023
[15]

MM-IFEngine: Towards multimodal instruction following

Shengyuan Ding, Shenxi Wu, Xiangyu Zhao, Yuhang Zang, Haodong Duan, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Dahua Lin, and Jiaqi Wang. MM-IFEngine: Towards multimodal instruction following. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025

2025
[16]

ARM-Thinker: Reinforcing multimodal generative reward models with agentic tool use and visual reasoning

Shengyuan Ding, Xinyu Fang, Ziyu Liu, Yuhang Zang, Yuhang Cao, Xiangyu Zhao, Haodong Duan, Xiaoyi Dong, Jiaqi Liang, et al. ARM-Thinker: Reinforcing multimodal generative reward models with agentic tool use and visual reasoning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026

2026
[17]

WildClawBench: A benchmark for real-world, long-horizon agent evaluation.arXiv preprint arXiv:2605.10912, 2026

Shuangrui Ding, Xuanlang Dai, Long Xing, Shengyuan Ding, Ziyu Liu, Yang JingYi, Penghui Yang, Zhixiong Zhang, Xilin Wei, Xinyu Fang, et al. WildClawBench: A benchmark for real-world, long-horizon agent evaluation.arXiv preprint arXiv:2605.10912, 2026

Pith/arXiv arXiv 2026
[18]

InternLM- XComposer2: Mastering free-form text-image composition and comprehension in vision-language large model.arXiv preprint arXiv:2401.16420, 2024

Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, HaodongDuan, MaosongCao, WenweiZhang, YiningLi, HangYan, YangGao, XinyueZhang, Wei Li, Jingwen Li, Kai Chen, Conghui He, Xingcheng Zhang, Yu Qiao, Dahua Lin, and Jiaqi Wang. InternLM- XComposer2: Mastering free-form text-image composition and comprehension ...

Pith/arXiv arXiv 2024
[19]

VLMEvalKit: An open-source toolkit for evaluating large multi-modality models

HaodongDuan, XinyuFang, JunmingYang, XiangyuZhao, YuxuanQiao, MoLi, AmitAgarwal, ZheChen, Lin Chen, Yuan Liu, Yubo Ma, Hailong Sun, Yifan Zhang, Shiyin Lu, Tack Hwa Wong, Weiyun Wang, Peiheng Zhou, Xiaozhe Li, Chaoyou Fu, Junbo Cui, Jixuan Chen, Enxin Song, Song Mao, Shengyuan Ding, Tianhao Liang, Zicheng Zhang, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi ...

2024
[20]

GTBench: Uncovering the strategic reasoning limitations of LLMs via game-theoretic evaluations.arXiv preprint arXiv:2402.12348, 2024

Jinhao Duan, Renming Zhang, James Diffenderfer, Bhavya Kailkhura, Lichao Sun, Elias Stengel-Eskin, Mohit Bansal, Tianlong Chen, and Kaidi Xu. GTBench: Uncovering the strategic reasoning limitations of LLMs via game-theoretic evaluations.arXiv preprint arXiv:2402.12348, 2024

arXiv 2024
[21]

Mmbench- video: Along-formmulti-shotbenchmarkforholisticvideounderstanding.AdvancesinNeuralInformation Processing Systems, 37:89098–89124, 2024

Xinyu Fang, Kangrui Mao, Haodong Duan, Xiangyu Zhao, Yining Li, Dahua Lin, and Kai Chen. Mmbench- video: Along-formmulti-shotbenchmarkforholisticvideounderstanding.AdvancesinNeuralInformation Processing Systems, 37:89098–89124, 2024

2024
[22]

Creation-mmbench: Assessing context-aware creative intelligence in mllms

Xinyu Fang, Zhijian Chen, Kai Lan, Lixin Ma, Shengyuan Ding, Yingji Liang, Xiangyu Zhao, Farong Wen, Zicheng Zhang, Guofeng Zhang, Haodong Duan, Kai Chen, and Dahua Lin. Creation-mmbench: Assessing context-aware creative intelligence in mllms. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 447–456, October 2025

2025
[23]

Gemini 3.1 Pro model card, February 2026

Google DeepMind. Gemini 3.1 Pro model card, February 2026. URLhttps://deepmind.google/ models/model-cards/gemini-3-1-pro/

2026
[24]

TextArena.arXiv preprint arXiv:2504.11442, 2025

Leon Guertler, Bobby Cheng, Simon Yu, Bo Liu, Leshem Choshen, and Cheston Tan. TextArena.arXiv preprint arXiv:2504.11442, 2025. 11 Beyond the Current Observation: Evaluating Multimodal Large Language Models in Controllable Non-Markov Games

arXiv 2025
[25]

DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

Daya Guo, Dejian Yang, Haowei Zhang, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

Pith/arXiv arXiv 2025
[26]

Ruler: What’s the real context size of your long-context language models?arXiv preprint arXiv:2404.06654, 2024

Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. Ruler: What’s the real context size of your long-context language models?arXiv preprint arXiv:2404.06654, 2024

Pith/arXiv arXiv 2024
[27]

lmgame-bench: How good are llms at playing games?arXiv preprint arXiv:2505.15146, 2025

Lanxiang Hu, Mingjia Huo, Yuxuan Zhang, Haoyang Yu, Eric P Xing, Ion Stoica, Tajana Rosing, Hao- jian Jin, and Hao Zhang. lmgame-bench: How good are llms at playing games?arXiv preprint arXiv:2505.15146, 2025

arXiv 2025
[28]

Evaluating memory in llm agents via incremental multi-turn interactions.arXiv preprint arXiv:2507.05257, 2025

Yuanzhe Hu, Yu Wang, and Julian McAuley. Evaluating memory in llm agents via incremental multi-turn interactions.arXiv preprint arXiv:2507.05257, 2025

Pith/arXiv arXiv 2025
[29]

Smith, and Ranjay Krishna

Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A. Smith, and Ranjay Krishna. Visual sketchpad: Sketching as a visual chain of thought for multimodal language models. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

2024
[30]

Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension

Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. InProceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, 2017

2017
[31]

Littman, and Anthony R

Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra. Planning and acting in partially observable stochastic domains.Artificial Intelligence, 101(1–2):99–134, 1998

1998
[32]

Large language models are zero-shot reasoners

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. InAdvances in Neural Information Processing Systems, volume 35, pages 22199–22213, 2022

2022
[33]

M4le: A multi-ability multi-range multi-task multi-domain long-context evaluation benchmark for large language models

Wai-Chung Kwan, Xingshan Zeng, Yufei Wang, Yusen Sun, Liangyou Li, Yuxin Jiang, Lifeng Shang, Qun Liu, and Kam-Fai Wong. M4le: A multi-ability multi-range multi-task multi-domain long-context evaluation benchmark for large language models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pa...

2024
[34]

Natural questions: A benchmark for question answering research.Transactions of the Association for Computational Linguistics, 7:453–466, 2019

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, DanielleEpstein, IlliaPolosukhin, MatthewKelcey, JacobDevlin, KentonLee, KristinaN.Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural questions: A benchmark for question answering research.Transactions of...

2019
[35]

Beyond fixed: Training- free variable-length denoising for diffusion large language models.arXiv preprint arXiv:2508.00819, 2025

Jinsong Li, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Jiaqi Wang, and Dahua Lin. Beyond fixed: Training- free variable-length denoising for diffusion large language models.arXiv preprint arXiv:2508.00819, 2025

arXiv 2025
[36]

Visual self-refine: A pixel-guided paradigm for accurate chart parsing.arXiv preprint arXiv:2602.16455, 2026

Jinsong Li, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Jiaqi Wang, and Dahua Lin. Visual self-refine: A pixel-guided paradigm for accurate chart parsing.arXiv preprint arXiv:2602.16455, 2026

arXiv 2026
[37]

VideoChat: Chat-centric video understanding.arXiv preprint arXiv:2305.06355, 2023

KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. VideoChat: Chat-centric video understanding.arXiv preprint arXiv:2305.06355, 2023

Pith/arXiv arXiv 2023
[38]

OPT-Bench: Evaluating LLM agent on large-scale search spaces optimization problems.arXiv preprint arXiv:2506.10764, 2025

Xiaozhe Li, Jixuan Chen, Xinyu Fang, Shengyuan Ding, Haodong Duan, Qingwen Liu, and Kai Chen. OPT-Bench: Evaluating LLM agent on large-scale search spaces optimization problems.arXiv preprint arXiv:2506.10764, 2025

arXiv 2025
[39]

NP-Engine: Empowering optimization reasoning in large language models with verifiable synthetic NP problems.arXiv preprint arXiv:2510.16476, 2025

Xiaozhe Li, Xinyu Fang, Shengyuan Ding, Linyang Li, Haodong Duan, Qingwen Liu, and Kai Chen. NP-Engine: Empowering optimization reasoning in large language models with verifiable synthetic NP problems.arXiv preprint arXiv:2510.16476, 2025

arXiv 2025
[40]

Emembench: Interactive benchmarking of episodic memory for vlm agents.arXiv preprint arXiv:2601.16690, 2026

Xinze Li, Ziyue Zhu, Siyuan Liu, Yubo Ma, Yuhang Zang, Yixin Cao, and Aixin Sun. Emembench: Interactive benchmarking of episodic memory for vlm agents.arXiv preprint arXiv:2601.16690, 2026. 12 Beyond the Current Observation: Evaluating Multimodal Large Language Models in Controllable Non-Markov Games

arXiv 2026
[41]

AvalonBench: Evaluating LLMs playing the game of Avalon.arXiv preprint arXiv:2310.05036, 2023

Jonathan Light, Min Cai, Sheng Shen, and Ziniu Hu. AvalonBench: Evaluating LLMs playing the game of Avalon.arXiv preprint arXiv:2310.05036, 2023

arXiv 2023
[42]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. InAdvances in Neural Information Processing Systems, volume 36, 2023

2023
[43]

Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang

Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts.Transactions of the Association for Computational Linguistics, 12:157–173, 2024

2024
[44]

Agentbench: Evaluating llms as agents

Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. Agentbench: Evaluating llms as agents. InInternational Conference on Learning Representations, 2024

2024
[45]

Spatial-SSRL: Enhancing spatial understanding via self-supervised reinforcement learning

Yuhong Liu, Beichen Zhang, Yuhang Zang, Yuhang Cao, Long Xing, Xiaoyi Dong, Haodong Duan, Dahua Lin, and Jiaqi Wang. Spatial-SSRL: Enhancing spatial understanding via self-supervised reinforcement learning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026

2026
[46]

STAR-Bench: Probing deep spatio- temporal reasoning as audio 4d intelligence

Zihan Liu, Zhikang Niu, Qiuyang Xiao, Zhisheng Zheng, Ruoqi Yuan, Yuhang Zang, Yuhang Cao, Xiaoyi Dong, Jianze Liang, Xie Chen, Leilei Sun, Dahua Lin, and Jiaqi Wang. STAR-Bench: Probing deep spatio- temporal reasoning as audio 4d intelligence. InInternational Conference on Learning Representations (ICLR), 2026

2026
[47]

MMDU: A multi-turn multi-image dialog understanding benchmark and instruction-tuning dataset for LVLMs

Ziyu Liu, Tao Chu, Yuhang Zang, Xilin Wei, Xiaoyi Dong, Pan Zhang, Zijian Liang, Yuanjun Xiong, Yu Qiao, Dahua Lin, and Jiaqi Wang. MMDU: A multi-turn multi-image dialog understanding benchmark and instruction-tuning dataset for LVLMs. Advances in Neural Information Processing Systems, 37:8698–8733, 2024

2024
[48]

Visual-RFT: Visual reinforcement fine-tuning

Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-RFT: Visual reinforcement fine-tuning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2034–2044, 2025

2025
[49]

SPARK: Synergistic policy and reward co-evolving framework

Ziyu Liu, Yuhang Zang, Shengyuan Ding, Yuhang Cao, Xiaoyi Dong, Haodong Duan, Dahua Lin, and Jiaqi Wang. SPARK: Synergistic policy and reward co-evolving framework. arXiv preprint arXiv:2509.22624, 2025

arXiv 2025
[50]

MIA-DPO: Multi-image augmented direct preference optimization for large vision-language models

Ziyu Liu, Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Haodong Duan, Conghui He, Yuanjun Xiong, Dahua Lin, and Jiaqi Wang. MIA-DPO: Multi-image augmented direct preference optimization for large vision-language models. In International Conference on Learning Representations (ICLR), 2025

2025
[51]

Visual-ERM: Reward modeling for visual equivalence

Ziyu Liu, Shengyuan Ding, Xinyu Fang, Xuanlang Dai, Penghui Yang, Jianze Liang, Jiaqi Wang, Kai Chen, Dahua Lin, and Yuhang Zang. Visual-ERM: Reward modeling for visual equivalence. arXiv preprint arXiv:2603.13224, 2026

Pith/arXiv arXiv 2026
[52]

Mmlongbench-doc: Benchmarking long-context document understanding with visualizations

Yubo Ma, Yuhang Zang, Liangyu Chen, Meiqi Chen, Yizhu Jiao, Xinze Li, Xinyuan Lu, Ziyu Liu, Yan Ma, Xiaoyi Dong, et al. Mmlongbench-doc: Benchmarking long-context document understanding with visualizations. Advances in Neural Information Processing Systems, 37:95963–96010, 2024

2024
[53]

Evaluating very long-term conversational memory of llm agents

Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. Evaluating very long-term conversational memory of llm agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13851–13870, 2024

2024
[54]

Nolima: Long-context evaluation beyond literal matching

Ali Modarressi, Hanieh Deilamsalehy, Franck Dernoncourt, Trung Bui, Ryan A. Rossi, Seunghyun Yoon, and Hinrich Schütze. Nolima: Long-context evaluation beyond literal matching. In International Conference on Machine Learning, 2025

2025
[55]

Large language diffusion models

Shen Nie, Fengqi Zhu, Zebin You, et al. Large language diffusion models. arXiv preprint arXiv:2502.09992, 2025

Pith/arXiv arXiv 2025
[56]

Show your work: Scratchpads for intermediate computation with language models

Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, Charles Sutton, and Augustus Odena. Show your work: Scratchpads for intermediate computation with language models. arXiv preprint arXiv:2112.00114, 2021

Pith/arXiv arXiv 2021
[57]

Introducing gpt-5.4, March 2026

OpenAI. Introducing gpt-5.4, March 2026. URL https://openai.com/index/introducing-gpt-5-4/

2026
[58]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems (NeurIPS), 2022

2022
[59]

Game-world: Towards standardized and verifiable evaluation of multimodal game agents

Mingyu Ouyang, Siyuan Hu, Kevin Qinghong Lin, Hwee Tou Ng, and Mike Zheng Shou. Game-world: Towards standardized and verifiable evaluation of multimodal game agents. arXiv preprint arXiv:2604.07429, 2026

Pith/arXiv arXiv 2026
[60]

Balrog: Benchmarking agentic llm and vlm reasoning on games

Davide Paglieri, Bartlomiej Cupial, Samuel Coward, Ulyana Piterbarg, Maciej Wolczyk, Akbir Khan, Eduardo Pignatelli, Lukasz Kucinski, Lerrel Pinto, Rob Fergus, Jakob Nicolaus Foerster, Jack Parker-Holder, and Tim Rocktäschel. Balrog: Benchmarking agentic llm and vlm reasoning on games. arXiv preprint arXiv:2411.13543, 2024

arXiv 2024
[61]

Do the rewards justify the means? Measuring trade-offs between rewards and ethical behavior in the MACHIAVELLI benchmark

Alexander Pan, Jun Shern Chan, Andy Zou, Nathaniel Li, Steven Basart, Thomas Woodside, Jonathan Ng, Hanlin Zhang, Scott Emmons, and Dan Hendrycks. Do the rewards justify the means? Measuring trade-offs between rewards and ethical behavior in the MACHIAVELLI benchmark. In International Conference on Machine Learning, 2023

2023
[62]

Kilt: a benchmark for knowledge intensive language tasks

Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick Lewis, Majid Yazdani, Nicola De Cao, James Thorne, Yacine Jernite, Vladimir Karpukhin, Jean Maillard, Vassilis Plachouras, Tim Rocktäschel, and Sebastian Riedel. Kilt: a benchmark for knowledge intensive language tasks. In Proceedings of the 2021 Conference of the North American Chapter of the Associati...

2021
[63]

Qwen3.5: Towards native multimodal agents, February 2026

Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https://qwen.ai/blog?id=qwen3.5

2026
[64]

Direct preference optimization: Your language model is secretly a reward model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Advances in Neural Information Processing Systems (NeurIPS), 2023

2023
[65]

Vgrp-bench: Visual grid reasoning puzzle benchmark for large vision-language models

Yufan Ren, Konstantinos Tertikas, Shalini Maiti, Junlin Han, Tong Zhang, Sabine Süsstrunk, and Filippos Kokkinos. Vgrp-bench: Visual grid reasoning puzzle benchmark for large vision-language models. arXiv preprint arXiv:2503.23064, 2025

arXiv 2025
[66]

Toolformer: Language models can teach themselves to use tools

Timo Schick, Jane Dwivedi-Yu, Roberto Dessi, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36, 2023

2023
[67]

Evaluating memory structure in llm agents

Alina Shutova, Alexandra Olenina, Ivan Vinogradov, and Anton Sinitsin. Evaluating memory structure in llm agents. arXiv preprint arXiv:2602.11243, 2026

Pith/arXiv arXiv 2026
[68]

A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play

David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and Demis Hassabis. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419):1140–1144, 2018

2018
[69]

SEAgent: Self-evolving computer use agent with autonomous learning from experience

Zeyi Sun, Ziyu Liu, Yuhang Zang, Yuhang Cao, Xiaoyi Dong, Tong Wu, Dahua Lin, and Jiaqi Wang. SEAgent: Self-evolving computer use agent with autonomous learning from experience. arXiv preprint arXiv:2508.04700, 2025

arXiv 2025
[70]

Membench: Towards more comprehensive evaluation on the memory of llm-based agents

Haoran Tan, Zeyu Zhang, Chen Ma, Xu Chen, Quanyu Dai, and Zhenhua Dong. Membench: Towards more comprehensive evaluation on the memory of llm-based agents. In Findings of the Association for Computational Linguistics: ACL 2025, pages 19336–19352, 2025

2025
[71]

Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, S. H. Cai, Yuan Cao, Y. Charles, H. S. Che, Cheng Chen, Guanduo Chen, Huarong Chen, Jia Chen, Jiahao Chen, Jianlong Chen, Jun Chen, Kefan Chen, Liang Chen, Ruijue Chen, Xinhao Chen, Yanru Chen, Yanxu Chen, Yicun Chen, Yimin Chen, Yingjiang Chen, Yuankun Chen, Yujie Chen, Yutian Chen, Zhirong Chen, Ziwei Chen...

Pith/arXiv arXiv 2026
[72]

Beir: A heterogeneous benchmark for zero-shot evaluation of information retrieval models

Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. Beir: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, 2021

2021
[73]

Ada-leval: Evaluating long-context llms with length-adaptable benchmarks

Chonghua Wang, Haodong Duan, Songyang Zhang, Dahua Lin, and Kai Chen. Ada-leval: Evaluating long-context llms with length-adaptable benchmarks. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 3712–3724, 2024

2024
[74]

Multimodal needle in a haystack: Benchmarking long-context capability of multimodal large language models

Hengyi Wang, Haizhou Shi, Shiwei Tan, Weiyi Qin, Wenyuan Wang, Tunyu Zhang, Akshay Nambi, Tanuja Ganu, and Hao Wang. Multimodal needle in a haystack: Benchmarking long-context capability of multimodal large language models. In Proceedings of the 2025 Conference of the Nations of the Americas...

2025
[75]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, volume 35, pages 24824–24837, 2022

2022
[76]

VideoRoPE: What makes for good video rotary position embedding?

Xilin Wei, Xiaoran Liu, Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Jian Tong, Haodong Duan, Qipeng Guo, Jiaqi Wang, Xipeng Qiu, and Dahua Lin. VideoRoPE: What makes for good video rotary position embedding? In Proceedings of the 42nd International Conference on Machine Learning, pages 66118–66136, 2025

2025
[77]

SIM-CoT: Supervised implicit chain-of-thought

Xilin Wei, Xiaoran Liu, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Jiaqi Wang, Xipeng Qiu, and Dahua Lin. SIM-CoT: Supervised implicit chain-of-thought. In International Conference on Learning Representations, 2026

2026
[78]

Longmemeval: Benchmarking chat assistants on long-term interactive memory

Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu. Longmemeval: Benchmarking chat assistants on long-term interactive memory. arXiv preprint arXiv:2410.10813, 2024

Pith/arXiv arXiv 2024
[79]

LongVideoBench: A benchmark for long-context interleaved video-language understanding

Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. LongVideoBench: A benchmark for long-context interleaved video-language understanding. Advances in Neural Information Processing Systems, 37:28828–28857, 2024

2024
[80]

SmartPlay: A benchmark for LLMs as intelligent agents

Yue Wu, Xuan Tang, Tom M. Mitchell, and Yuanzhi Li. SmartPlay: A benchmark for LLMs as intelligent agents. In International Conference on Learning Representations, 2024

2024


[1]

Memorybench: A benchmark for memory and continual learning in llm systems

Qingyao Ai, Yichen Tang, Changyue Wang, Jianming Long, Weihang Su, and Yiqun Liu. Memorybench: A benchmark for memory and continual learning in llm systems. arXiv preprint arXiv:2510.17281, 2025

Pith/arXiv arXiv 2025

[2]

Flamingo: a visual language model for few-shot learning

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022

2022

[3]

L-eval: Instituting standardized evaluation for long context language models

Chenxin An, Shansan Gong, Ming Zhong, Xingjian Zhao, Mukai Li, Jun Zhang, Lingpeng Kong, and Xipeng Qiu. L-eval: Instituting standardized evaluation for long context language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14388–14411, 2024

2024

[4]

Longbench: A bilingual, multitask benchmark for long context understanding

Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. Longbench: A bilingual, multitask benchmark for long context understanding. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages ...

2024

[5]

Longbench v2: Towards deeper understanding and reasoning on realistic long-context multitasks

Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. Longbench v2: Towards deeper understanding and reasoning on realistic long-context multitasks. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), page...

2025

[6]

Ms marco: A human generated machine reading comprehension dataset

Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, et al. Ms marco: A human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268, 2016

Pith/arXiv arXiv 2016

[7]

Seed-2.0-Lite-260428: Omni-modal understanding across video, image, audio, and text, April 2026

ByteDance Seed. Seed-2.0-Lite-260428: Omni-modal understanding across video, image, audio, and text, April 2026. URL https://seed.bytedance.com/en/seed2

2026

[8]

clembench: Using game play to evaluate chat-optimized language models as conversational agents

Kranti Chalamalasetti, Jana Götze, Sherzod Hakimov, Brielen Madureira, Philipp Sadler, and David Schlangen. clembench: Using game play to evaluate chat-optimized language models as conversational agents. In Proceedings of the 2023 conference on empirical methods in natural language processing, pages 11174–11219, 2023

2023

[9]

SpatialVLM: Endowing vision-language models with spatial reasoning capabilities

Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. SpatialVLM: Endowing vision-language models with spatial reasoning capabilities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

2024

[10]

ShareGPT4Video: Improving video understanding and generation with better captions

Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Bin Lin, Zhenyu Tang, Li Yuan, Yu Qiao, Dahua Lin, Feng Zhao, and Jiaqi Wang. ShareGPT4Video: Improving video understanding and generation with better captions. Advances in Neural Information Processing Systems, 37:19472–19495, 2024

2024

[11]

Iwr-bench: Can lvlms reconstruct interactive webpage from a user interaction video?

Yang Chen, Minghao Liu, Yufan Shen, Yunwen Li, Tianyuan Huang, Xinyu Fang, Tianyu Zheng, Wenxuan Huang, Cheng Yang, Daocheng Fu, et al. Iwr-bench: Can lvlms reconstruct interactive webpage from a user interaction video? arXiv preprint arXiv:2509.24709, 2025

arXiv 2025

[12]

Gamebench: Evaluating strategic reasoning abilities of llm agents

Anthony Costarelli, Mat Allen, Roman Hauksson, Grace Sodunke, Suhas Hariharan, Carlson Cheng, Wenjie Li, and Arjun Yadav. Gamebench: Evaluating strategic reasoning abilities of llm agents. arXiv preprint arXiv:2406.06613, 2024

arXiv 2024

[13]

EndoCoT: Scaling endogenous chain-of-thought reasoning in diffusion models

Xuanlang Dai, Yujie Zhou, Long Xing, Jiazi Bu, Xilin Wei, Yuhong Liu, Beichen Zhang, Kai Chen, and Yuhang Zang. EndoCoT: Scaling endogenous chain-of-thought reasoning in diffusion models. arXiv preprint arXiv:2603.12252, 2026

Pith/arXiv arXiv 2026

[14]

Mind2web: Towards a generalist agent for the web

Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems, 36, 2023

2023

[15]

MM-IFEngine: Towards multimodal instruction following

Shengyuan Ding, Shenxi Wu, Xiangyu Zhao, Yuhang Zang, Haodong Duan, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Dahua Lin, and Jiaqi Wang. MM-IFEngine: Towards multimodal instruction following. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025

2025

[16]

ARM-Thinker: Reinforcing multimodal generative reward models with agentic tool use and visual reasoning

Shengyuan Ding, Xinyu Fang, Ziyu Liu, Yuhang Zang, Yuhang Cao, Xiangyu Zhao, Haodong Duan, Xiaoyi Dong, Jiaqi Liang, et al. ARM-Thinker: Reinforcing multimodal generative reward models with agentic tool use and visual reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026

2026

[17]

WildClawBench: A benchmark for real-world, long-horizon agent evaluation

Shuangrui Ding, Xuanlang Dai, Long Xing, Shengyuan Ding, Ziyu Liu, Yang JingYi, Penghui Yang, Zhixiong Zhang, Xilin Wei, Xinyu Fang, et al. WildClawBench: A benchmark for real-world, long-horizon agent evaluation. arXiv preprint arXiv:2605.10912, 2026

Pith/arXiv arXiv 2026

[18]

InternLM-XComposer2: Mastering free-form text-image composition and comprehension in vision-language large model

Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao, Wenwei Zhang, Yining Li, Hang Yan, Yang Gao, Xinyue Zhang, Wei Li, Jingwen Li, Kai Chen, Conghui He, Xingcheng Zhang, Yu Qiao, Dahua Lin, and Jiaqi Wang. InternLM-XComposer2: Mastering free-form text-image composition and comprehension ...

Pith/arXiv arXiv 2024

[19]

VLMEvalKit: An open-source toolkit for evaluating large multi-modality models

Haodong Duan, Xinyu Fang, Junming Yang, Xiangyu Zhao, Yuxuan Qiao, Mo Li, Amit Agarwal, Zhe Chen, Lin Chen, Yuan Liu, Yubo Ma, Hailong Sun, Yifan Zhang, Shiyin Lu, Tack Hwa Wong, Weiyun Wang, Peiheng Zhou, Xiaozhe Li, Chaoyou Fu, Junbo Cui, Jixuan Chen, Enxin Song, Song Mao, Shengyuan Ding, Tianhao Liang, Zicheng Zhang, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi ...

2024

[20]

GTBench: Uncovering the strategic reasoning limitations of LLMs via game-theoretic evaluations

Jinhao Duan, Renming Zhang, James Diffenderfer, Bhavya Kailkhura, Lichao Sun, Elias Stengel-Eskin, Mohit Bansal, Tianlong Chen, and Kaidi Xu. GTBench: Uncovering the strategic reasoning limitations of LLMs via game-theoretic evaluations. arXiv preprint arXiv:2402.12348, 2024

arXiv 2024

[21]

Mmbench-video: A long-form multi-shot benchmark for holistic video understanding

Xinyu Fang, Kangrui Mao, Haodong Duan, Xiangyu Zhao, Yining Li, Dahua Lin, and Kai Chen. Mmbench-video: A long-form multi-shot benchmark for holistic video understanding. Advances in Neural Information Processing Systems, 37:89098–89124, 2024

2024

[22]

Creation-mmbench: Assessing context-aware creative intelligence in mllms

Xinyu Fang, Zhijian Chen, Kai Lan, Lixin Ma, Shengyuan Ding, Yingji Liang, Xiangyu Zhao, Farong Wen, Zicheng Zhang, Guofeng Zhang, Haodong Duan, Kai Chen, and Dahua Lin. Creation-mmbench: Assessing context-aware creative intelligence in mllms. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 447–456, October 2025

2025

[23]

Gemini 3.1 Pro model card, February 2026

Google DeepMind. Gemini 3.1 Pro model card, February 2026. URL https://deepmind.google/models/model-cards/gemini-3-1-pro/

2026

[24]

TextArena

Leon Guertler, Bobby Cheng, Simon Yu, Bo Liu, Leshem Choshen, and Cheston Tan. TextArena. arXiv preprint arXiv:2504.11442, 2025

arXiv 2025

[25]

DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning

Daya Guo, Dejian Yang, Haowei Zhang, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

Pith/arXiv arXiv 2025

[26]

Ruler: What’s the real context size of your long-context language models?

Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. Ruler: What’s the real context size of your long-context language models? arXiv preprint arXiv:2404.06654, 2024

Pith/arXiv arXiv 2024

[27]

lmgame-bench: How good are llms at playing games?

Lanxiang Hu, Mingjia Huo, Yuxuan Zhang, Haoyang Yu, Eric P Xing, Ion Stoica, Tajana Rosing, Haojian Jin, and Hao Zhang. lmgame-bench: How good are llms at playing games? arXiv preprint arXiv:2505.15146, 2025

arXiv 2025

[28]

Evaluating memory in llm agents via incremental multi-turn interactions

Yuanzhe Hu, Yu Wang, and Julian McAuley. Evaluating memory in llm agents via incremental multi-turn interactions. arXiv preprint arXiv:2507.05257, 2025

Pith/arXiv arXiv 2025

[29]

Visual sketchpad: Sketching as a visual chain of thought for multimodal language models

Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A. Smith, and Ranjay Krishna. Visual sketchpad: Sketching as a visual chain of thought for multimodal language models. In Advances in Neural Information Processing Systems (NeurIPS), 2024

2024

[30]

Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension

Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, 2017

2017

[31]

Planning and acting in partially observable stochastic domains

Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1–2):99–134, 1998

1998

[32]

Large language models are zero-shot reasoners

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. In Advances in Neural Information Processing Systems, volume 35, pages 22199–22213, 2022

2022

[33]

M4le: A multi-ability multi-range multi-task multi-domain long-context evaluation benchmark for large language models

Wai-Chung Kwan, Xingshan Zeng, Yufei Wang, Yusen Sun, Liangyou Li, Yuxin Jiang, Lifeng Shang, Qun Liu, and Kam-Fai Wong. M4le: A multi-ability multi-range multi-task multi-domain long-context evaluation benchmark for large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pa...

2024

[34]

Natural questions: A benchmark for question answering research

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural questions: A benchmark for question answering research. Transactions of...

2019

[35]

Beyond fixed: Training-free variable-length denoising for diffusion large language models

Jinsong Li, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Jiaqi Wang, and Dahua Lin. Beyond fixed: Training-free variable-length denoising for diffusion large language models. arXiv preprint arXiv:2508.00819, 2025

arXiv 2025

[36]

Visual self-refine: A pixel-guided paradigm for accurate chart parsing

Jinsong Li, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Jiaqi Wang, and Dahua Lin. Visual self-refine: A pixel-guided paradigm for accurate chart parsing. arXiv preprint arXiv:2602.16455, 2026

arXiv 2026

[37]

VideoChat: Chat-centric video understanding

KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. VideoChat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355, 2023

Pith/arXiv arXiv 2023

[38]

OPT-Bench: Evaluating LLM agent on large-scale search spaces optimization problems

Xiaozhe Li, Jixuan Chen, Xinyu Fang, Shengyuan Ding, Haodong Duan, Qingwen Liu, and Kai Chen. OPT-Bench: Evaluating LLM agent on large-scale search spaces optimization problems. arXiv preprint arXiv:2506.10764, 2025

arXiv 2025

[39]

NP-Engine: Empowering optimization reasoning in large language models with verifiable synthetic NP problems

Xiaozhe Li, Xinyu Fang, Shengyuan Ding, Linyang Li, Haodong Duan, Qingwen Liu, and Kai Chen. NP-Engine: Empowering optimization reasoning in large language models with verifiable synthetic NP problems. arXiv preprint arXiv:2510.16476, 2025

arXiv 2025

[40]

Emembench: Interactive benchmarking of episodic memory for vlm agents

Xinze Li, Ziyue Zhu, Siyuan Liu, Yubo Ma, Yuhang Zang, Yixin Cao, and Aixin Sun. Emembench: Interactive benchmarking of episodic memory for vlm agents. arXiv preprint arXiv:2601.16690, 2026

arXiv 2026

[41] [41]

AvalonBench: Evaluating LLMs playing the game of Avalon.arXiv preprint arXiv:2310.05036, 2023

Jonathan Light, Min Cai, Sheng Shen, and Ziniu Hu. AvalonBench: Evaluating LLMs playing the game of Avalon.arXiv preprint arXiv:2310.05036, 2023

arXiv 2023

[42] [42]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. InAdvances in Neural Information Processing Systems, volume 36, 2023

2023

[43] [43]

Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang

Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts.Transactions of the Association for Computational Linguistics, 12:157–173, 2024

2024

[44] [44]

Agentbench: Evaluating llms as agents

Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. Agentbench: Evaluating llms as agents. InInternational Conference on Learning Representations, 2024

2024

[45] [45]

Spatial-SSRL: Enhancing spatial understanding via self-supervised reinforcement learning

Yuhong Liu, Beichen Zhang, Yuhang Zang, Yuhang Cao, Long Xing, Xiaoyi Dong, Haodong Duan, Dahua Lin, and Jiaqi Wang. Spatial-SSRL: Enhancing spatial understanding via self-supervised reinforcement learning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026

2026

[46] [46]

STAR-Bench: Probing deep spatio- temporal reasoning as audio 4d intelligence

Zihan Liu, Zhikang Niu, Qiuyang Xiao, Zhisheng Zheng, Ruoqi Yuan, Yuhang Zang, Yuhang Cao, Xiaoyi Dong, Jianze Liang, Xie Chen, Leilei Sun, Dahua Lin, and Jiaqi Wang. STAR-Bench: Probing deep spatio- temporal reasoning as audio 4d intelligence. InInternational Conference on Learning Representations (ICLR), 2026

2026

[47] [47]

MMDU: A multi-turn multi-image dialog understanding benchmark and instruction-tuning dataset for LVLMs.Advances in Neural Information Processing Systems, 37: 8698–8733, 2024

Ziyu Liu, Tao Chu, Yuhang Zang, Xilin Wei, Xiaoyi Dong, Pan Zhang, Zijian Liang, Yuanjun Xiong, Yu Qiao, Dahua Lin, and Jiaqi Wang. MMDU: A multi-turn multi-image dialog understanding benchmark and instruction-tuning dataset for LVLMs.Advances in Neural Information Processing Systems, 37: 8698–8733, 2024

2024

[48] [48]

Visual-RFT: Visual reinforcement fine-tuning

Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-RFT: Visual reinforcement fine-tuning. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2034–2044, 2025

2034

[49] [49]

SPARK: Synergistic policy and reward co-evolving framework.arXiv preprint arXiv:2509.22624, 2025

ZiyuLiu, YuhangZang, ShengyuanDing, YuhangCao, XiaoyiDong, HaodongDuan, DahuaLin, andJiaqi Wang. SPARK: Synergistic policy and reward co-evolving framework.arXiv preprint arXiv:2509.22624, 2025

arXiv 2025

[50] [50]

MIA-DPO: Multi-image augmented direct preference optimization for large vision-language models

Ziyu Liu, Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Haodong Duan, Conghui He, Yuanjun Xiong, Dahua Lin, and Jiaqi Wang. MIA-DPO: Multi-image augmented direct preference optimization for large vision-language models. InInternational Conference on Learning Representations (ICLR), 2025

2025

[51] [51]

Visual-ERM: Reward modeling for visual equivalence.arXiv preprint arXiv:2603.13224, 2026

Ziyu Liu, Shengyuan Ding, Xinyu Fang, Xuanlang Dai, Penghui Yang, Jianze Liang, Jiaqi Wang, Kai Chen, Dahua Lin, and Yuhang Zang. Visual-ERM: Reward modeling for visual equivalence.arXiv preprint arXiv:2603.13224, 2026

Pith/arXiv arXiv 2026

[52] [52]

Mmlongbench-doc: Benchmarking long-context document understanding with visualizations.Advances in Neural Information Processing Systems, 37:95963–96010, 2024

Yubo Ma, Yuhang Zang, Liangyu Chen, Meiqi Chen, Yizhu Jiao, Xinze Li, Xinyuan Lu, Ziyu Liu, Yan Ma, Xiaoyi Dong, et al. Mmlongbench-doc: Benchmarking long-context document understanding with visualizations.Advances in Neural Information Processing Systems, 37:95963–96010, 2024

2024

[53] [53]

Evaluating very long-term conversational memory of llm agents

Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. Evaluating very long-term conversational memory of llm agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13851–13870, 2024

2024

[54] [54]

Nolima: Long-context evaluation beyond literal matching

Ali Modarressi, Hanieh Deilamsalehy, Franck Dernoncourt, Trung Bui, Ryan A. Rossi, Seunghyun Yoon, and Hinrich Schütze. Nolima: Long-context evaluation beyond literal matching. In International Conference on Machine Learning, 2025

2025

[55] [55]

Large language diffusion models

Shen Nie, Fengqi Zhu, Zebin You, et al. Large language diffusion models. arXiv preprint arXiv:2502.09992, 2025

Pith/arXiv arXiv 2025

[56] [56]

Show your work: Scratchpads for intermediate computation with language models

Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, Charles Sutton, and Augustus Odena. Show your work: Scratchpads for intermediate computation with language models. arXiv preprint arXiv:2112.00114, 2021

Pith/arXiv arXiv 2021

[57] [57]

Introducing gpt-5.4, March 2026

OpenAI. Introducing gpt-5.4, March 2026. URL https://openai.com/index/introducing-gpt-5-4/

2026

[58] [58]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems (NeurIPS), 2022

2022

[59] [59]

Game-world: Towards standardized and verifiable evaluation of multimodal game agents

Mingyu Ouyang, Siyuan Hu, Kevin Qinghong Lin, Hwee Tou Ng, and Mike Zheng Shou. Game-world: Towards standardized and verifiable evaluation of multimodal game agents. arXiv preprint arXiv:2604.07429, 2026

Pith/arXiv arXiv 2026

[60] [60]

Balrog: Benchmarking agentic llm and vlm reasoning on games

Davide Paglieri, Bartlomiej Cupial, Samuel Coward, Ulyana Piterbarg, Maciej Wolczyk, Akbir Khan, Eduardo Pignatelli, Lukasz Kucinski, Lerrel Pinto, Rob Fergus, Jakob Nicolaus Foerster, Jack Parker-Holder, and Tim Rocktäschel. Balrog: Benchmarking agentic llm and vlm reasoning on games. arXiv preprint arXiv:2411.13543, 2024

arXiv 2024

[61] [61]

Do the rewards justify the means? Measuring trade-offs between rewards and ethical behavior in the MACHIAVELLI benchmark

Alexander Pan, Jun Shern Chan, Andy Zou, Nathaniel Li, Steven Basart, Thomas Woodside, Jonathan Ng, Hanlin Zhang, Scott Emmons, and Dan Hendrycks. Do the rewards justify the means? Measuring trade-offs between rewards and ethical behavior in the MACHIAVELLI benchmark. In International Conference on Machine Learning, 2023

2023

[62] [62]

Kilt: a benchmark for knowledge intensive language tasks

Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick Lewis, Majid Yazdani, Nicola De Cao, James Thorne, Yacine Jernite, Vladimir Karpukhin, Jean Maillard, Vassilis Plachouras, Tim Rocktäschel, and Sebastian Riedel. Kilt: a benchmark for knowledge intensive language tasks. In Proceedings of the 2021 Conference of the North American Chapter of the Associati...

2021

[63] [63]

Qwen3.5: Towards native multimodal agents, February 2026

Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https://qwen.ai/blog?id=qwen3.5

2026

[64] [64]

Direct preference optimization: Your language model is secretly a reward model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Advances in Neural Information Processing Systems (NeurIPS), 2023

2023

[65] [65]

Vgrp-bench: Visual grid reasoning puzzle benchmark for large vision-language models

Yufan Ren, Konstantinos Tertikas, Shalini Maiti, Junlin Han, Tong Zhang, Sabine Süsstrunk, and Filippos Kokkinos. Vgrp-bench: Visual grid reasoning puzzle benchmark for large vision-language models. arXiv preprint arXiv:2503.23064, 2025

arXiv 2025

[66] [66]

Toolformer: Language models can teach themselves to use tools

Timo Schick, Jane Dwivedi-Yu, Roberto Dessi, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36, 2023

2023

[67] [67]

Evaluating memory structure in llm agents

Alina Shutova, Alexandra Olenina, Ivan Vinogradov, and Anton Sinitsin. Evaluating memory structure in llm agents. arXiv preprint arXiv:2602.11243, 2026

Pith/arXiv arXiv 2026

[68] [68]

A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play

David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and Demis Hassabis. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419):1140–1144, 2018

2018

[69] [69]

SEAgent: Self-evolving computer use agent with autonomous learning from experience

Zeyi Sun, Ziyu Liu, Yuhang Zang, Yuhang Cao, Xiaoyi Dong, Tong Wu, Dahua Lin, and Jiaqi Wang. SEAgent: Self-evolving computer use agent with autonomous learning from experience. arXiv preprint arXiv:2508.04700, 2025

arXiv 2025

[70] [70]

Membench: Towards more comprehensive evaluation on the memory of llm-based agents

Haoran Tan, Zeyu Zhang, Chen Ma, Xu Chen, Quanyu Dai, and Zhenhua Dong. Membench: Towards more comprehensive evaluation on the memory of llm-based agents. In Findings of the Association for Computational Linguistics: ACL 2025, pages 19336–19352, 2025

2025

[71] [71]

Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, S. H. Cai, Yuan Cao, Y. Charles, H. S. Che, Cheng Chen, Guanduo Chen, Huarong Chen, Jia Chen, Jiahao Chen, Jianlong Chen, Jun Chen, Kefan Chen, Liang Chen, Ruijue Chen, Xinhao Chen, Yanru Chen, Yanxu Chen, Yicun Chen, Yimin Chen, Yingjiang Chen, Yuankun Chen, Yujie Chen, Yutian Chen, Zhirong Chen, Ziwei Chen...

Pith/arXiv arXiv 2026

[72] [72]

Beir: A heterogeneous benchmark for zero-shot evaluation of information retrieval models

Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. Beir: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, 2021

2021

[73] [73]

Ada-leval: Evaluating long-context llms with length-adaptable benchmarks

Chonghua Wang, Haodong Duan, Songyang Zhang, Dahua Lin, and Kai Chen. Ada-leval: Evaluating long-context llms with length-adaptable benchmarks. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 3712–3724, 2024

2024

[74] [74]

Multimodal needle in a haystack: Benchmarking long-context capability of multimodal large language models

Hengyi Wang, Haizhou Shi, Shiwei Tan, Weiyi Qin, Wenyuan Wang, Tunyu Zhang, Akshay Nambi, Tanuja Ganu, and Hao Wang. Multimodal needle in a haystack: Benchmarking long-context capability of multimodal large language models. In Proceedings of the 2025 Conference of the Nations of the Americas...

2025

[75] [75]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, volume 35, pages 24824–24837, 2022

2022

[76] [76]

VideoRoPE: What makes for good video rotary position embedding?

Xilin Wei, Xiaoran Liu, Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Jian Tong, Haodong Duan, Qipeng Guo, Jiaqi Wang, Xipeng Qiu, and Dahua Lin. VideoRoPE: What makes for good video rotary position embedding? In Proceedings of the 42nd International Conference on Machine Learning, pages 66118–66136, 2025

2025

[77] [77]

SIM-CoT: Supervised implicit chain-of-thought

Xilin Wei, Xiaoran Liu, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Jiaqi Wang, Xipeng Qiu, and Dahua Lin. SIM-CoT: Supervised implicit chain-of-thought. In International Conference on Learning Representations, 2026

2026

[78] [78]

Longmemeval: Benchmarking chat assistants on long-term interactive memory

Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu. Longmemeval: Benchmarking chat assistants on long-term interactive memory. arXiv preprint arXiv:2410.10813, 2024

Pith/arXiv arXiv 2024

[79] [79]

LongVideoBench: A benchmark for long-context interleaved video-language understanding

Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. LongVideoBench: A benchmark for long-context interleaved video-language understanding. Advances in Neural Information Processing Systems, 37:28828–28857, 2024

2024

[80] [80]

SmartPlay: A benchmark for LLMs as intelligent agents

Yue Wu, Xuan Tang, Tom M. Mitchell, and Yuanzhi Li. SmartPlay: A benchmark for LLMs as intelligent agents. In International Conference on Learning Representations, 2024

2024