M$^3$Exam: Benchmarking Multimodal Memory for Realistic User-Agent Interactions

Fangyuan Zhang; Junle Chen; Qintian Guo; Wei Chen; Wenxuan Liu; Xiaofang Zhou; Yuqian Wu; Zhengjun Huang; Zhoujin Tian

arxiv: 2606.07402 · v1 · pith:EKY4TXLAnew · submitted 2026-06-05 · 💻 cs.CL

Pith reviewed 2026-06-27 22:09 UTC · model grok-4.3

classification 💻 cs.CL

keywords: multimodal memory · benchmark · MLLMs · user-agent interactions · cross-modal grounding · implicit information inference · memory efficiency · query modality bias

The pith

M³Exam benchmark reveals gaps in cross-modal grounding and cross-session reasoning for multimodal language models in realistic user-agent settings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces M³Exam, a query-centric benchmark for multimodal conversational memory built on realistic user-agent interactions. It evaluates models on cross-modal grounding and implicit information inference, showing that current MLLMs and memory systems have persistent gaps in these areas and in handling accumulating context efficiently. The authors also propose M³Proctor, a method that detects query modality bias and accesses raw visuals only when needed, which boosts accuracy while reducing computational costs substantially.

Core claim

M³Exam is a benchmark for multimodal memory in user-agent interactions that includes concealed user information and authentic multimodal files. Benchmarking shows gaps in cross-modal grounding, cross-session reasoning, and the efficiency of handling accumulating multimodal context. M³Proctor detects query modality bias and consumes raw visual sources only on demand, improving accuracy by 13% while cutting index-construction time and retrieved tokens by over 70%.

What carries the argument

M³Exam, a query-centric multimodal conversational memory benchmark, and M³Proctor, a multimodal memory method that detects query modality bias.
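The on-demand idea can be illustrated with a small sketch. This is not the authors' implementation: the keyword list, the `MemoryEntry` type, and both function names are hypothetical, and the keyword match is only a stand-in for whatever modality-bias detector M³Proctor actually uses. The sketch shows the gating pattern alone, in which text summaries are always returned and raw visual sources are attached only when the query appears to need them.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical cue words; the paper's actual bias detector is not described here.
VISUAL_CUES = {"image", "photo", "picture", "chart", "figure", "screenshot", "pdf", "diagram"}


@dataclass
class MemoryEntry:
    summary: str                           # cheap text summary, always available
    raw_visual_path: Optional[str] = None  # raw source, consulted only on demand


def query_is_visually_biased(query: str) -> bool:
    """Toy stand-in for modality-bias detection: keyword match on the query."""
    tokens = {tok.strip(".,?!:;\"'").lower() for tok in query.split()}
    return bool(tokens & VISUAL_CUES)


def retrieve(query: str, memory: list) -> dict:
    """Always return text summaries; attach raw visuals only for visual queries."""
    context = {"summaries": [m.summary for m in memory], "visuals": []}
    if query_is_visually_biased(query):
        context["visuals"] = [m.raw_visual_path for m in memory if m.raw_visual_path]
    return context


memory = [
    MemoryEntry("User shared a quarterly sales chart", "files/q3.png"),
    MemoryEntry("User prefers tea over coffee"),
]
visual_ctx = retrieve("What did the chart show?", memory)
text_ctx = retrieve("What drink does the user prefer?", memory)
```

Under these assumptions, only the first query pulls in `files/q3.png`; the second is answered from summaries alone, which is where savings in retrieved tokens would come from.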

If this is right

Agents will require improved mechanisms for cross-modal grounding to handle realistic interactions.
Memory systems must address cross-session reasoning over accumulating multimodal context.
The efficiency cost of accumulating multimodal context needs to be mitigated for practical deployment.
M³Proctor-style on-demand visual consumption can reduce index-construction time and retrieved tokens significantly.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The modality bias detection in M³Proctor might generalize to improve efficiency in non-memory multimodal tasks such as visual question answering.
Future work could test whether the gaps persist when user information is not concealed but still multimodal.
Lower retrieved token counts could allow agents to maintain longer interaction histories without proportional compute increases.

Load-bearing premise

The M³Exam benchmark construction accurately captures realistic user-agent interactions including concealed user information and authentic multimodal file use.

What would settle it

A direct comparison where M³Proctor fails to deliver the reported accuracy gain or token reduction when applied to a different set of user-agent interaction logs with real multimodal files.

Figures

Figures reproduced from arXiv: 2606.07402 by Fangyuan Zhang, Junle Chen, Qintian Guo, Wei Chen, Wenxuan Liu, Xiaofang Zhou, Yuqian Wu, Zhengjun Huang, Zhoujin Tian.

**Figure 1.** Overview of M³Exam.

**Figure 2.** Overall pipeline of M³Exam, designed to evaluate multimodal memory ability in realistic scenarios.

**Figure 3.** Overview of M³Proctor, designed to detect modality bias within queries and retrieve only on demand. … carry memory across long horizon, each exchange is condensed into a key summary $\sigma_k$ appended to a running memory $S_k = S_{k-1} \cup \{\sigma_k\}$, on which the user simulator $\pi_{\text{user}}$ conditions the next round: $r_{k+1} \sim \pi_{\text{user}}(\cdot \mid \mathcal{W}_{k+1}, S_k, \mathcal{P})$ (1). The round stream is then partitioned into sessions $\mathcal{D} = (D_1, \dots, D_L)$. A two-…

**Figure 4.** Multimodal influence and capability profile on

**Figure 5.** In-depth analysis of cascade ablation and

**Figure 6.**

**Figure 7.** Distributional properties of M³Exam. (1) Memory look-back distance. For each cross-session question, the session-index gap between the earliest and latest round it cites in supporting_facts; the curve plots the fraction of questions whose evidence spans at least $k$ sessions. (2) Turn-length density. Gaussian-kernel density of word counts for user versus assistant turns. (3) QA-type coverage across personas…

**Figure 8.** QA-type composition and difficulty of M³Exam. (A) Distribution. Number and share of the 5,150 QA items across the eight question types. (B) Difficulty map. Each type is placed by its average evidence load ($x$: number of supporting facts) and memory span ($y$: number of distinct sessions the supporting facts span), with bubble radius growing with the fraction of items requiring visual or PDF grounding. Colors…

**Figure 9.** Frontier-MLLM arena on M³Exam. We evaluate eight frontier closed-source multimodal LLMs (GLM5.1, Qwen3.6-Plus, Claude-Opus-4.6, Doubao-Seed-2.0, GPT-5.4, Gemini-3.1-Pro, Kimi-k2.5 and Grok-4) as the answering model on M³Exam, scoring every model under the four metrics (EM, F1, BLEU-1 and LLM-J). Left: overall ranking, where each model is scored under a unified standard combining the four metrics. Right: p…

**Figure 10.** Case study with Implicit Inference Example

**Figure 11.** Case study with Implicit Inference Example

**Figure 12.** The full history adds context and retrieval pressure. For each backbone and metric in the multimodal regime, we compare feeding only the gold supporting session(s) (Session-only, retrieval relieved) against feeding the entire conversation (Full context). Bars are overlapped on a shared baseline; the green tail and its label give the Session-only − Full context gap, i.e. the accuracy lost to processing the …

**Figure 13.** Prompt template for timeline generation. The event count

**Figure 14.** Prompt template for distractor (side-event) generation.

**Figure 15.** Prompt template for the timeline self-check audit.

**Figure 16.** Prompt template for the thematic user simulator (follow-up turn).

**Figure 17.** Prompt template for the thematic assistant.

**Figure 18.** Prompt template for the per-chunk dialogue self-check.

**Figure 19.** Prompt template for the cross-chunk dialogue self-check.

**Figure 20.** Prompt template for per-round summarisation (cross-round memory).

**Figure 21.** Prompt template for the LLM-as-a-Judge metric (five-level rubric).

**Figure 22.** Answering prompt used by M³Proctor. The base template (top) is shared by all question types; the part below the rule is appended only for find-matching-image (fm) questions.
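The memory look-back distance described in the Figure 7 caption can be computed with a few lines of code. The sketch below is an illustration under stated assumptions: the input layout (a list of cited session indices per question) and the function names are hypothetical, the sample data is made up, and "spans at least k sessions" is read here as "session-index gap of at least k", which may differ from the paper's exact convention.

```python
def lookback_distance(cited_sessions: list) -> int:
    """Session-index gap between the earliest and latest cited round."""
    return max(cited_sessions) - min(cited_sessions)


def survival_curve(distances: list) -> dict:
    """Map each k to the fraction of questions with look-back distance >= k."""
    n = len(distances)
    return {k: sum(d >= k for d in distances) / n for k in range(max(distances) + 1)}


# Made-up data: each inner list holds the session indices one question cites.
questions = [[1, 4], [2, 2], [3, 9], [5, 6]]
distances = [lookback_distance(q) for q in questions]
curve = survival_curve(distances)
```

With this toy input the distances are 3, 0, 6 and 1, so the curve starts at 1.0 for k = 0 and falls to 0.25 at k = 6.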

Original abstract

Language agents are increasingly deployed over accumulating multimodal information, yet existing benchmarks assume a human-human form with sparse visuals and straightforward content, evaluating neither reasoning over authentic multimodal file interaction nor the interpretation of concealed user information. We therefore introduce M$^3$Exam, a query-centric multimodal conversational memory benchmark built on realistic user-agent interaction, with multi-dimensional evaluation spanning cross-modal grounding and implicit information inference. Benchmarking MLLMs and memory systems reveals persistent gaps in cross-modal grounding, cross-session reasoning, and the efficiency cost of accumulating multimodal context. We further propose M$^3$Proctor, a multimodal memory method that detects query modality bias and consumes raw visual sources only on demand, improving accuracy by 13% while cutting index-construction time and retrieved tokens by over 70%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

M³Exam targets a real gap in multimodal agent benchmarks with a query-centric setup and on-demand visuals, and the reported efficiency gains look plausible on the surface.

The paper's core move is introducing M³Exam as a benchmark built around realistic user-agent interactions rather than sparse human-human exchanges, plus M³Proctor as a method that detects modality bias and pulls raw visuals only when needed. That distinction matters for agents that accumulate mixed data over sessions.

What the work does well is name concrete failure modes—cross-modal grounding, cross-session reasoning, and the token cost of full context accumulation—and then show a selective approach that claims a 13% accuracy lift while cutting index time and retrieved tokens by over 70%. The on-demand design is a practical response to the efficiency problem.

The soft spots are mostly around verification. The abstract gives the performance deltas but no visible details on how the dataset was constructed, how concealed information was validated, or what controls were used against post-hoc selection. Without those, the gains are hard to weigh. The stress-test found no internal contradiction in the claim structure itself, so nothing load-bearing appears broken on the given evidence.

This is for researchers building or evaluating memory systems for multimodal agents who need benchmarks that reflect accumulating files and implicit user context. A reader already working on agent memory or MLLM evaluation would find the gap analysis and the selective method useful to consider.

It deserves peer review. The idea addresses a documented limitation in prior benchmarks, the method is straightforward enough to test, and the efficiency angle is worth checking with the full experimental details.

Referee Report

2 major / 2 minor

Summary. The paper introduces M³Exam, a query-centric multimodal conversational memory benchmark built from realistic user-agent interactions that include multimodal files and concealed user information. It evaluates MLLMs and memory systems on cross-modal grounding, cross-session reasoning, and efficiency, identifies persistent gaps, and proposes M³Proctor, which detects query modality bias and consumes raw visuals on demand, reporting a 13% accuracy improvement and over 70% reduction in index-construction time and retrieved tokens.

Significance. If the benchmark construction and evaluation hold, the work provides a more realistic testbed for multimodal agent memory than existing human-human sparse-visual benchmarks, and the on-demand processing approach in M³Proctor offers a concrete efficiency mechanism that could influence practical memory system design.

major comments (2)

[Benchmark construction section] The central claim that M³Exam captures 'realistic user-agent interaction' with concealed information and authentic multimodal file use (as opposed to sparse human-human forms) is load-bearing for all downstream results, yet the manuscript provides no explicit validation against real interaction logs, inter-annotator agreement on concealment, or quantitative comparison showing reduced sparsity relative to prior benchmarks.
[Experimental results] Accuracy and efficiency claims: the reported 13% accuracy gain and >70% reductions in index-construction time and retrieved tokens are presented without error bars, number of runs, statistical tests, or an ablation of the modality-bias detection component, making it impossible to determine whether the gains are robust or sensitive to post-hoc choices.

minor comments (2)

[Abstract] Abstract and introduction: the acronym M³Proctor is used before its expansion or high-level description is given.
[Evaluation metrics] Notation: 'cross-modal grounding' and 'implicit information inference' are used as evaluation dimensions without a precise definition or example query in the main text.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate the revisions planned for the manuscript.

Referee: [Benchmark construction section] The central claim that M³Exam captures 'realistic user-agent interaction' with concealed information and authentic multimodal file use (as opposed to sparse human-human forms) is load-bearing for all downstream results, yet the manuscript provides no explicit validation against real interaction logs, inter-annotator agreement on concealment, or quantitative comparison showing reduced sparsity relative to prior benchmarks.

Authors: The referee correctly identifies that the manuscript lacks explicit validation against real interaction logs and does not report inter-annotator agreement. Section 3 describes a query-centric construction process that simulates user-agent interactions with multimodal files and concealed information drawn from typical real-world patterns. In revision we will add a table with quantitative sparsity metrics (visual elements per turn and concealment rate) comparing M³Exam to prior human-human benchmarks. We will also clarify that concealment follows deterministic rules rather than subjective judgments, rendering inter-annotator agreement inapplicable. Direct validation against real logs is not feasible because the benchmark uses simulated interactions and we do not have access to proprietary real-world logs.

revision: partial

Referee: [Experimental results] Accuracy and efficiency claims: the reported 13% accuracy gain and >70% reductions in index-construction time and retrieved tokens are presented without error bars, number of runs, statistical tests, or an ablation of the modality-bias detection component, making it impossible to determine whether the gains are robust or sensitive to post-hoc choices.

Authors: We agree that the experimental presentation would be strengthened by additional statistical detail. The revised manuscript will report all accuracy and efficiency numbers as means over five independent runs with standard error bars, include paired statistical significance tests for the 13% accuracy improvement, and add an ablation isolating the modality-bias detection component of M³Proctor. These additions will demonstrate that the reported gains are robust rather than sensitive to post-hoc choices.

revision: yes
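The statistical reporting promised in this response (means over five runs with standard errors, plus a paired test) can be sketched with the Python standard library. The accuracy values below are invented for illustration and are not results from the paper; the function names are likewise hypothetical, and a paired t statistic is only one of several paired tests the authors might choose.

```python
import math
import statistics


def mean_and_stderr(scores: list) -> tuple:
    """Sample mean and standard error (sample std dev / sqrt(n))."""
    return statistics.mean(scores), statistics.stdev(scores) / math.sqrt(len(scores))


def paired_t_statistic(a: list, b: list) -> float:
    """Paired t statistic: mean of per-run differences over their standard error."""
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff, se_diff = mean_and_stderr(diffs)
    return mean_diff / se_diff


# Invented accuracies for five paired runs (same seeds for both systems).
baseline = [0.50, 0.52, 0.49, 0.51, 0.50]
proposed = [0.63, 0.64, 0.63, 0.65, 0.62]

base_mean, base_se = mean_and_stderr(baseline)
prop_mean, prop_se = mean_and_stderr(proposed)
t_stat = paired_t_statistic(proposed, baseline)
```

With five runs the test has four degrees of freedom, so a two-sided test at the 0.05 level needs a t statistic whose absolute value exceeds about 2.776.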

standing simulated objections not resolved

Direct validation against real interaction logs

Circularity Check

0 steps flagged

No significant circularity; benchmark and method are empirically grounded without derivations or self-referential reductions.

The paper introduces M³Exam as a new benchmark for multimodal memory in user-agent interactions and proposes M³Proctor as an on-demand retrieval method. No equations, derivations, or parameter-fitting steps are present in the abstract or described methodology. Claims rest on empirical accuracy and efficiency gains (a 13% accuracy improvement and reductions of over 70% in index-construction time and retrieved tokens) measured against the introduced benchmark, without any self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations. The construction of the benchmark is presented as an independent contribution rather than derived from prior fitted results. This is a standard empirical benchmark paper with no internal reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no information on free parameters, axioms, or invented entities; ledger is empty.

pith-pipeline@v0.9.1-grok · 5685 in / 1086 out tokens · 25572 ms · 2026-06-27T22:09:28.545214+00:00 · methodology


[34] [34]

Xue, Haochen and Tang, Feilong and Hu, Ming and Liu, Yexin and Huang, Qidong and Li, Yulong and Liu, Chengzhi and Xu, Zhongxing and Zhang, Chong and Feng, Chun-Mei and Xie, Yutong and Razzak, Imran and Ge, Zongyuan and Su, Jionglong and He, Junjun and Qiao, Yu , journal =

[35] [35]

Xu, Wujiang and Liang, Zujie and Mei, Kai and Gao, Hang and Tan, Juntao and Zhang, Yongfeng , booktitle =

[36] [36]

Mem0: Building Production-Ready

Chhikara, Prateek and Khant, Dev and Aryan, Saket and Singh, Taranjeet and Yadav, Deshraj , journal =. Mem0: Building Production-Ready

[37] [37]

Li, Zhiyu and Xi, Chenyang and Li, Chunyu and Chen, Ding and Chen, Boyu and Song, Shichao and Niu, Simin and Wang, Hanyu and Yang, Jiawei and Tang, Chen and Yu, Qingchen and Zhao, Jihao and Wang, Yezhaohui and Liu, Peng and Lin, Zehao and Wang, Pengyuan and Huo, Jiahao and Chen, Tianyi and Chen, Kai and Li, Kehang and Tao, Zhen and Lai, Huayi and Wu, Hao ...

[38] [38]

Wu, Di and Wang, Hongwei and Yu, Wenhao and Zhang, Yuwei and Chang, Kai-Wei and Yu, Dong , booktitle =

[39] [39]

Chen, Zhiyu and Li, Shiyang and Smiley, Charese and Ma, Zhiqiang and Shah, Sameena and Wang, William Yang , booktitle =

[40] [40]

Feng, Song and Wan, Hui and Gunasekara, Chulaka and Patel, Siva and Joshi, Sachindra and Lastras, Luis , booktitle =

[41] [41]

Feng, Song and Patel, Siva and Wan, Hui and Joshi, Sachindra , booktitle =

[42] [42]

Bai, Shuai and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and others , journal =

[43] [43]

Achiam, Josh and Adler, Steven and Agarwal, Sandhini and Ahmad, Lama and Akkaya, Ilge and others , journal =

[44] [44]

Advances in Neural Information Processing Systems (NeurIPS) , year =

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models , author =. Advances in Neural Information Processing Systems (NeurIPS) , year =

[45] [45]

Retrieval-Augmented Generation for Knowledge-Intensive

Lewis, Patrick and Perez, Ethan and Piktus, Aleksandra and Petroni, Fabio and Karpukhin, Vladimir and Goyal, Naman and K. Retrieval-Augmented Generation for Knowledge-Intensive. Advances in Neural Information Processing Systems (NeurIPS) , year =

[46] [46]

Yeo, Woongyeong and Kim, Kangsan and Jeong, Soyeong and Baek, Jinheon and Hwang, Sung Ju , journal =

[47] [47]

arXiv preprint arXiv:2510.12323 , year =

RAG-Anything: All-in-One RAG Framework , author =. arXiv preprint arXiv:2510.12323 , year =

arXiv

[48] [48]

arXiv preprint arXiv:2512.03627 , year =

MemVerse: Multimodal Memory for Lifelong Learning Agents , author =. arXiv preprint arXiv:2512.03627 , year =

Pith/arXiv arXiv

[49] [49]

Neural Graph Memory: A Structured Approach to Long-Term Memory in Multimodal Agents , author =

[50] [50]

arXiv preprint arXiv:2507.07957 , year =

MIRIX: Multi-Agent Memory System for LLM-Based Agents , author =. arXiv preprint arXiv:2507.07957 , year =

Pith/arXiv arXiv

[51] [51]

Rajpurkar, Pranav and Zhang, Jian and Lopyrev, Konstantin and Liang, Percy , booktitle =

[52] [52]

Papineni, Kishore and Roukos, Salim and Ward, Todd and Zhu, Wei-Jing , booktitle =

[53] [53]

Introduction to Information Retrieval , author =

[54] [54]

arXiv preprint arXiv:2510.13291 , year=

Higher Satisfaction, Lower Cost: A Technical Report on How LLMs Revolutionize Meituan's Intelligent Interaction Systems , author=. arXiv preprint arXiv:2510.13291 , year=

arXiv

[55] [55]

arXiv preprint arXiv:2602.16313 , year=

Memoryarena: Benchmarking agent memory in interdependent multi-session agentic tasks , author=. arXiv preprint arXiv:2602.16313 , year=

arXiv

[56] [56]

arXiv preprint arXiv:2601.03515 , year=

Mem-gallery: Benchmarking multimodal long-term conversational memory for mllm agents , author=. arXiv preprint arXiv:2601.03515 , year=

arXiv

[57] [57]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Slidevqa: A dataset for document visual question answering on multiple images , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

[58] [58]

Advances in Neural Information Processing Systems , volume=

Mmlongbench-doc: Benchmarking long-context document understanding with visualizations , author=. Advances in Neural Information Processing Systems , volume=

[59] [59]

ACM Transactions on Information Systems , volume=

A survey on the memory mechanism of large language model-based agents , author=. ACM Transactions on Information Systems , volume=. 2025 , publisher=

2025

[60] [60]

Zhong, Wanjun and Guo, Lianghong and Gao, Qiqi and Ye, He and Wang, Yanlin , booktitle =

[61] [61]

Liu, Ziyu and Chu, Tao and Zang, Yuhang and Wei, Xilin and Dong, Xiaoyi and Zhang, Pan and Liang, Zijian and Xiong, Yuanjun and Qiao, Yu and Lin, Dahua and Wang, Jiaqi , journal =

[62] [62]

Hu, Yuanzhe and others , journal =

[63] [63]

arXiv preprint arXiv:2507.05257 , year=

Evaluating memory in llm agents via incremental multi-turn interactions , author=. arXiv preprint arXiv:2507.05257 , year=

Pith/arXiv arXiv

[64] [64]

2026 , month =

Introducing. 2026 , month =

2026

[65] [65]

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

Memory os of ai agent , author=. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

2025