arxiv: 2605.31075 · v1 · pith:XVPR62QUnew · submitted 2026-05-29 · 💻 cs.CV

Task-Focused Memorization for Multimodal Agents

Pith reviewed 2026-06-28 23:13 UTC · model grok-4.3

classification 💻 cs.CV
keywords TaskMem · multimodal agents · memorization policy · reinforcement learning · video question answering · streaming benchmarks · long-term memory · embodied agents

The pith

TaskMem uses reinforcement learning to train multimodal agents to decide which observations to retain for future tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper frames the challenge of long-term memory in multimodal agents as deciding what to memorize from continuous streams of video and other observations rather than how to store it. It presents TaskMem as a reinforcement-learning framework with a two-phase process: the first phase builds basic memory fidelity, while the second phase tunes the policy after deployment by using rewards derived from recent tasks the agent has performed. This selective approach is evaluated on reformulated streaming versions of VideoMME, EgoLife, and EgoTempo, where the agent must answer questions using only its memory and no access to the original video input. Built on Qwen3-VL-30B-A3B, the method reports accuracy gains of 6.3 percent, 7.0 percent, and 5.3 percent on the three benchmarks.
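The memory-only evaluation described above can be sketched as a simple event loop. This is a toy illustration under stated assumptions, not the paper's benchmark code: the event format, `run_stream`, `keep_if_mentions_keys`, and `lookup_answer` are hypothetical names, and plain strings stand in for video observations.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical event format: ("obs", text) or ("task", (question, gold_answer)).
Event = Tuple[str, object]


def run_stream(
    events: Iterable[Event],
    memorize: Callable[[str], List[str]],
    answer: Callable[[str, List[str]], str],
) -> float:
    """Process observations and tasks in arrival order; answer from memory only."""
    memory: List[str] = []
    correct = 0
    total = 0
    for kind, payload in events:
        if kind == "obs":
            # Only what the policy writes is kept; the raw observation is dropped.
            memory.extend(memorize(payload))
        elif kind == "task":
            question, gold = payload
            total += 1
            correct += int(answer(question, memory) == gold)
    return correct / total if total else 0.0


def keep_if_mentions_keys(obs: str) -> List[str]:
    """Toy memorization policy (assumption): retain only observations about keys."""
    return [obs] if "keys" in obs else []


def lookup_answer(question: str, memory: List[str]) -> str:
    """Toy answerer (assumption): reply with the most recent memory entry."""
    return memory[-1] if memory else "unknown"


toy_events = [
    ("obs", "cat on sofa"),
    ("obs", "keys in drawer"),
    ("task", ("Where are the keys?", "keys in drawer")),
    ("obs", "mug on desk"),
    ("task", ("Where is the mug?", "mug on desk")),
]
toy_accuracy = run_stream(toy_events, keep_if_mentions_keys, lookup_answer)
```

In this toy stream the policy retains the keys observation but not the mug, so the second question cannot be answered: whatever the policy discards is unavailable to every later task.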

Core claim

TaskMem is a reinforcement-learning-based framework that enables the memorization policy to dynamically adjust its focus to the demands of real tasks encountered in the environment through a two-phase training paradigm, where Phase Two uses recent environment tasks to define a reward model that guides the policy toward task-relevant content.

What carries the argument

Task-focused Memorization Policy Learning (TaskMem), a two-phase RL process in which Phase One optimizes memory quality under fidelity requirements and Phase Two tunes an adapter on the base MLLM using a reward model from recent tasks.
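As a rough illustration of how the two reward signals differ, the sketch below scores a candidate memory in two ways. It is a toy under stated assumptions: the paper's abstract does not specify either reward, so `fidelity_reward`, `task_reward`, the `Task` type, and the set-of-facts memory representation are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Sequence, Set


@dataclass(frozen=True)
class Task:
    """Hypothetical task: a question plus the facts needed to answer it."""
    question: str
    required_facts: frozenset


def fidelity_reward(memory: Set[str], observed_facts: Set[str]) -> float:
    """Phase One stand-in: fraction of memory entries grounded in observations."""
    if not memory:
        return 0.0
    return len(memory & observed_facts) / len(memory)


def task_reward(memory: Set[str], recent_tasks: Sequence[Task]) -> float:
    """Phase Two stand-in: fraction of recent tasks answerable from memory alone."""
    if not recent_tasks:
        return 0.0
    solved = sum(1 for task in recent_tasks if task.required_facts <= memory)
    return solved / len(recent_tasks)


observed = {"red mug on desk", "door opened at 9am", "cat on sofa", "keys in drawer"}
recent = [
    Task("Where are the keys?", frozenset({"keys in drawer"})),
    Task("When did the door open?", frozenset({"door opened at 9am"})),
]

# Two memories of equal size, both fully faithful to what was observed.
memory_uniform = {"red mug on desk", "cat on sofa"}
memory_focused = {"keys in drawer", "door opened at 9am"}
```

Both memories score the same on the fidelity stand-in, but only `memory_focused` earns the task reward. That gap is the signal Phase Two is described as exploiting.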

If this is right

  • Agents can answer video questions using only stored memory without re-accessing raw observations.
  • Memorization becomes task-conditioned rather than uniform across all incoming data.
  • The policy can be updated post-deployment without retraining the entire base model.
  • Performance on long-horizon multimodal tasks improves when memory is selective instead of exhaustive.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Memory design in embodied agents may shift from capacity-focused modules to policy-focused selection mechanisms.
  • The two-phase structure could be tested in non-video domains such as robotic control or document streams to check if task-defined rewards remain effective.
  • If recent tasks poorly predict future ones, the approach suggests adding an explicit prediction step for upcoming task types before reward definition.

Load-bearing premise

The reward model built from recent tasks will correctly identify which content will prove valuable for future tasks the agent has not yet seen.

What would settle it

Measure whether the reported accuracy gains on the streaming benchmarks disappear or reverse when the agent is evaluated on a new set of tasks whose relevant observations were not rewarded during the Phase Two adaptation period.
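One way to set up that comparison is to report the accuracy gain separately for task topics that contributed to the Phase Two reward and for topics that did not. The sketch below is a minimal version under stated assumptions: the per-question `(topic, is_correct)` format, the function names, and the toy numbers are illustrative and are not taken from the paper.

```python
from typing import Dict, List, Set, Tuple

Result = Tuple[str, bool]  # (task topic, answered correctly?)


def accuracy(results: List[Result], topics: Set[str]) -> float:
    """Accuracy over the results whose topic is in `topics`."""
    picked = [ok for topic, ok in results if topic in topics]
    return sum(picked) / len(picked) if picked else float("nan")


def gain_by_overlap(
    baseline: List[Result],
    adapted: List[Result],
    rewarded_topics: Set[str],
) -> Dict[str, float]:
    """Accuracy gain of the adapted policy, split by whether the topic was rewarded."""
    all_topics = {topic for topic, _ in baseline} | {topic for topic, _ in adapted}
    unseen_topics = all_topics - rewarded_topics
    return {
        "rewarded": accuracy(adapted, rewarded_topics) - accuracy(baseline, rewarded_topics),
        "unseen": accuracy(adapted, unseen_topics) - accuracy(baseline, unseen_topics),
    }


# Made-up outcomes for illustration only.
toy_baseline = [("cooking", True), ("cooking", False), ("navigation", True), ("navigation", False)]
toy_adapted = [("cooking", True), ("cooking", True), ("navigation", True), ("navigation", False)]
toy_gains = gain_by_overlap(toy_baseline, toy_adapted, rewarded_topics={"cooking"})
```

In the toy data the whole gain sits in the rewarded topic and none in the unseen one. On real benchmarks, that pattern would indicate the policy is exploiting overlap rather than generalizing.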

Original abstract

Long-term memory is essential for multimodal agents to build coherent experience, accumulate world knowledge, and achieve continual learning. However, constructing effective memory goes beyond memory module design and basic requirements such as accuracy and fidelity; the key challenge lies in determining what to memorize. Multimodal agents, such as embodied agents, continuously perceive, reason, and act in real or virtual environments, receiving an unbounded stream of multimodal observations. From this combinatorial explosion of information, an agent must selectively retain content that is relevant to its role in the environment and valuable for future tasks. To bridge this gap, we frame memory generation as a learnable memorization policy and introduce TaskMem (Task-focused Memorization Policy Learning), a reinforcement-learning-based framework that enables the policy to dynamically adjust its focus to the demands of real tasks encountered in the environment. TaskMem adopts a two-phase training paradigm: Phase One learns how to memorize by optimizing memory quality under fundamental fidelity requirements; Phase Two occurs after deployment, where the agent learns what to memorize by tuning an adapter on its base MLLM, using recent environment tasks to define a reward model that guides the memorization policy toward task-relevant content. To evaluate our approach, we reformulate VideoMME, EgoLife, and EgoTempo into streaming benchmarks that simulate a realistic setting in which an agent processes streaming observations and handles tasks arriving online. To isolate memory assessment, the questions must be answered using only the agent's memory, without access to raw video. Built on Qwen3-VL-30B-A3B, TaskMem improves VQA accuracy by 6.3%, 7.0%, and 5.3% on these benchmarks, respectively.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces TaskMem, a reinforcement-learning framework for task-focused memorization in multimodal agents operating on unbounded streams of observations. It uses a two-phase paradigm: Phase One optimizes a memorization policy for memory quality and fidelity requirements, while Phase Two (post-deployment) tunes an adapter on the base MLLM via a reward model derived from recent environment tasks to guide selection of task-relevant content. The method is evaluated by reformulating VideoMME, EgoLife, and EgoTempo into streaming benchmarks where VQA questions must be answered from memory alone (no raw video access), reporting accuracy gains of 6.3%, 7.0%, and 5.3% on Qwen3-VL-30B-A3B.

Significance. If the reported gains are shown to arise from genuine forward generalization rather than task overlap, the framework would address a central open problem in continual learning for embodied and multimodal agents by making memorization policy learnable and task-adaptive. The two-phase separation and streaming-benchmark reformulation are practical contributions that could influence memory design in agent systems.

major comments (2)
  1. [Abstract] The central performance claim (accuracy gains of 6.3/7.0/5.3%) is presented without any description of the baselines, statistical significance tests, error bars, data splits, or ablation isolating Phase One versus Phase Two contributions. These omissions make it impossible to assess whether the gains are load-bearing evidence for the framework.
  2. [Phase Two description, abstract] The reward model is constructed from recent environment tasks to tune the memorization policy, yet the manuscript provides no ablation or held-out evaluation that separates the recent-task distribution from the future tasks on which performance is measured. Without this, the reported improvements on the streaming benchmarks do not yet demonstrate that the policy generalizes to unseen future tasks rather than exploiting distributional overlap.
minor comments (1)
  1. [Abstract] The abstract states that questions are answered using only the agent's memory but does not specify how memory retrieval or conditioning is implemented at inference time; a brief description would improve reproducibility.
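The error bars requested in the first major comment could be reported as a paired per-seed gain with its standard error. The sketch below is a minimal version of that summary, assuming each seed is run for both the baseline and the method; `paired_gain_summary` is a hypothetical helper and the accuracies are made-up placeholders, not results from the paper.

```python
import math
import statistics
from typing import List, Tuple


def paired_gain_summary(
    baseline_acc: List[float],
    method_acc: List[float],
) -> Tuple[float, float]:
    """Return (mean gain, standard error of the mean) over paired seeds."""
    if len(baseline_acc) != len(method_acc) or len(baseline_acc) < 2:
        raise ValueError("need at least two paired runs of equal length")
    gains = [m - b for b, m in zip(baseline_acc, method_acc)]
    mean_gain = statistics.mean(gains)
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    std_err = statistics.stdev(gains) / math.sqrt(len(gains))
    return mean_gain, std_err


# Placeholder accuracies for three seeds (illustrative only).
toy_baseline_acc = [0.50, 0.52, 0.51]
toy_method_acc = [0.55, 0.59, 0.57]
toy_mean_gain, toy_std_err = paired_gain_summary(toy_baseline_acc, toy_method_acc)
```

Pairing by seed removes run-to-run variance that the baseline and the method share, which is why the gain is summarized per seed rather than from two independent means.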

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful feedback highlighting the need for greater clarity in the abstract and stronger evidence of generalization in Phase Two. We address each major comment below and commit to revisions that strengthen the manuscript without misrepresenting the current results.

Point-by-point responses
  1. Referee: [Abstract] The central performance claim (accuracy gains of 6.3/7.0/5.3%) is presented without any description of the baselines, statistical significance tests, error bars, data splits, or ablation isolating Phase One versus Phase Two contributions. These omissions make it impossible to assess whether the gains are load-bearing evidence for the framework.

    Authors: We agree the abstract is overly concise due to length limits and omits key experimental details. The full manuscript (Sections 4 and 5) specifies the baselines (non-RL memorization policies and standard MLLM memory baselines), reports results aggregated over multiple random seeds with statistical significance testing, includes error bars, details the train/test splits used in the reformulated streaming benchmarks, and provides ablations isolating Phase One (fidelity optimization) from Phase Two (task-adaptive tuning). We will revise the abstract to briefly reference these elements and direct readers to the relevant sections.

    Revision: yes

  2. Referee: [Phase Two description, abstract] The reward model is constructed from recent environment tasks to tune the memorization policy, yet the manuscript provides no ablation or held-out evaluation that separates the recent-task distribution from the future tasks on which performance is measured. Without this, the reported improvements on the streaming benchmarks do not yet demonstrate that the policy generalizes to unseen future tasks rather than exploiting distributional overlap.

    Authors: The streaming benchmarks are constructed so that tasks arrive online after the observation stream has been processed, with the reward model derived only from tasks encountered up to the current point. However, we acknowledge that the current results do not include an explicit ablation or held-out evaluation that strictly separates the recent-task distribution used for the reward model from the future tasks used for final measurement. We will add such a held-out ablation in the revision to directly demonstrate generalization beyond distributional overlap.

    Revision: yes

Circularity Check

0 steps flagged

No circularity; empirical RL training procedure with independent benchmark evaluation

Full rationale

The paper describes TaskMem as a two-phase reinforcement learning framework for learning a memorization policy, with Phase One optimizing memory fidelity and Phase Two using a reward model derived from recent tasks to tune an adapter. Reported gains (6.3/7.0/5.3% VQA accuracy) are measured empirically on reformulated streaming benchmarks (VideoMME, EgoLife, EgoTempo) where questions are answered from memory alone. No equations, derivations, or self-citations are presented that reduce the claimed improvements to fitted parameters or inputs by construction. The method is self-contained against external benchmarks and does not invoke uniqueness theorems or ansatzes from prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies insufficient detail to enumerate free parameters, axioms, or invented entities; the reward model and adapter are introduced but their parameterization is not specified.

pith-pipeline@v0.9.1-grok · 5838 in / 1108 out tokens · 21398 ms · 2026-06-28T23:13:18.989735+00:00 · methodology
