Task-Focused Memorization for Multimodal Agents

Hang Li; Tao Zou; Tian Qiu; Yichen He; Yuan Lin

arxiv: 2605.31075 · v1 · pith:XVPR62QU · submitted 2026-05-29 · 💻 cs.CV

Pith reviewed 2026-06-28 23:13 UTC · model grok-4.3

classification 💻 cs.CV

keywords TaskMem · multimodal agents · memorization policy · reinforcement learning · video question answering · streaming benchmarks · long-term memory · embodied agents

The pith

TaskMem uses reinforcement learning to train multimodal agents on what observations to retain for future tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper frames the challenge of long-term memory in multimodal agents as deciding what to memorize from continuous streams of video and other observations, rather than how to store it. It presents TaskMem as a reinforcement-learning framework with a two-phase process: the first phase builds basic memory fidelity, while the second phase tunes the policy after deployment using rewards derived from recent tasks the agent has performed. This selective approach is evaluated on reformulated streaming versions of VideoMME, EgoLife, and EgoTempo, where the agent must answer questions from its memory alone, with no access to the original video. Built on Qwen3-VL-30B-A3B, the method reports accuracy gains of 6.3%, 7.0%, and 5.3% on the three benchmarks, respectively.

Core claim

TaskMem is a reinforcement-learning-based framework that enables the memorization policy to dynamically adjust its focus to the demands of real tasks encountered in the environment through a two-phase training paradigm, where Phase Two uses recent environment tasks to define a reward model that guides the policy toward task-relevant content.

What carries the argument

Task-focused Memorization Policy Learning (TaskMem), a two-phase RL process in which Phase One optimizes memory quality under fidelity requirements and Phase Two tunes an adapter on the base MLLM using a reward model from recent tasks.
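To make the Phase Two idea concrete, here is a deliberately crude sketch of a reward that scores a candidate memory by how many recent task questions it appears relevant to. Everything in it is an assumption for illustration: `task_reward`, `tokenize`, `STOPWORDS`, the keyword-overlap scoring, and the example strings are hypothetical stand-ins. The abstract does not specify the actual reward model, the adapter, or the RL update, so this should not be read as the authors' method.

```python
# Hypothetical toy reward: NOT the paper's reward model, only an illustration
# of "recent tasks define what counts as worth memorizing".

STOPWORDS = {"the", "a", "is", "was", "what", "where", "did", "on", "in", "of", "to"}


def tokenize(text: str) -> set[str]:
    """Lowercased word set with basic punctuation stripped."""
    return {w.strip(".,?!").lower() for w in text.split() if w.strip(".,?!")}


def task_reward(memory: str, recent_questions: list[str]) -> float:
    """Fraction of recent questions that share a content word with the memory."""
    if not recent_questions:
        return 0.0
    mem_words = tokenize(memory) - STOPWORDS
    hits = sum(
        1 for q in recent_questions if mem_words & (tokenize(q) - STOPWORDS)
    )
    return hits / len(recent_questions)


# Made-up example inputs (not from any benchmark).
recent = ["Where did the user leave the keys?", "What was on the kitchen table?"]
candidates = ["Keys were placed on the kitchen table.", "A bird flew past the window."]
ranked = sorted(candidates, key=lambda m: task_reward(m, recent), reverse=True)
```

In this toy setting the first candidate overlaps both recent questions and the second overlaps neither, so a policy trained against such a signal would be pushed toward the first. A learned reward model would replace the keyword overlap.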

If this is right

Agents can answer video questions using only stored memory without re-accessing raw observations.
Memorization becomes task-conditioned rather than uniform across all incoming data.
The policy can be updated post-deployment without retraining the entire base model.
Performance on long-horizon multimodal tasks improves when memory is selective instead of exhaustive.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Memory design in embodied agents may shift from capacity-focused modules to policy-focused selection mechanisms.
The two-phase structure could be tested in non-video domains such as robotic control or document streams to check if task-defined rewards remain effective.
If recent tasks poorly predict future ones, the approach suggests adding an explicit prediction step for upcoming task types before reward definition.

Load-bearing premise

The reward model built from recent tasks will correctly identify which content will prove valuable for future tasks the agent has not yet seen.

What would settle it

Measure whether the reported accuracy gains on the streaming benchmarks disappear or reverse when the agent is evaluated on a new set of tasks whose relevant observations were not rewarded during the Phase Two adaptation period.
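One way such a check could be scored is sketched below: evaluation questions are split by whether their task type appeared during Phase Two adaptation, and the accuracy gain is computed separately for each group. The `EvalItem` record, the `task_type` labels, and `gain_by_novelty` are hypothetical names introduced here; the paper's benchmarks may not expose task types in this form.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvalItem:
    task_type: str          # hypothetical label for the kind of question
    base_correct: bool      # answer correct before Phase Two adaptation
    adapted_correct: bool   # answer correct after Phase Two adaptation


def gain_by_novelty(
    items: list[EvalItem], adaptation_task_types: set[str]
) -> dict[str, Optional[float]]:
    """Accuracy gain (adapted minus base) for seen vs. novel task types."""
    groups: dict[str, list[EvalItem]] = {"seen": [], "novel": []}
    for item in items:
        key = "seen" if item.task_type in adaptation_task_types else "novel"
        groups[key].append(item)

    gains: dict[str, Optional[float]] = {}
    for key, group in groups.items():
        if not group:
            gains[key] = None  # no items in this group, gain undefined
            continue
        base = sum(i.base_correct for i in group) / len(group)
        adapted = sum(i.adapted_correct for i in group) / len(group)
        gains[key] = adapted - base
    return gains
```

If the gain on the "novel" group is near zero or negative while the "seen" group carries the improvement, that would point to task overlap rather than forward-looking selection.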

Original abstract

Long-term memory is essential for multimodal agents to build coherent experience, accumulate world knowledge, and achieve continual learning. However, constructing effective memory goes beyond memory module design and basic requirements such as accuracy and fidelity; the key challenge lies in determining what to memorize. Multimodal agents, such as embodied agents, continuously perceive, reason, and act in real or virtual environments, receiving an unbounded stream of multimodal observations. From this combinatorial explosion of information, an agent must selectively retain content that is relevant to its role in the environment and valuable for future tasks. To bridge this gap, we frame memory generation as a learnable memorization policy and introduce TaskMem (Task-focused Memorization Policy Learning), a reinforcement-learning-based framework that enables the policy to dynamically adjust its focus to the demands of real tasks encountered in the environment. TaskMem adopts a two-phase training paradigm: Phase One learns how to memorize by optimizing memory quality under fundamental fidelity requirements; Phase Two occurs after deployment, where the agent learns what to memorize by tuning an adapter on its base MLLM, using recent environment tasks to define a reward model that guides the memorization policy toward task-relevant content. To evaluate our approach, we reformulate VideoMME, EgoLife, and EgoTempo into streaming benchmarks that simulate a realistic setting in which an agent processes streaming observations and handles tasks arriving online. To isolate memory assessment, the questions must be answered using only the agent's memory, without access to raw video. Built on Qwen3-VL-30B-A3B, TaskMem improves VQA accuracy by 6.3%, 7.0%, and 5.3% on these benchmarks, respectively.
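The memory-only streaming protocol described in the abstract can be pictured with a small evaluation loop. This is a minimal sketch under stated assumptions: the event format, `run_stream`, and the toy `memorize` and `answer` callables are hypothetical, and the real benchmarks use an MLLM rather than string matching.

```python
from typing import Callable, Iterable, Optional


def run_stream(
    events: Iterable[tuple],
    memorize: Callable[[str], Optional[str]],
    answer: Callable[[str, list[str]], str],
) -> float:
    """Process observations and tasks in arrival order; return accuracy.

    Each event is ("obs", observation) or ("task", (question, gold_answer)).
    """
    memory: list[str] = []
    correct = total = 0
    for kind, payload in events:
        if kind == "obs":
            entry = memorize(payload)  # the policy decides what, if anything, to keep
            if entry is not None:
                memory.append(entry)
        elif kind == "task":
            question, gold = payload
            # Only the memory is passed in; raw observations are never revisited.
            prediction = answer(question, memory)
            correct += int(prediction == gold)
            total += 1
    return correct / total if total else 0.0


def toy_answer(question: str, memory: list[str]) -> str:
    """Toy answerer for questions shaped like 'Was a <noun> seen?'."""
    noun = question.rstrip("?").split()[-2]
    return "yes" if any(noun in m for m in memory) else "no"


# Made-up stream: observations and questions interleaved.
events = [
    ("obs", "red cup on desk"),
    ("obs", "clouds drifting"),
    ("task", ("Was a cup seen?", "yes")),
    ("obs", "door closes"),
    ("task", ("Was a dog seen?", "no")),
]
```

Swapping the `memorize` callable is the only thing that changes between policies, which is what lets the protocol isolate memory quality from answering ability.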

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TaskMem's two-phase RL framing for selective memory is a clean idea but the reported gains rest on an untested assumption that rewards from recent tasks will pick content useful for future unseen ones.

The paper's core move is to treat memorization as a policy that first learns basic fidelity and is then tuned post-deployment through an adapter whose reward comes from recent environment tasks. That separation is the main novelty; it is not just another memory module but a way to make the selection itself task-driven via RL.

What works is the benchmark reformulation. Turning VideoMME, EgoLife, and EgoTempo into streaming VQA settings where answers must come from memory alone is a useful testbed for continual agents, and the reported lifts of 5.3% to 7.0% on Qwen3-VL-30B-A3B are concrete enough to notice.

The soft spot is the generalization claim. Phase Two defines the reward from recent tasks, yet nothing in the abstract shows that this reward identifies content valuable for tasks the agent has not yet seen. Without an ablation that holds out future task distributions or measures how much the gains depend on task overlap, the improvements could reflect fitting to correlated data rather than forward-looking selection. The lack of baselines, error bars, or phase ablations in the reported numbers makes it hard to judge how much the two-phase structure actually contributes.

This is aimed at researchers building memory systems for embodied or long-video agents who already work with RL adapters. A reader looking for a practical recipe to try on their own streaming setup could extract the two-phase template and the memory-only evaluation protocol.

It deserves a serious referee. The problem is real, the framing is distinct from standard memory modules, and the empirical direction is worth checking even if the current evidence is preliminary. I would send it out but flag the need for explicit tests on reward generalization and full experimental details.

Referee Report

2 major / 1 minor

Summary. The paper introduces TaskMem, a reinforcement-learning framework for task-focused memorization in multimodal agents operating on unbounded streams of observations. It uses a two-phase paradigm: Phase One optimizes a memorization policy for memory quality and fidelity requirements, while Phase Two (post-deployment) tunes an adapter on the base MLLM via a reward model derived from recent environment tasks to guide selection of task-relevant content. The method is evaluated by reformulating VideoMME, EgoLife, and EgoTempo into streaming benchmarks where VQA questions must be answered from memory alone (no raw video access), reporting accuracy gains of 6.3%, 7.0%, and 5.3% on Qwen3-VL-30B-A3B.

Significance. If the reported gains are shown to arise from genuine forward generalization rather than task overlap, the framework would address a central open problem in continual learning for embodied and multimodal agents by making memorization policy learnable and task-adaptive. The two-phase separation and streaming-benchmark reformulation are practical contributions that could influence memory design in agent systems.

major comments (2)

[Abstract] The central performance claim (accuracy gains of 6.3/7.0/5.3%) is presented without any description of the baselines, statistical significance tests, error bars, data splits, or ablation isolating Phase One versus Phase Two contributions. These omissions make it impossible to assess whether the gains are load-bearing evidence for the framework.
[Phase Two description, abstract] The reward model is constructed from recent environment tasks to tune the memorization policy, yet the manuscript provides no ablation or held-out evaluation that separates the recent-task distribution from the future tasks on which performance is measured. Without this, the reported improvements on the streaming benchmarks do not yet demonstrate that the policy generalizes to unseen future tasks rather than exploiting distributional overlap.
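For the error bars requested in the first comment, one standard option is a paired bootstrap over questions, since the baseline and the adapted policy answer the same questions. The sketch below is a generic statistical recipe, not something the paper reports using; `paired_bootstrap_gain` and its parameters are names chosen here for illustration.

```python
import random


def paired_bootstrap_gain(
    base: list[bool],
    adapted: list[bool],
    n_resamples: int = 2000,
    seed: int = 0,
) -> tuple[float, float, float]:
    """Return (gain, ci_low, ci_high) for accuracy(adapted) - accuracy(base).

    base[i] and adapted[i] are per-question correctness on the same question i.
    The interval is an approximate 95% percentile bootstrap interval.
    """
    if len(base) != len(adapted) or not base:
        raise ValueError("inputs must be non-empty and the same length")

    rng = random.Random(seed)
    n = len(base)
    # Per-question difference: +1 fixed by adaptation, -1 broken, 0 unchanged.
    diffs = [int(a) - int(b) for a, b in zip(adapted, base)]
    point = sum(diffs) / n

    samples = []
    for _ in range(n_resamples):
        # Resample questions with replacement, keeping each pair together.
        samples.append(sum(diffs[rng.randrange(n)] for _ in range(n)) / n)
    samples.sort()

    low = samples[int(0.025 * n_resamples)]
    high = samples[int(0.975 * n_resamples) - 1]
    return point, low, high
```

If the resulting interval excludes zero, the gain is unlikely to be explained by question-level sampling noise alone. It says nothing about variance across training seeds, which would need separate runs.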

minor comments (1)

[Abstract] The abstract states that questions are answered using only the agent's memory but does not specify how memory retrieval or conditioning is implemented at inference time; a brief description would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful feedback highlighting the need for greater clarity in the abstract and stronger evidence of generalization in Phase Two. We address each major comment below and commit to revisions that strengthen the manuscript without misrepresenting the current results.

Referee: [Abstract] The central performance claim (accuracy gains of 6.3/7.0/5.3%) is presented without any description of the baselines, statistical significance tests, error bars, data splits, or ablation isolating Phase One versus Phase Two contributions. These omissions make it impossible to assess whether the gains are load-bearing evidence for the framework.

Authors: We agree the abstract is overly concise due to length limits and omits key experimental details. The full manuscript (Sections 4 and 5) specifies the baselines (non-RL memorization policies and standard MLLM memory baselines), reports results aggregated over multiple random seeds with statistical significance testing, includes error bars, details the train/test splits used in the reformulated streaming benchmarks, and provides ablations isolating Phase One (fidelity optimization) from Phase Two (task-adaptive tuning). We will revise the abstract to briefly reference these elements and direct readers to the relevant sections. Revision: yes.

Referee: [Phase Two description, abstract] The reward model is constructed from recent environment tasks to tune the memorization policy, yet the manuscript provides no ablation or held-out evaluation that separates the recent-task distribution from the future tasks on which performance is measured. Without this, the reported improvements on the streaming benchmarks do not yet demonstrate that the policy generalizes to unseen future tasks rather than exploiting distributional overlap.

Authors: The streaming benchmarks are constructed so that tasks arrive online after the observation stream has been processed, with the reward model derived only from tasks encountered up to the current point. However, we acknowledge that the current results do not include an explicit ablation or held-out evaluation that strictly separates the recent-task distribution used for the reward model from the future tasks used for final measurement. We will add such a held-out ablation in the revision to directly demonstrate generalization beyond distributional overlap. Revision: yes.

Circularity Check

0 steps flagged

No circularity; empirical RL training procedure with independent benchmark evaluation

The paper describes TaskMem as a two-phase reinforcement learning framework for learning a memorization policy, with Phase One optimizing memory fidelity and Phase Two using a reward model derived from recent tasks to tune an adapter. Reported gains (6.3/7.0/5.3% VQA accuracy) are measured empirically on reformulated streaming benchmarks (VideoMME, EgoLife, EgoTempo) where questions are answered from memory alone. No equations, derivations, or self-citations are presented that reduce the claimed improvements to fitted parameters or inputs by construction. The method is self-contained against external benchmarks and does not invoke uniqueness theorems or ansatzes from prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies insufficient detail to enumerate free parameters, axioms, or invented entities; the reward model and adapter are introduced but their parameterization is not specified.



Reference graph

Works this paper leans on

74 extracted references · 34 canonical work pages · 19 internal anchors

[1]

The surprising effectiveness of test-time training for few-shot learning

Ekin Akyürek, Mehul Damani, Adam Zweiger, Linlu Qiu, Han Guo, Jyothish Pari, Yoon Kim, and Jacob Andreas. The surprising effectiveness of test-time training for few-shot learning. InInternational Conference on Machine Learning, pages 942–963. PMLR, 2025

2025
[2]

Refusal in language models is mediated by a single direction.Advances in Neural Information Processing Systems, 37: 136037–136083, 2024

Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. Refusal in language models is mediated by a single direction.Advances in Neural Information Processing Systems, 37: 136037–136083, 2024

2024
[3]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

Constitutional AI: Harmlessness from AI Feedback

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[5]

Hamann, Jingrui He, and Hanghang Tong

Yuanchen Bei, Tianxin Wei, Xuying Ning, Yanjun Zhao, Zhining Liu, Xiao Lin, Yada Zhu, Hendrik F. Hamann, Jingrui He, and Hanghang Tong. Mem-gallery: Benchmarking multimodal long-term conversational memory for MLLM agents.CoRR, abs/2601.03515, 2026

work page arXiv 2026
[6]

Local learning algorithms.Neural computation, 4(6):888–900, 1992

Léon Bottou and Vladimir Vapnik. Local learning algorithms.Neural computation, 4(6):888–900, 1992

1992
[7]

Telemem: Building long-term and multimodal memory for agentic AI.CoRR, abs/2601.06037, 2026

Chunliang Chen, Ming Guan, Xiao Lin, Jiaxu Li, Luxi Lin, Qiyi Wang, Xiangyu Chen, Jixiang Luo, Changzhi Sun, Dell Zhang, and Xuelong Li. Telemem: Building long-term and multimodal memory for agentic AI.CoRR, abs/2601.06037, 2026

work page arXiv 2026
[8]

Adaptive retention & correction: Test-time training for continual learning

Haoran Chen, Micah Goldblum, Zuxuan Wu, and Yu-Gang Jiang. Adaptive retention & correction: Test-time training for continual learning. InThe Thirteenth International Conference on Learning Representations, 2025

2025
[9]

Persona Vectors: Monitoring and Controlling Character Traits in Language Models

Runjin Chen, Andy Arditi, Henry Sleight, Owain Evans, and Jack Lindsey. Persona vectors: Monitoring and controlling character traits in language models.arXiv preprint arXiv:2507.21509, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[10]

PolarMem: A Training-Free Polarized Latent Graph Memory for Verifiable Vision-Language Models

Zhisheng Chen, Tingyu Wu, Zijie Zhou, Zhengwei Xie, Ziyan Weng, and Yingwei Zhang. Polarmem: A training-free polarized latent graph memory for verifiable multimodal agents.CoRR, abs/2602.00415, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[11]

Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production- ready ai agents with scalable long-term memory.arXiv preprint arXiv:2504.19413, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[12]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[13]

Parameter-efficient fine-tuning of large-scale pre-trained language models.Nature machine intelligence, 5(3):220–235, 2023

Ning Ding, Yujia Qin, Guang Yang, Fuchao Wei, Zonghan Yang, Yusheng Su, Shengding Hu, Yulin Chen, Chi-Min Chan, Weize Chen, et al. Parameter-efficient fine-tuning of large-scale pre-trained language models.Nature machine intelligence, 5(3):220–235, 2023

2023
[14]

Dytox: Transformers for continual learning with dynamic token expansion

Arthur Douillard, Alexandre Ramé, Guillaume Couairon, and Matthieu Cord. Dytox: Transformers for continual learning with dynamic token expansion. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9285–9295, 2022

2022
[15]

Toy Models of Superposition

Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, et al. Toy models of superposition.arXiv preprint arXiv:2209.10652, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[16]

Videoagent: A memory-augmented multimodal agent for video understanding

Yue Fan, Xiaojian Ma, Rujie Wu, Yuntao Du, Jiaqi Li, Zhi Gao, and Qing Li. Videoagent: A memory-augmented multimodal agent for video understanding. InComputer Vision - ECCV 2024 - 18th European Conference, Milan, Italy, September 29-October 4, 2024, Proceedings, Part XXII, Lecture Notes in Computer Science, pages 75–92. Springer, 2024

2024
[17]

Robix: A unified model for robot interaction, reasoning and planning.arXiv preprint arXiv:2509.01106, 2025

Huang Fang, Mengxi Zhang, Heng Dong, Wei Li, Zixuan Wang, Qifeng Zhang, Xueyun Tian, Yucheng Hu, and Hang Li. Robix: A unified model for robot interaction, reasoning and planning.arXiv preprint arXiv:2509.01106, 2025. 20

work page arXiv 2025
[18]

M2A: multimodal memory agent with dual-layer hybrid memory for long-term personalized interactions

Junyu Feng, Binxiao Xu, Jiayi Chen, Mengyu Dai, Cenyang Wu, Haodong Li, Bohan Zeng, Yunliu Xie, Hao Liang, Ming Lu, and Wentao Zhang. M2A: multimodal memory agent with dual-layer hybrid memory for long-term personalized interactions. CoRR, abs/2602.07624, 2026

work page arXiv 2026
[19]

Catastrophic forgetting in connectionist networks.Trends in cognitive sciences, 3(4):128–135, 1999

Robert M French. Catastrophic forgetting in connectionist networks.Trends in cognitive sciences, 3(4):128–135, 1999

1999
[20]

Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. InProceedings of the Computer Vision and PatternRecognition Conference, pages 24108–24118, 2025

2025
[21]

Neural Turing Machines

Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines.arXiv preprint arXiv:1410.5401, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[22]

Test-time training on nearest neighbors for large language models

Moritz Hardt and Yu Sun. Test-time training on nearest neighbors for large language models. InThe Twelfth International Conference on Learning Representations, 2024

2024
[23]

MA-LMM: memory-augmented large multimodal model for long-term video understanding

Bo He, Hengduo Li, Young Kyun Jang, Menglin Jia, Xuefei Cao, Ashish Shah, Abhinav Shrivastava, and Ser-Nam Lim. MA-LMM: memory-augmented large multimodal model for long-term video understanding. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024, pages 13504–13514. IEEE, 2024

2024
[24]

Storyteller: Improving long video description through global audio-visual character identification.arXiv preprint arXiv:2411.07076, 2024

Yichen He, Yuan Lin, Jianchao Wu, Hanchong Zhang, Yuchen Zhang, and Ruicheng Le. Storyteller: Improving long video description through global audio-visual character identification.arXiv preprint arXiv:2411.07076, 2024

work page arXiv 2024
[25]

A definition of agi.arXiv preprint arXiv:2510.18212, 2025

Dan Hendrycks, Dawn Song, Christian Szegedy, Honglak Lee, Yarin Gal, Erik Brynjolfsson, Sharon Li, Andy Zou, Lionel Levine, Bo Han, et al. A definition of agi.arXiv preprint arXiv:2510.18212, 2025

work page arXiv 2025
[26]

Test-time learning for large language models

Jinwu Hu, Zitian Zhang, Guohao Chen, Xutao Wen, Chao Shuai, Wei Luo, Bin Xiao, Yuanqing Li, and Mingkui Tan. Test-time learning for large language models. InForty-secondInternationalConference on MachineLearning, 2025

2025
[27]

Memory in the Age of AI Agents

Yuyang Hu, Shichun Liu, Yanwei Yue, Guibin Zhang, Boyang Liu, Fangyi Zhu, Jiahang Lin, Honglin Guo, Shihan Dou, Zhiheng Xi, et al. Memory in the age of ai agents.arXiv preprint arXiv:2512.13564, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[28]

Llm-adapters: An adapter family for parameter-efficient fine-tuning of large language models

Zhiqiang Hu, Lei Wang, Yihuai Lan, Wanyu Xu, Ee-Peng Lim, Lidong Bing, Xing Xu, Soujanya Poria, and Roy Lee. Llm-adapters: An adapter family for parameter-efficient fine-tuning of large language models. InProceedings of the 2023 conference on empirical methods in natural language processing, pages 5254–5276, 2023

2023
[29]

Wei-Chieh Huang, Weizhi Zhang, Yueqing Liang, Yuanchen Bei, Yankai Chen, Tao Feng, Xinyu Pan, Zhen Tan, Yu Wang, Tianxin Wei, Shanglin Wu, Ruiyao Xu, Liangwei Yang, Rui Yang, Wooseong Yang, Chin-Yuan Yeh, Hanrong Zhang, Haozhen Zhang, Siqi Zhu, Henry Peng Zou, Wanjia Zhao, Song Wang, Wujiang Xu, Zixuan Ke, Zheng Hui, Dawei Li, Yaozu Wu, Langzhou He, Chen ...

work page arXiv 2026
[30]

Efficiently learning at test-time: Active fine-tuning of llms

Jonas Hübotter, Sascha Bongni, Ido Hakimi, and Andreas Krause. Efficiently learning at test-time: Active fine-tuning of llms. InThe Thirteenth International Conference on Learning Representations, 2025

2025
[31]

Compact- ing, picking and growing for unforgetting continual learning.Advancesin neural information processing systems, 32, 2019

Ching-Yi Hung, Cheng-Hao Tu, Cheng-En Wu, Chien-Hung Chen, Yi-Ming Chan, and Chu-Song Chen. Compact- ing, picking and growing for unforgetting continual learning.Advancesin neural information processing systems, 32, 2019

2019
[32]

Advancing multimodal agent reasoning with long-term neuro-symbolic memory.arXiv preprint arXiv:2603.15280, 2026

Rongjie Jiang, Jianwei Wang, Gengda Zhao, Chengyang Luo, Kai Wang, and Wenjie Zhang. Advancing multimodal agent reasoning with long-term neuro-symbolic memory.arXiv preprint arXiv:2603.15280, 2026

work page arXiv 2026
[33]

Overcoming catastrophic forgetting in neural networks.Proceedings of the national academy of sciences, 114(13):3521–3526, 2017

James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks.Proceedings of the national academy of sciences, 114(13):3521–3526, 2017

2017
[34]

Optimus-1: Hybrid multimodal memory empowered agents excel in long-horizon tasks.Advances in neural information processing systems, 37: 49881–49913, 2024

Zaijing Li, Yuquan Xie, Rui Shao, Gongwei Chen, Dongmei Jiang, and Liqiang Nie. Optimus-1: Hybrid multimodal memory empowered agents excel in long-horizon tasks.Advances in neural information processing systems, 37: 49881–49913, 2024. 21

2024
[35]

Hippomm: Hippocampal-inspired multimodal memory for long audiovisual event understanding.arXiv preprint arXiv:2504.10739, 2025

Yueqian Lin, Jingyang Zhang, Qinsi Wang, Hancheng Ye, Yuzhe Fu, Yudong Liu, Hai Li, Yiran Chen, et al. Hippomm: Hippocampal-inspired multimodal memory for long audiovisual event understanding.arXiv preprint arXiv:2504.10739, 2025

work page arXiv 2025
[36]

MemVerse: Multimodal Memory for Lifelong Learning Agents

Junming Liu, Yifei Sun, Weihua Cheng, Haodong Lei, Yirong Chen, Licheng Wen, Xuemeng Yang, Daocheng Fu, Pinlong Cai, Nianchen Deng, Yi Yu, Shuyue Hu, Botian Shi, and Ding Wang. Memverse: Multimodal memory for lifelong learning agents.CoRR, abs/2512.03627, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[37]

Seeing, listening, remembering, and reasoning: A multimodal agent with long-term memory.CoRR, abs/2508.09736, 2025

Lin Long, Yichen He, Wentao Ye, Yiyuan Pan, Yuan Lin, Hang Li, Junbo Zhao, and Wei Li. Seeing, listening, remembering, and reasoning: A multimodal agent with long-term memory.CoRR, abs/2508.09736, 2025

work page arXiv 2025
[38]

Mma: Multimodal memory agent

Yihao Lu, Wanru Cheng, Zeyu Zhang, and Hao Tang. Mma: Multimodal memory agent. arXiv preprint arXiv:2602.16493, 2026

work page arXiv 2026
[39]

OpenAI. Gpt-5.2. https://openai.com/index/introducing-gpt-5-2/, 2025

2025
[40]

Steering Llama 2 via Contrastive Activation Addition

Nina Panickssery, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexander Matt Turner. Steering llama 2 via contrastive activation addition.arXiv preprint arXiv:2312.06681, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[41]

Generative agents: Interactive simulacra of human behavior

Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. InProceedings of the 36th annual acm symposium on user interface software and technology, pages 1–22, 2023

2023
[42]

The Linear Representation Hypothesis and the Geometry of Large Language Models

Kiho Park, Yo Joong Choe, and Victor Veitch. The linear representation hypothesis and the geometry of large language models. arXiv preprint arXiv:2311.03658, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[43]

Omnia de egotempo: Benchmarking temporal understanding of multi-modal llms in egocentric videos

Chiara Plizzari, Alessio Tonioni, Yongqin Xian, Achin Kulshrestha, and Federico Tombari. Omnia de egotempo: Benchmarking temporal understanding of multi-modal llms in egocentric videos. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 24129–24138, 2025

2025
[44]

Direct preference optimization: Your language model is secretly a reward model.Advancesin neural information processing systems, 36:53728–53741, 2023

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model.Advancesin neural information processing systems, 36:53728–53741, 2023

2023
[45]

Meminsight: Autonomous memory augmentation for LLM agents

Rana Salama, Jason Cai, Michelle Yuan, Anna Currey, Monica Sunkara, Yi Zhang, and Yassine Benajiba. Meminsight: Autonomous memory augmentation for LLM agents. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Proceedings ofthe2025ConferenceonEmpiricalMethods in NaturalLanguageProcessing, EMNLP2025, Suzhou, China, N...

2025
[46]

Self-critiquing models for assisting human evaluators

William Saunders, Catherine Yeh, Jeff Wu, Steven Bills, Long Ouyang, Jonathan Ward, and Jan Leike. Self- critiquing models for assisting human evaluators.arXiv preprint arXiv:2206.05802, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[47]

Bytedance Seed. Seed2. 0 model card: Towards intelligence frontier for real-world complexity. Technical report, Technical report, Bytedance, 2025. URL https://lf3-static. bytednsdoc. com ..., 2026

2025
[48]

The frame problem

Murray Shanahan. The frame problem. https://plato.stanford.edu/entries/frame-problem/, 2004

2004
[49]

OpenAI GPT-5 System Card

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[50]

Moviechat: From dense token to sparse memory for long video understanding

Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, Yan Lu, Jenq-Neng Hwang, and Gaoang Wang. Moviechat: From dense token to sparse memory for long video understanding. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA,USA, June 16-22, 2024, pa...

2024
[51]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context.arXiv preprint arXiv:2403.05530, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[52]

Gemini Robotics: Bringing AI into the Physical World

Gemini Robotics Team, Saminda Abeyruwan, Joshua Ainslie, Jean-Baptiste Alayrac, Montserrat Gonzalez Arenas, Travis Armstrong, Ashwin Balakrishna, Robert Baruch, Maria Bauza, Michiel Blokzijl, et al. Gemini robotics: Bringing ai into the physical world.arXiv preprint arXiv:2503.20020, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[53]

Steering Language Models With Activation Engineering

Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J Vazquez, Ulisse Mini, and Monte MacDiarmid. Steering language models with activation engineering.arXiv preprint arXiv:2308.10248, 2023. 22

work page internal anchor Pith review Pith/arXiv arXiv 2023
[54] Liyuan Wang, Xingxing Zhang, Hang Su, and Jun Zhu. A comprehensive survey of continual learning: Theory, method and application. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(8):5362–5383, 2024.
[55] Zihao Wang, Shaofei Cai, Guanzhou Chen, Anji Liu, Xiaojian Ma, Yitao Liang, and Team CraftJarvis. Describe, explain, plan and select: interactive planning with large language models enables open-world multi-task agents. In Proceedings of the 37th International Conference on Neural Information Processing Systems, pages 34153–34189, 2023.
[56] Zihao Wang, Shaofei Cai, Anji Liu, Yonggang Jin, Jinbing Hou, Bowei Zhang, Haowei Lin, Zhaofeng He, Zilong Zheng, Yaodong Yang, et al. Jarvis-1: Open-world multi-task agents with memory-augmented multimodal language models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(3):1894–1907, 2024.
[57] Zixuan Wang, Bo Yu, Junzhe Zhao, Wenhao Sun, Sai Hou, Shuai Liang, Xing Hu, Yinhe Han, and Yiming Gan. Karma: Augmenting embodied ai agents with long-and-short term memory systems. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 1–8. IEEE, 2025.
[58] Zidi Xiong, Yuping Lin, Wenya Xie, Pengfei He, Jiliang Tang, Himabindu Lakkaraju, and Zhen Xiang. How memory management impacts LLM agents: An empirical study of experience-following behavior. CoRR, abs/2505.16067, 2025.
[59] Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, et al. Qwen3-omni technical report. arXiv preprint arXiv:2509.17765, 2025.
[60] Jingkang Yang, Shuai Liu, Hongming Guo, Yuhao Dong, Xiamengwei Zhang, Sicheng Zhang, Pengyun Wang, Zitang Zhou, Binzhu Xie, Ziyue Wang, et al. Egolife: Towards egocentric life assistant. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 28885–28900, 2025.
[61] Yijun Yang, Tianyi Zhou, Kanxue Li, Dapeng Tao, Lusong Li, Li Shen, Xiaodong He, Jing Jiang, and Yuhui Shi. Embodied multi-modal agent trained by an llm from a parallel textworld. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 26265–26275. IEEE, 2024.
[62] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR), 2023.
[63] Woongyeong Yeo, Kangsan Kim, Jaehong Yoon, and Sung Ju Hwang. Worldmm: Dynamic multimodal memory agent for long video reasoning. CoRR, abs/2512.02425, 2025. doi: 10.48550/ARXIV.2512.02425. URL https://doi.org/10.48550/arXiv.2512.02425.
[64] Liping Yuan, Jiawei Wang, Haomiao Sun, Yuchen Zhang, and Yuan Lin. Tarsier2: Advancing large vision-language models from detailed video description to comprehensive video understanding. arXiv preprint arXiv:2501.07888, 2025.
[65] Chujie Zheng, Kai Dang, Bowen Yu, Mingze Li, Huiqiang Jiang, Junrong Lin, Yuqiong Liu, Hao Lin, Chencan Wu, Feng Hu, et al. Stabilizing reinforcement learning with llms: Formulation and practices. arXiv preprint arXiv:2512.01374, 2025.
[66] Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. Group sequence policy optimization. arXiv preprint arXiv:2507.18071, 2025.
[67] Xinyu Zhu, Mengzhou Xia, Zhepei Wei, Wei-Lin Chen, Danqi Chen, and Yu Meng. The surprising effectiveness of negative reinforcement in llm reasoning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.

Appendix

A Implementation Details of Tools

Here, we provide the implementation details of the tools for representation ...
[71]

Characters' Contextual Behavior: Describe the characters' roles in the scene or their interaction with other characters, focusing on their behavior, emotional state, or relationships. Strict Requirements: - If a character has an associated face ID in the video, refer to them ONLY using that face ID. - If characters DO NOT have associated face IDs in the who...
[72]

Characters' Appearance: Describe the characters' appearance, such as their clothing, facial features, or any distinguishing characteristics
[73]

Characters' Actions & Movements: Describe specific gestures, movements, or interactions performed by the characters
[74]

Characters' Spoken Dialogue: Transcribe or summarize what is spoken by the characters
[75]

Characters' Contextual Behavior: Describe the characters' roles in the scene or their interaction with other characters, focusing on their behavior, emotional state, or relationships. Strict Requirements: - If a character has an associated face ID in the video, refer to them ONLY using that face ID. - If characters DO NOT have associated face IDs in the who...
[76]

Whether the candidate description is factually accurate based only on visual content and subtitles (ignore audio)
[77]

Whether it connects coherently and naturally with the preceding description, without using transition words such as "continue". For any spoken content, verify it solely against the displayed subtitles and disregard audio information. Assign exactly one label: 1: Correct — The description that meets all of the above criteria. 0: Incorrect — Any description...
