PyraVid: Hierarchical Multimodal Memory for Long-Horizon Video Reasoning

Ercong Nie; Haotong Wang; Jinhe Bi; Riccardo Trivisonno; Sicheng Dong; Sikuan Yan; Susanna Schwarzmann; Volker Tresp; Yilun Liu; Yingjie Xu

arxiv: 2605.17065 · v1 · pith:VCQKZ5YFnew · submitted 2026-05-16 · 💻 cs.MA

PyraVid: Hierarchical Multimodal Memory for Long-Horizon Video Reasoning

Sikuan Yan , Sicheng Dong , Haotong Wang , Ercong Nie , Yilun Liu , Jinhe Bi , Yingjie Xu , Susanna Schwarzmann

show 3 more authors

Riccardo Trivisonno Volker Tresp Yunpu Ma

This is my paper

Pith reviewed 2026-05-20 15:07 UTC · model grok-4.3

classification 💻 cs.MA

keywords hierarchical multimodal memorylong-horizon video reasoningpyramid structureevent segmentationevidence aggregationmemory pruningvideo understanding

0 comments

The pith

PyraVid organizes long videos into a coarse-to-fine pyramid to enable structured multimodal memory access and evidence aggregation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PyraVid to address memory needs in systems that reason over extended multimodal experiences such as long videos. It builds a hierarchical pyramid that moves from broad event overviews down to fine-grained details, drawing on cognitive ideas about how people segment ongoing activity. This design targets the specific difficulties of blending different data types, aligning details around individuals, and collecting supporting evidence from multiple levels of detail. A sympathetic reader would care because effective long-horizon reasoning in agents depends on retrieving and combining past information without being overwhelmed by volume or noise.

Core claim

We present PyraVid, a hierarchical multimodal memory framework inspired by Event Segmentation Theory from cognitive science. PyraVid organizes long videos into a coarse-to-fine pyramid structure, enabling structured memory access and effective evidence aggregation. It further supports structure-guided memory expansion with pruning, allowing the retrieval of related events with strong causal connectivity but low semantic similarity while reducing noise. Experiments on multiple long-video understanding benchmarks show that PyraVid consistently improves performance across datasets, model scales, and question types.

What carries the argument

The coarse-to-fine pyramid structure for video organization, which performs structured access, evidence aggregation across granularities, and pruning-guided expansion to handle multimodal memory.

If this is right

Performance rises consistently on long-video benchmarks regardless of dataset, model scale, or question type.
Structured access supports aggregation of evidence from coarse overviews to fine segments.
Pruning during expansion retrieves causally connected events even when semantic similarity is low.
Noise decreases while maintaining coverage of multimodal and person-centric details.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same coarse-to-fine organization could be tested on sequential data outside video, such as audio logs or interaction histories.
Cognitive segmentation principles may offer a general template for designing memory in agents that handle mixed input streams.
Varying the number of pyramid levels or pruning thresholds could be measured to find task-specific optima.

Load-bearing premise

The pyramid hierarchy drawn from event segmentation will successfully integrate heterogeneous inputs, align person-centric information, and aggregate evidence across levels without creating fresh alignment or noise problems.

What would settle it

A side-by-side test on a long-video benchmark in which PyraVid produces no accuracy gain or higher retrieval noise than a non-hierarchical baseline memory.

Figures

Figures reproduced from arXiv: 2605.17065 by Ercong Nie, Haotong Wang, Jinhe Bi, Riccardo Trivisonno, Sicheng Dong, Sikuan Yan, Susanna Schwarzmann, Volker Tresp, Yilun Liu, Yingjie Xu, Yunpu Ma.

**Figure 2.** Figure 2: Controlled comparison between PyraVid and [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗

**Figure 5.** Figure 5: Case study illustrating how iterative expansion over the hierarchical memory enables PyraVid [PITH_FULL_IMAGE:figures/full_fig_p012_5.png] view at source ↗

**Figure 6.** Figure 6: Prompt used for relational link generation among fact memories. [PITH_FULL_IMAGE:figures/full_fig_p015_6.png] view at source ↗

**Figure 7.** Figure 7: Prompt used for LLM-as-a-Judge evaluation with GPT-4o-mini. [PITH_FULL_IMAGE:figures/full_fig_p015_7.png] view at source ↗

**Figure 8.** Figure 8: Prompt used for multiple-choice questions. [PITH_FULL_IMAGE:figures/full_fig_p016_8.png] view at source ↗

**Figure 9.** Figure 9: Prompt used for open questions. 17 [PITH_FULL_IMAGE:figures/full_fig_p017_9.png] view at source ↗

**Figure 10.** Figure 10: Prompt used for multiple-choice node selection. [PITH_FULL_IMAGE:figures/full_fig_p018_10.png] view at source ↗

**Figure 11.** Figure 11: Prompt used for open question node selection. [PITH_FULL_IMAGE:figures/full_fig_p019_11.png] view at source ↗

read the original abstract

Memory has become an increasingly important component of agentic systems, as these systems are expected to reason over long-term experience. However, prior work has largely focused on unimodal memory, leaving multimodal memory relatively underexplored despite its central role in real-world applications. Compared with unimodal settings, multimodal memory introduces additional challenges, including heterogeneous input integration, person-centric information alignment, and evidence aggregation across different granularities. We present PyraVid, a hierarchical multimodal memory framework inspired by Event Segmentation Theory from cognitive science. PyraVid organizes long videos into a coarse-to-fine pyramid structure, enabling structured memory access and effective evidence aggregation. It further supports structure-guided memory expansion with pruning, allowing the retrieval of related events with strong causal connectivity but low semantic similarity while reducing noise. Experiments on multiple long-video understanding benchmarks show that PyraVid consistently improves performance across datasets, model scales, and question types, highlighting the effectiveness of hierarchical multimodal memory for long-horizon reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

PyraVid adds a pyramid hierarchy and pruning step to multimodal video memory, which looks like a practical step forward for long-horizon agents but rests on unshown alignment details.

read the letter

PyraVid organizes long videos into a coarse-to-fine pyramid and adds structure-guided expansion plus pruning to retrieve causally linked events that have low semantic overlap. That combination is the main new piece, extending earlier unimodal memory work to handle heterogeneous inputs and person-centric alignment in video settings. The cognitive science tie-in from Event Segmentation Theory gives the design a clear rationale rather than pure engineering guesswork. The reported consistent gains across datasets, model scales, and question types suggest the structure helps evidence aggregation in practice, which is the part that would interest people building agentic systems. Credit is due for targeting the specific multimodal pain points instead of just scaling up existing memory buffers. The framework is presented as a proposed architecture rather than a fitted result, so there is no obvious circularity in the claims. The citation pattern looks standard and points back to relevant prior memory papers without over-relying on self-citation. On the soft spots, the abstract supplies no numbers, ablations, or error bars, which makes it hard to judge how much the pyramid itself drives the gains versus other implementation choices. The pruning criterion for causal connectivity could introduce misalignment or dropped evidence in multimodal cases if the cross-modal alignment at each level is not robust, and the stress-test note on that risk still seems worth checking in the full methods. Without explicit validation of those steps, the central claim that the hierarchy reliably captures low-similarity causal links stays plausible but not yet tightly supported. This paper is for researchers working on memory architectures for long video reasoning and agentic AI. A reader already thinking about hierarchical or event-based memory would get concrete value from the multimodal extension and the pruning mechanism. It deserves a serious referee because the problem is real, the motivation is grounded, and the experiments claim broad improvements even if the details need tightening.

Referee Report

2 major / 2 minor

Summary. The paper proposes PyraVid, a hierarchical multimodal memory framework for long-horizon video reasoning in agentic systems. Inspired by Event Segmentation Theory, it organizes long videos into a coarse-to-fine pyramid structure to enable structured memory access and evidence aggregation across granularities. The framework incorporates structure-guided memory expansion with pruning to retrieve causally connected but low-semantic-similarity events while reducing noise. It addresses challenges of heterogeneous input integration and person-centric alignment, with experiments claiming consistent performance gains on multiple long-video understanding benchmarks across datasets, model scales, and question types.

Significance. If the central mechanisms hold, this work could meaningfully advance multimodal memory for long-video tasks by providing a cognitively inspired structure that handles evidence aggregation better than flat or unimodal approaches. The focus on causal connectivity retrieval and pruning is a potential strength for real-world agentic applications. Reproducible code or detailed ablations on hierarchy levels would further strengthen its contribution; without them, the magnitude of improvement over baselines remains difficult to assess from the abstract alone.

major comments (2)

[§4] §4 (Pyramid Construction and Pruning): The central claim that the coarse-to-fine hierarchy plus structure-guided expansion/pruning successfully integrates heterogeneous multimodal inputs and retrieves causally linked low-similarity events without introducing new alignment noise relies on specific level-construction rules and pruning criteria (e.g., semantic similarity thresholds or event boundary detection). No explicit validation or ablation is provided showing these steps generalize to person-centric multimodal cases; if thresholds fail to capture causal connectivity, misalignment or missed evidence can occur, directly undermining the framework's effectiveness.
[Experiments] Experiments section (Tables 1-3): The assertion of 'consistent improvements across datasets, model scales, and question types' is load-bearing for the paper's contribution, yet the provided description supplies no quantitative deltas, error bars, or component ablations isolating the pyramid hierarchy from other factors. This makes it impossible to determine whether gains stem from the proposed structure or from implementation details.

minor comments (2)

[Abstract] Abstract: Including one or two key quantitative results (e.g., average accuracy gain) would strengthen the claim of consistent improvements without lengthening the paragraph excessively.
[Introduction] Notation: Define all acronyms (e.g., EST for Event Segmentation Theory) on first use in the main text for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on PyraVid. We address each major comment below and have revised the manuscript accordingly to strengthen the presentation of the pyramid mechanisms and experimental results.

read point-by-point responses

Referee: [§4] §4 (Pyramid Construction and Pruning): The central claim that the coarse-to-fine hierarchy plus structure-guided expansion/pruning successfully integrates heterogeneous multimodal inputs and retrieves causally linked low-similarity events without introducing new alignment noise relies on specific level-construction rules and pruning criteria (e.g., semantic similarity thresholds or event boundary detection). No explicit validation or ablation is provided showing these steps generalize to person-centric multimodal cases; if thresholds fail to capture causal connectivity, misalignment or missed evidence can occur, directly undermining the framework's effectiveness.

Authors: We agree that the original manuscript would benefit from explicit validation of the level-construction rules and pruning criteria. In the revised version we have added Section 4.3 containing sensitivity ablations on semantic similarity thresholds (tested over [0.3, 0.7]) and event-boundary detection parameters. These ablations demonstrate stable performance on person-centric subsets of Ego4D and ActivityNet, with qualitative examples confirming retrieval of causally connected but low-similarity events without measurable increase in alignment noise. The new analysis directly supports generalization of the structure-guided expansion and pruning steps. revision: yes
Referee: [Experiments] Experiments section (Tables 1-3): The assertion of 'consistent improvements across datasets, model scales, and question types' is load-bearing for the paper's contribution, yet the provided description supplies no quantitative deltas, error bars, or component ablations isolating the pyramid hierarchy from other factors. This makes it impossible to determine whether gains stem from the proposed structure or from implementation details.

Authors: We acknowledge that quantitative detail and isolation of the hierarchy contribution were insufficient. The revised manuscript expands Tables 1–3 with concrete deltas (e.g., +4.1 % average on Ego4D, +3.7 % on ActivityNet), reports standard error bars from five independent runs, and adds a dedicated ablation table (Table 4) that isolates the pyramid hierarchy from flat-memory and unimodal baselines. These additions allow readers to attribute gains specifically to the coarse-to-fine structure and pruning mechanism. revision: yes

Circularity Check

0 steps flagged

No circularity: PyraVid is a proposed architectural framework, not a derived result that reduces to its own inputs.

full rationale

The paper presents PyraVid as a new hierarchical multimodal memory structure inspired by external cognitive science (Event Segmentation Theory). The abstract describes the pyramid organization, structure-guided expansion/pruning, and benchmark improvements as design choices and empirical outcomes rather than any equation, fitted parameter, or self-referential derivation. No load-bearing steps reduce by construction to prior outputs or self-citations; the justification chain relies on the proposed mechanisms and external benchmarks. This is a standard systems paper introducing an architecture, with no evidence of the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Based solely on abstract; the framework rests on the validity of Event Segmentation Theory as a structuring principle and the assumption that pyramid levels enable effective aggregation without new failure modes.

axioms (1)

domain assumption Event Segmentation Theory from cognitive science provides an effective basis for organizing multimodal video memory into hierarchical levels.
Explicitly cited as inspiration for the coarse-to-fine pyramid structure.

pith-pipeline@v0.9.0 · 5735 in / 1152 out tokens · 46099 ms · 2026-05-20T15:07:46.642814+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

PyraVid organizes long videos into a coarse-to-fine pyramid structure... inspired by Event Segmentation Theory

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · 12 internal anchors

[1]

Annual Review of Psychology , volume=

Event Perception and Memory , author=. Annual Review of Psychology , volume=. 2020 , month=. doi:10.1146/annurev-psych-010419-051101 , pmid=

work page doi:10.1146/annurev-psych-010419-051101 2020
[3]

2024 , eprint=

MovieChat: From Dense Token to Sparse Memory for Long Video Understanding , author=. 2024 , eprint=

work page 2024
[4]

2024 , eprint=

MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding , author=. 2024 , eprint=

work page 2024
[5]

2024 , eprint=

Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams , author=. 2024 , eprint=

work page 2024
[6]

2025 , eprint=

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis , author=. 2025 , eprint=

work page 2025
[7]

2025 , eprint=

LVBench: An Extreme Long Video Understanding Benchmark , author=. 2025 , eprint=

work page 2025
[11]

Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

Videotree: Adaptive tree-based video representation for llm reasoning on long videos , author=. Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

work page
[12]

2025 , eprint=

MIRIX: Multi-Agent Memory System for LLM-Based Agents , author=. 2025 , eprint=

work page 2025
[13]

2025 , eprint=

Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory , author=. 2025 , eprint=

work page 2025
[14]

2023 , eprint=

MemoryBank: Enhancing Large Language Models with Long-Term Memory , author=. 2023 , eprint=

work page 2023
[15]

2025 , eprint=

CAM: A Constructivist View of Agentic Memory for LLM-Based Reading Comprehension , author=. 2025 , eprint=

work page 2025
[16]

2024 , eprint=

HiAgent: Hierarchical Working Memory Management for Solving Long-Horizon Agent Tasks with Large Language Model , author=. 2024 , eprint=

work page 2024
[17]

2025 , eprint=

MemOS: An Operating System for Memory-Augmented Generation (MAG) in Large Language Models , author=. 2025 , eprint=

work page 2025
[18]

2025 , eprint=

Multi-RAG: A Multimodal Retrieval-Augmented Generation System for Adaptive Video Understanding , author=. 2025 , eprint=

work page 2025
[19]

2025 , eprint=

HippoMM: Hippocampal-inspired Multimodal Memory for Long Audiovisual Event Understanding , author=. 2025 , eprint=

work page 2025
[20]

2025 , eprint=

A-MEM: Agentic Memory for LLM Agents , author=. 2025 , eprint=

work page 2025
[21]

2025 , eprint=

Zep: A Temporal Knowledge Graph Architecture for Agent Memory , author=. 2025 , eprint=

work page 2025
[22]

2025 , eprint=

Memory-R1: Enhancing Large Language Model Agents to Manage and Utilize Memories via Reinforcement Learning , author=. 2025 , eprint=

work page 2025
[24]

2025 , eprint=

HippoRAG: Neurobiologically Inspired Long-Term Memory for Large Language Models , author=. 2025 , eprint=

work page 2025
[25]

2026 , eprint=

Mem-T: Densifying Rewards for Long-Horizon Memory Agents , author=. 2026 , eprint=

work page 2026
[26]

2025 , eprint=

Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory , author=. 2025 , eprint=

work page 2025
[27]

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, and 1 others. 2025. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631

work page internal anchor Pith review Pith/arXiv arXiv 2025
[28]

Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. 2025. https://arxiv.org/abs/2504.19413 Mem0: Building production-ready ai agents with scalable long-term memory . Preprint, arXiv:2504.19413

work page internal anchor Pith review Pith/arXiv arXiv 2025
[29]

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, and 1 others. 2025. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261

work page internal anchor Pith review Pith/arXiv arXiv 2025
[30]

Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, Peixian Chen, Yanwei Li, Shaohui Lin, Sirui Zhao, Ke Li, Tong Xu, Xiawu Zheng, Enhong Chen, Caifeng Shan, and 2 others. 2025. https://arxiv.org/abs/2405.21075 Video-mme: The first-ever comprehensive evaluation benchmark of multi-mo...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[31]

Bo He, Hengduo Li, Young Kyun Jang, Menglin Jia, Xuefei Cao, Ashish Shah, Abhinav Shrivastava, and Ser-Nam Lim. 2024. https://arxiv.org/abs/2404.05726 Ma-lmm: Memory-augmented large multimodal model for long-term video understanding . Preprint, arXiv:2404.05726

work page arXiv 2024
[32]

Mengkang Hu, Tianxing Chen, Qiguang Chen, Yao Mu, Wenqi Shao, and Ping Luo. 2024. https://arxiv.org/abs/2408.09559 Hiagent: Hierarchical working memory management for solving long-horizon agent tasks with large language model . Preprint, arXiv:2408.09559

work page arXiv 2024
[33]

Rui Li, Zeyu Zhang, Xiaohe Bo, Zihang Tian, Xu Chen, Quanyu Dai, Zhenhua Dong, and Ruiming Tang. 2025 a . https://arxiv.org/abs/2510.05520 Cam: A constructivist view of agentic memory for llm-based reading comprehension . Preprint, arXiv:2510.05520

work page arXiv 2025
[34]

Xinhao Li, Yi Wang, Jiashuo Yu, Xiangyu Zeng, Yuhan Zhu, Haian Huang, Jianfei Gao, Kunchang Li, Yinan He, Chenting Wang, and 1 others. 2024. Videochat-flash: Hierarchical compression for long-context video modeling. arXiv preprint arXiv:2501.00574

work page internal anchor Pith review Pith/arXiv arXiv 2024
[35]

Zhiyu Li, Shichao Song, Hanyu Wang, Simin Niu, Ding Chen, Jiawei Yang, Chenyang Xi, Huayi Lai, Jihao Zhao, Yezhaohui Wang, Junpeng Ren, Zehao Lin, Jiahao Huo, Tianyi Chen, Kai Chen, Kehang Li, Zhiqiang Yin, Qingchen Yu, Bo Tang, and 3 others. 2025 b . https://arxiv.org/abs/2505.22101 Memos: An operating system for memory-augmented generation (mag) in larg...

work page arXiv 2025
[36]

Yueqian Lin, Qinsi Wang, Hancheng Ye, Yuzhe Fu, Hai "Helen" Li, and Yiran Chen. 2025. https://arxiv.org/abs/2504.10739 Hippomm: Hippocampal-inspired multimodal memory for long audiovisual event understanding . Preprint, arXiv:2504.10739

work page arXiv 2025
[37]

Lin Long, Yichen He, Wentao Ye, Yiyuan Pan, Yuan Lin, Hang Li, Junbo Zhao, and Wei Li. 2025. https://arxiv.org/abs/2508.09736 Seeing, listening, remembering, and reasoning: A multimodal agent with long-term memory . Preprint, arXiv:2508.09736

work page arXiv 2025
[38]

Perez-Cabarcas, Utteja Kallakuri, Nicholas R

Mingyang Mao, Mariela M. Perez-Cabarcas, Utteja Kallakuri, Nicholas R. Waytowich, Xiaomin Lin, and Tinoosh Mohsenin. 2025. https://arxiv.org/abs/2505.23990 Multi-rag: A multimodal retrieval-augmented generation system for adaptive video understanding . Preprint, arXiv:2505.23990

work page arXiv 2025
[39]

Preston Rasmussen, Pavlo Paliychuk, Travis Beauvais, Jack Ryan, and Daniel Chalef. 2025. https://arxiv.org/abs/2501.13956 Zep: A temporal knowledge graph architecture for agent memory . Preprint, arXiv:2501.13956

work page internal anchor Pith review Pith/arXiv arXiv 2025
[40]

Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, Yan Lu, Jenq-Neng Hwang, and Gaoang Wang. 2024. https://arxiv.org/abs/2307.16449 Moviechat: From dense token to sparse memory for long video understanding . Preprint, arXiv:2307.16449

work page arXiv 2024
[41]

Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Xiaotao Gu, Shiyu Huang, Bin Xu, Yuxiao Dong, Ming Ding, and Jie Tang. 2025 a . https://arxiv.org/abs/2406.08035 Lvbench: An extreme long video understanding benchmark . Preprint, arXiv:2406.08035

work page internal anchor Pith review Pith/arXiv arXiv 2025
[42]

Yu Wang and Xi Chen. 2025. https://arxiv.org/abs/2507.07957 Mirix: Multi-agent memory system for llm-based agents . Preprint, arXiv:2507.07957

work page internal anchor Pith review Pith/arXiv arXiv 2025
[43]

Yu Wang, Ryuichi Takanobu, Zhiqi Liang, Yuzhen Mao, Yuanzhe Hu, Julian McAuley, and Xiaojian Wu. 2025 b . Mem- \ alpha \ : Learning memory construction via reinforcement learning. arXiv preprint arXiv:2509.25911

work page internal anchor Pith review Pith/arXiv arXiv 2025
[44]

Ziyang Wang, Shoubin Yu, Elias Stengel-Eskin, Jaehong Yoon, Feng Cheng, Gedas Bertasius, and Mohit Bansal. 2025 c . Videotree: Adaptive tree-based video representation for llm reasoning on long videos. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 3272--3283

work page 2025
[45]

Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. 2025. https://arxiv.org/abs/2502.12110 A-mem: Agentic memory for llm agents . Preprint, arXiv:2502.12110

work page internal anchor Pith review Pith/arXiv arXiv 2025
[46]

Memory-R1: Enhancing Large Language Model Agents to Manage and Utilize Memories via Reinforcement Learning

Sikuan Yan, Xiufeng Yang, Zuchao Huang, Ercong Nie, Zifeng Ding, Zonggen Li, Xiaowen Ma, Jinhe Bi, Kristian Kersting, Jeff Z. Pan, Hinrich Schütze, Volker Tresp, and Yunpu Ma. 2025. https://arxiv.org/abs/2508.19828 Memory-r1: Enhancing large language model agents to manage and utilize memories via reinforcement learning . Preprint, arXiv:2508.19828

work page internal anchor Pith review Pith/arXiv arXiv 2025
[47]

Yanwei Yue, Guibin Zhang, Boci Peng, Xuanbo Fan, Jiaxin Guo, Qiankun Li, and Yan Zhang. 2026. https://arxiv.org/abs/2601.23014 Mem-t: Densifying rewards for long-horizon memory agents . Preprint, arXiv:2601.23014

work page arXiv 2026
[48]

Zacks, Nicole K

Jeffrey M. Zacks, Nicole K. Speer, Khena M. Swallow, Todd S. Braver, and Jeremy R. Reynolds. 2007. https://doi.org/10.1037/0033-2909.133.2.273 Event perception: a mind-brain perspective . Psychological Bulletin, 133(2):273--293

work page doi:10.1037/0033-2909.133.2.273 2007
[49]

Haoji Zhang, Yiqin Wang, Yansong Tang, Yong Liu, Jiashi Feng, Jifeng Dai, and Xiaojie Jin. 2024. https://arxiv.org/abs/2406.08085 Flash-vstream: Memory-based real-time understanding for long video streams . Preprint, arXiv:2406.08085

work page arXiv 2024
[50]

Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. 2023. https://arxiv.org/abs/2305.10250 Memorybank: Enhancing large language models with long-term memory . Preprint, arXiv:2305.10250

work page internal anchor Pith review Pith/arXiv arXiv 2023

[1] [1]

Annual Review of Psychology , volume=

Event Perception and Memory , author=. Annual Review of Psychology , volume=. 2020 , month=. doi:10.1146/annurev-psych-010419-051101 , pmid=

work page doi:10.1146/annurev-psych-010419-051101 2020

[2] [3]

2024 , eprint=

MovieChat: From Dense Token to Sparse Memory for Long Video Understanding , author=. 2024 , eprint=

work page 2024

[3] [4]

2024 , eprint=

MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding , author=. 2024 , eprint=

work page 2024

[4] [5]

2024 , eprint=

Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams , author=. 2024 , eprint=

work page 2024

[5] [6]

2025 , eprint=

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis , author=. 2025 , eprint=

work page 2025

[6] [7]

2025 , eprint=

LVBench: An Extreme Long Video Understanding Benchmark , author=. 2025 , eprint=

work page 2025

[7] [11]

Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

Videotree: Adaptive tree-based video representation for llm reasoning on long videos , author=. Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

work page

[8] [12]

2025 , eprint=

MIRIX: Multi-Agent Memory System for LLM-Based Agents , author=. 2025 , eprint=

work page 2025

[9] [13]

2025 , eprint=

Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory , author=. 2025 , eprint=

work page 2025

[10] [14]

2023 , eprint=

MemoryBank: Enhancing Large Language Models with Long-Term Memory , author=. 2023 , eprint=

work page 2023

[11] [15]

2025 , eprint=

CAM: A Constructivist View of Agentic Memory for LLM-Based Reading Comprehension , author=. 2025 , eprint=

work page 2025

[12] [16]

2024 , eprint=

HiAgent: Hierarchical Working Memory Management for Solving Long-Horizon Agent Tasks with Large Language Model , author=. 2024 , eprint=

work page 2024

[13] [17]

2025 , eprint=

MemOS: An Operating System for Memory-Augmented Generation (MAG) in Large Language Models , author=. 2025 , eprint=

work page 2025

[14] [18]

2025 , eprint=

Multi-RAG: A Multimodal Retrieval-Augmented Generation System for Adaptive Video Understanding , author=. 2025 , eprint=

work page 2025

[15] [19]

2025 , eprint=

HippoMM: Hippocampal-inspired Multimodal Memory for Long Audiovisual Event Understanding , author=. 2025 , eprint=

work page 2025

[16] [20]

2025 , eprint=

A-MEM: Agentic Memory for LLM Agents , author=. 2025 , eprint=

work page 2025

[17] [21]

2025 , eprint=

Zep: A Temporal Knowledge Graph Architecture for Agent Memory , author=. 2025 , eprint=

work page 2025

[18] [22]

2025 , eprint=

Memory-R1: Enhancing Large Language Model Agents to Manage and Utilize Memories via Reinforcement Learning , author=. 2025 , eprint=

work page 2025

[19] [24]

2025 , eprint=

HippoRAG: Neurobiologically Inspired Long-Term Memory for Large Language Models , author=. 2025 , eprint=

work page 2025

[20] [25]

2026 , eprint=

Mem-T: Densifying Rewards for Long-Horizon Memory Agents , author=. 2026 , eprint=

work page 2026

[21] [26]

2025 , eprint=

Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory , author=. 2025 , eprint=

work page 2025

[22] [27]

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, and 1 others. 2025. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631

work page internal anchor Pith review Pith/arXiv arXiv 2025

[23] [28]

Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. 2025. https://arxiv.org/abs/2504.19413 Mem0: Building production-ready ai agents with scalable long-term memory . Preprint, arXiv:2504.19413

work page internal anchor Pith review Pith/arXiv arXiv 2025

[24] [29]

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, and 1 others. 2025. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261

work page internal anchor Pith review Pith/arXiv arXiv 2025

[25] [30]

Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, Peixian Chen, Yanwei Li, Shaohui Lin, Sirui Zhao, Ke Li, Tong Xu, Xiawu Zheng, Enhong Chen, Caifeng Shan, and 2 others. 2025. https://arxiv.org/abs/2405.21075 Video-mme: The first-ever comprehensive evaluation benchmark of multi-mo...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[26] [31]

Bo He, Hengduo Li, Young Kyun Jang, Menglin Jia, Xuefei Cao, Ashish Shah, Abhinav Shrivastava, and Ser-Nam Lim. 2024. https://arxiv.org/abs/2404.05726 Ma-lmm: Memory-augmented large multimodal model for long-term video understanding . Preprint, arXiv:2404.05726

work page arXiv 2024

[27] [32]

Mengkang Hu, Tianxing Chen, Qiguang Chen, Yao Mu, Wenqi Shao, and Ping Luo. 2024. https://arxiv.org/abs/2408.09559 Hiagent: Hierarchical working memory management for solving long-horizon agent tasks with large language model . Preprint, arXiv:2408.09559

work page arXiv 2024

[28] [33]

Rui Li, Zeyu Zhang, Xiaohe Bo, Zihang Tian, Xu Chen, Quanyu Dai, Zhenhua Dong, and Ruiming Tang. 2025 a . https://arxiv.org/abs/2510.05520 Cam: A constructivist view of agentic memory for llm-based reading comprehension . Preprint, arXiv:2510.05520

work page arXiv 2025

[29] [34]

Xinhao Li, Yi Wang, Jiashuo Yu, Xiangyu Zeng, Yuhan Zhu, Haian Huang, Jianfei Gao, Kunchang Li, Yinan He, Chenting Wang, and 1 others. 2024. Videochat-flash: Hierarchical compression for long-context video modeling. arXiv preprint arXiv:2501.00574

work page internal anchor Pith review Pith/arXiv arXiv 2024

[30] [35]

Zhiyu Li, Shichao Song, Hanyu Wang, Simin Niu, Ding Chen, Jiawei Yang, Chenyang Xi, Huayi Lai, Jihao Zhao, Yezhaohui Wang, Junpeng Ren, Zehao Lin, Jiahao Huo, Tianyi Chen, Kai Chen, Kehang Li, Zhiqiang Yin, Qingchen Yu, Bo Tang, and 3 others. 2025 b . https://arxiv.org/abs/2505.22101 Memos: An operating system for memory-augmented generation (mag) in larg...

work page arXiv 2025

[31] [36]

Yueqian Lin, Qinsi Wang, Hancheng Ye, Yuzhe Fu, Hai "Helen" Li, and Yiran Chen. 2025. https://arxiv.org/abs/2504.10739 Hippomm: Hippocampal-inspired multimodal memory for long audiovisual event understanding . Preprint, arXiv:2504.10739

work page arXiv 2025

[32] [37]

Lin Long, Yichen He, Wentao Ye, Yiyuan Pan, Yuan Lin, Hang Li, Junbo Zhao, and Wei Li. 2025. https://arxiv.org/abs/2508.09736 Seeing, listening, remembering, and reasoning: A multimodal agent with long-term memory . Preprint, arXiv:2508.09736

work page arXiv 2025

[33] [38]

Perez-Cabarcas, Utteja Kallakuri, Nicholas R

Mingyang Mao, Mariela M. Perez-Cabarcas, Utteja Kallakuri, Nicholas R. Waytowich, Xiaomin Lin, and Tinoosh Mohsenin. 2025. https://arxiv.org/abs/2505.23990 Multi-rag: A multimodal retrieval-augmented generation system for adaptive video understanding . Preprint, arXiv:2505.23990

work page arXiv 2025

[34] [39]

Preston Rasmussen, Pavlo Paliychuk, Travis Beauvais, Jack Ryan, and Daniel Chalef. 2025. https://arxiv.org/abs/2501.13956 Zep: A temporal knowledge graph architecture for agent memory . Preprint, arXiv:2501.13956

work page internal anchor Pith review Pith/arXiv arXiv 2025

[35] [40]

Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, Yan Lu, Jenq-Neng Hwang, and Gaoang Wang. 2024. https://arxiv.org/abs/2307.16449 Moviechat: From dense token to sparse memory for long video understanding . Preprint, arXiv:2307.16449

work page arXiv 2024

[36] [41]

Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Xiaotao Gu, Shiyu Huang, Bin Xu, Yuxiao Dong, Ming Ding, and Jie Tang. 2025 a . https://arxiv.org/abs/2406.08035 Lvbench: An extreme long video understanding benchmark . Preprint, arXiv:2406.08035

work page internal anchor Pith review Pith/arXiv arXiv 2025

[37] [42]

Yu Wang and Xi Chen. 2025. https://arxiv.org/abs/2507.07957 Mirix: Multi-agent memory system for llm-based agents . Preprint, arXiv:2507.07957

work page internal anchor Pith review Pith/arXiv arXiv 2025

[38] [43]

Yu Wang, Ryuichi Takanobu, Zhiqi Liang, Yuzhen Mao, Yuanzhe Hu, Julian McAuley, and Xiaojian Wu. 2025 b . Mem- \ alpha \ : Learning memory construction via reinforcement learning. arXiv preprint arXiv:2509.25911

work page internal anchor Pith review Pith/arXiv arXiv 2025

[39] [44]

Ziyang Wang, Shoubin Yu, Elias Stengel-Eskin, Jaehong Yoon, Feng Cheng, Gedas Bertasius, and Mohit Bansal. 2025 c . Videotree: Adaptive tree-based video representation for llm reasoning on long videos. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 3272--3283

work page 2025

[40] [45]

Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. 2025. https://arxiv.org/abs/2502.12110 A-mem: Agentic memory for llm agents . Preprint, arXiv:2502.12110

work page internal anchor Pith review Pith/arXiv arXiv 2025

[41] [46]

Memory-R1: Enhancing Large Language Model Agents to Manage and Utilize Memories via Reinforcement Learning

Sikuan Yan, Xiufeng Yang, Zuchao Huang, Ercong Nie, Zifeng Ding, Zonggen Li, Xiaowen Ma, Jinhe Bi, Kristian Kersting, Jeff Z. Pan, Hinrich Schütze, Volker Tresp, and Yunpu Ma. 2025. https://arxiv.org/abs/2508.19828 Memory-r1: Enhancing large language model agents to manage and utilize memories via reinforcement learning . Preprint, arXiv:2508.19828

work page internal anchor Pith review Pith/arXiv arXiv 2025

[42] [47]

Yanwei Yue, Guibin Zhang, Boci Peng, Xuanbo Fan, Jiaxin Guo, Qiankun Li, and Yan Zhang. 2026. https://arxiv.org/abs/2601.23014 Mem-t: Densifying rewards for long-horizon memory agents . Preprint, arXiv:2601.23014

work page arXiv 2026

[43] [48]

Zacks, Nicole K

Jeffrey M. Zacks, Nicole K. Speer, Khena M. Swallow, Todd S. Braver, and Jeremy R. Reynolds. 2007. https://doi.org/10.1037/0033-2909.133.2.273 Event perception: a mind-brain perspective . Psychological Bulletin, 133(2):273--293

work page doi:10.1037/0033-2909.133.2.273 2007

[44] [49]

Haoji Zhang, Yiqin Wang, Yansong Tang, Yong Liu, Jiashi Feng, Jifeng Dai, and Xiaojie Jin. 2024. https://arxiv.org/abs/2406.08085 Flash-vstream: Memory-based real-time understanding for long video streams . Preprint, arXiv:2406.08085

work page arXiv 2024

[45] [50]

Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. 2023. https://arxiv.org/abs/2305.10250 Memorybank: Enhancing large language models with long-term memory . Preprint, arXiv:2305.10250

work page internal anchor Pith review Pith/arXiv arXiv 2023