pith. machine review for the scientific record.

arxiv: 2603.01455 · v3 · submitted 2026-03-02 · 💻 cs.CV · cs.AI · cs.CL · cs.IR · cs.MM

Recognition: no theorem link

From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 18:05 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL · cs.IR · cs.MM
keywords multimodal memory · long-horizon video understanding · Fuzzy-Trace Theory · semantic information bottleneck · video agents · memory distillation · pyramidal architecture · gist representation

The pith

A pyramidal memory architecture distills fine-grained video perceptions into compact semantic schemas for efficient long-horizon reasoning in multimodal agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Multimodal large language models handle short tasks well but falter on extended videos due to fixed context windows and memory that either piles up redundant visuals or strips away details through crude text summaries. The paper introduces MM-Mem to organize memory in three layers that progressively convert raw perceptual traces into abstract semantic structures, mirroring how human cognition separates verbatim details from gist. A Semantic Information Bottleneck objective, optimized through SIB-GRPO, controls the compression, while an entropy-driven retrieval strategy selects the appropriate memory level during inference. If the approach holds, agents could maintain coherent understanding across long video sequences without the latency of dense storage or the hallucinations of aggressive summarization.

Core claim

MM-Mem structures memory hierarchically into a Sensory Buffer for raw perceptual traces, an Episodic Stream for intermediate events, and a Symbolic Schema for high-level gist, using a Semantic Information Bottleneck objective and SIB-GRPO optimization to distill verbatim details into task-relevant semantics while preserving necessary information. An entropy-driven top-down retrieval strategy then accesses the appropriate memory level during inference.

What carries the argument

MM-Mem pyramidal architecture with Sensory Buffer, Episodic Stream, and Symbolic Schema layers, controlled by the Semantic Information Bottleneck objective and SIB-GRPO training, plus entropy-driven top-down retrieval.
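
A minimal sketch of how the three levels described above might be laid out, assuming only the layer names from the abstract; the field choices (frame embeddings, event nodes, schema entries) are illustrative, not the released implementation.

    from dataclasses import dataclass, field

    @dataclass
    class SensoryBuffer:        # verbatim: raw perceptual traces (e.g. frame embeddings)
        frames: list = field(default_factory=list)

    @dataclass
    class EpisodicStream:       # intermediate: event-level multimodal nodes
        events: list = field(default_factory=list)

    @dataclass
    class SymbolicSchema:       # gist: compact high-level semantic structures
        schemas: list = field(default_factory=list)

    @dataclass
    class MMMem:                # pyramidal memory: verbatim at the bottom, gist at the top
        buffer: SensoryBuffer = field(default_factory=SensoryBuffer)
        stream: EpisodicStream = field(default_factory=EpisodicStream)
        schema: SymbolicSchema = field(default_factory=SymbolicSchema)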

If this is right

  • Long-horizon video tasks become feasible without quadratic growth in context or latency from raw frame accumulation.
  • Memory usage shifts from dense visual storage to compact symbolic representations while retaining task utility.
  • Streaming video agents gain robust generalization across offline and online benchmarks through the same hierarchical organization.
  • Cognition-inspired memory hierarchies prove effective for balancing compression against retention in multimodal LLMs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same distillation pattern could be tested on long audio sequences or robot sensor streams where raw data is similarly voluminous.
  • Hybrid systems might combine this pyramidal structure with external knowledge bases to handle even longer horizons without retraining.
  • If the bottleneck objective generalizes, it offers a route to parameter-efficient continual learning for video agents that avoids catastrophic forgetting of early episodes.

Load-bearing premise

The hierarchical levels drawn from Fuzzy-Trace Theory plus the Semantic Information Bottleneck loss can be stably trained in multimodal models without discarding task-critical details that fall outside the training reward signal.
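
A way to make this premise concrete: the classical information bottleneck of Tishby et al. (reference 30 below) trades compression against retention as

    \min_{q(z \mid x)} \; I(X;Z) \; - \; \beta \, I(Z;Y)

where X is the verbatim perceptual trace, Z the distilled memory, Y the downstream task signal, and β the trade-off weight. Treating the paper's Semantic Information Bottleneck as a specialization of this form is an editorial assumption, not the paper's derivation; under it, the load-bearing question is whether a β tuned only against the task reward still preserves detail the reward never touches.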

What would settle it

An experiment on a long-video benchmark showing that MM-Mem discards key visual or temporal details that dense baselines retain, and that this loss prevents it from reaching state-of-the-art accuracy.

Figures

Figures reproduced from arXiv: 2603.01455 by Bin Chen, Hanshu Yao, Jinpeng Wang, Min Zhang, Niu Lian, Shu-Tao Xia, Yaowei Wang, Yuting Wang.

Figure 1
Figure 1: Existing memory paradigms (a-b), inspiration (c), and our insight (d). (a) Vision-centric methods incur redundancy and high latency due to dense visual memories. (b) Text-centric methods suffer from information loss during captioning, leading to hallucination and ambiguity. (c) The natural complementarity between vision and text neatly aligns with the distinction between verbatim and gist traces in Fuzzy-Trace Theory.
Figure 2
Figure 2: Overview of MM-Mem: MM-Mem unifies visual and textual memory through (left) a bottom-up memory construction process, which transforms raw sensory frames into abstract symbolic schemas, and (right) a top-down retrieval process that supports query-adaptive reasoning. The construction step chooses an operation from O = {ADD_NEW, MERGE, DISCARD}: ADD_NEW appends a node initialized from m_{t,i}, MERGE integrates m_{t,i} into the closest node e*, and DISCARD removes redundant items (a minimal routing sketch follows the figure list).
Figure 3
Figure 3: Visualization of ablation results and memory representations. Removing SIB-GRPO consistently degrades performance, with the largest drop on Long, suggesting its importance for consolidation under long temporal dependencies. Further removing the Pyramid (hierarchical) memory yields an additional decrease, again most pronounced on Long and Overall.
Figure 4
Figure 4: A qualitative example of MM-Mem's coarse-to-fine retrieval across memory layers. The pyramidal multimodal memory supports coarse-to-fine retrieval, letting the model move progressively from abstract semantic reasoning down to the Sensory Buffer, where concrete visual cues at a finer granularity can be retrieved to support precise verification.
Figure 5
Figure 5: HD-EPIC++ is an egocentric long-horizon kitchen video benchmark with highly detailed annotations, covering fine-grained action perception, temporal reasoning, 3D spatial understanding, object motion, gaze, and diverse VQA tasks (e.g., recipes and ingredients). The accompanying training configuration lists LoRA settings (lora_rank 64, lora_alpha 128, lora_dropout 0.05) and SIB-GRPO settings (epoch 3, batch_size 8, learning_rate 1e-5, beta 0.1, ppo_clip_epsilon 0.2, use_importance_sam…).
Figure 6
Figure 6: Answer Agent prompt template instructing the model to select the best option for a video-based multiple-choice question and respond only with the corresponding letter (A–D).
Figure 7
Figure 7: Prompts used by the VStream-QA evaluation agent (system and user prompts).
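
The caption of Figure 2 names the operation set O = {ADD_NEW, MERGE, DISCARD} for bottom-up memory construction. A minimal sketch of that routing, assuming a similarity metric and thresholds that the page does not specify:

    def update_episodic_stream(stream, m, sim, tau_discard=0.95, tau_merge=0.6):
        """Route a new memory item m using O = {ADD_NEW, MERGE, DISCARD}.

        stream: list of event nodes exposing .merge(m) (assumed interface).
        sim(e, m): similarity in [0, 1]; the metric and thresholds are illustrative.
        """
        if stream:
            best = max(stream, key=lambda e: sim(e, m))
            score = sim(best, m)
            if score >= tau_discard:
                return "DISCARD"       # near-duplicate: drop the redundant item
            if score >= tau_merge:
                best.merge(m)          # MERGE: integrate m_{t,i} into the closest node e*
                return "MERGE"
        stream.append(m)               # ADD_NEW: append a node initialized from m_{t,i}
        return "ADD_NEW"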
original abstract

While multimodal large language models have demonstrated impressive short-term reasoning, they struggle with long-horizon video understanding due to limited context windows and static memory mechanisms that fail to mirror human cognitive efficiency. Existing paradigms typically fall into two extremes: vision-centric methods that incur high latency and redundancy through dense visual accumulation, or text-centric approaches that suffer from detail loss and hallucination via aggressive captioning. To bridge this gap, we propose MM-Mem, a pyramidal multimodal memory architecture grounded in Fuzzy-Trace Theory. MM-Mem structures memory hierarchically into a Sensory Buffer, Episodic Stream, and Symbolic Schema, enabling the progressive distillation of fine-grained perceptual traces (verbatim) into high-level semantic schemas (gist). Furthermore, to govern the dynamic construction of memory, we derive a Semantic Information Bottleneck objective and introduce SIB-GRPO to optimize the trade-off between memory compression and task-relevant information retention. In inference, we design an entropy-driven top-down memory retrieval strategy. Extensive experiments across 4 benchmarks confirm that MM-Mem achieves state-of-the-art performance on both offline and streaming tasks, demonstrating robust generalization and validating the effectiveness of cognition-inspired memory organization. Code and associated configurations are publicly available at https://github.com/EliSpectre/MM-Mem.
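
One plausible reading of the "entropy-driven top-down memory retrieval strategy": answer at the most abstract level first and descend only while the answer distribution stays too uncertain. The threshold and the answer_distribution hook are assumptions, not the authors' procedure.

    import math

    def entropy(probs):
        """Shannon entropy of a categorical answer distribution."""
        return -sum(p * math.log(p) for p in probs if p > 0)

    def top_down_retrieve(memory, query, answer_distribution, tau=1.0):
        """Query memory levels from gist to verbatim, stopping once uncertainty is low.

        memory: an MM-Mem-like object with .schema, .stream, .buffer levels.
        answer_distribution(level, query): option probabilities from the agent's
        answer head (assumed hook); tau is an illustrative entropy threshold.
        """
        probs = None
        for level in (memory.schema, memory.stream, memory.buffer):
            probs = answer_distribution(level, query)
            if entropy(probs) <= tau:      # confident enough at this granularity
                return level, probs
        return memory.buffer, probs        # fall back to the finest verbatim level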

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 0 minor

Summary. The paper proposes MM-Mem, a pyramidal multimodal memory architecture for long-horizon video agents in multimodal LLMs, grounded in Fuzzy-Trace Theory. It organizes memory hierarchically into a Sensory Buffer (verbatim traces), Episodic Stream, and Symbolic Schema (gist schemas) to progressively distill fine-grained perceptual information. A Semantic Information Bottleneck (SIB) objective is derived to control compression versus retention, optimized via the introduced SIB-GRPO method, with an entropy-driven top-down retrieval strategy at inference. The central claim is that this cognition-inspired design achieves state-of-the-art performance on both offline and streaming tasks across four benchmarks while demonstrating robust generalization.

Significance. If the experimental results hold with proper quantitative validation, the work would offer a meaningful advance in efficient long-context video reasoning by replacing dense visual accumulation or lossy captioning with a hierarchical, theory-grounded memory that explicitly manages the verbatim-to-gist trade-off. Public code release strengthens reproducibility and potential impact on downstream video-agent applications.

major comments (3)
  1. Abstract: the claim that MM-Mem 'achieves state-of-the-art performance on both offline and streaming tasks' across four benchmarks is presented without any quantitative numbers, ablation tables, error bars, or baseline comparisons, rendering the central empirical claim unverifiable from the provided text.
  2. Semantic Information Bottleneck section: the SIB objective is described as derived, yet the trade-off parameter is listed among free parameters and appears to require data-driven fitting; this introduces circularity because the rate-distortion balance is not shown to be independent of the fitted quantity used in SIB-GRPO.
  3. SIB-GRPO optimization: the reward is tied only to downstream task performance, so the bottleneck can discard perceptual or episodic details that are task-critical but unrewarded; no explicit mechanism or analysis demonstrates that the pyramidal hierarchy preserves such information, which is load-bearing for the claim of stable compression without loss.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments. We address each major point below and will incorporate revisions to improve the clarity and verifiability of our claims.

point-by-point responses
  1. Referee: Abstract: the claim that MM-Mem 'achieves state-of-the-art performance on both offline and streaming tasks' across four benchmarks is presented without any quantitative numbers, ablation tables, error bars, or baseline comparisons, rendering the central empirical claim unverifiable from the provided text.

    Authors: We agree with this observation. The abstract in the current version summarizes the results qualitatively. In the revised manuscript, we will include specific quantitative results, such as the performance metrics on each benchmark (e.g., accuracy or success rate improvements), and direct references to the experimental tables. This will allow readers to verify the SOTA claim immediately from the abstract. revision: yes

  2. Referee: Semantic Information Bottleneck section: the SIB objective is described as derived, yet the trade-off parameter is listed among free parameters and appears to require data-driven fitting; this introduces circularity because the rate-distortion balance is not shown to be independent of the fitted quantity used in SIB-GRPO.

    Authors: The SIB objective is derived from the semantic information bottleneck principle, where the trade-off parameter λ is a hyperparameter that controls the balance between compression (rate) and retention (distortion). While λ is selected via validation to optimize performance, the functional form of the objective is independent of the data and the specific fitting process. We will revise the section to include the complete derivation, clarify that λ is a standard hyperparameter (similar to β in β-VAE), and add experiments showing the sensitivity to λ to demonstrate robustness. revision: partial

  3. Referee: SIB-GRPO optimization: the reward is tied only to downstream task performance, so the bottleneck can discard perceptual or episodic details that are task-critical but unrewarded; no explicit mechanism or analysis demonstrates that the pyramidal hierarchy preserves such information, which is load-bearing for the claim of stable compression without loss.

    Authors: This is a valid concern regarding the potential for unintended information loss. The pyramidal structure ensures that lower levels (Sensory Buffer and Episodic Stream) retain detailed information, with the SIB applied progressively to distill only at higher levels. The reward in SIB-GRPO encourages retention of task-relevant semantics, but to address this, we will include additional ablation studies in the revision that isolate the contribution of each memory level and show that removing lower levels degrades performance on tasks requiring fine-grained details. We will also provide examples illustrating preserved information across levels. revision: yes
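
The disagreement above is easiest to see in the reward itself. A hedged sketch of a group-relative (GRPO-style) advantage over rollouts whose reward mixes task success with a compression penalty; the decomposition and the weight lam are editorial assumptions drawn from the abstract, not the released code.

    def sib_grpo_advantages(task_rewards, memory_costs, lam=0.1):
        """Group-relative advantages for a group of memory-construction rollouts.

        task_rewards: per-rollout downstream reward (e.g. answer correctness).
        memory_costs: per-rollout compression cost (e.g. retained nodes or tokens).
        lam: assumed trade-off weight; anything task-critical but unrewarded by
        task_rewards is exactly what the referee worries gets compressed away.
        """
        rewards = [r - lam * c for r, c in zip(task_rewards, memory_costs)]
        mean = sum(rewards) / len(rewards)
        std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5 or 1.0
        return [(r - mean) / std for r in rewards]   # normalize within the group, as in GRPO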

Circularity Check

0 steps flagged

No significant circularity; the derivation is presented as independent of the Fuzzy-Trace Theory grounding.

full rationale

The abstract states that MM-Mem is grounded in Fuzzy-Trace Theory and that the Semantic Information Bottleneck objective is derived to govern memory construction, with SIB-GRPO introduced for optimization. No equations or self-citations are provided in the visible text that reduce the central claim to a fitted parameter renamed as prediction or to a self-referential definition. The pyramidal hierarchy and entropy-driven retrieval are described as design choices validated by experiments on 4 benchmarks, without evidence that any load-bearing step collapses to its own inputs by construction. This is the most common honest outcome for papers whose core architecture is externally motivated and whose performance claims rest on empirical results rather than tautological re-derivation.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The architecture rests on the applicability of Fuzzy-Trace Theory to video LLMs and on the existence of a tunable trade-off parameter inside the Semantic Information Bottleneck; both are introduced without independent empirical grounding in the abstract.

free parameters (1)
  • SIB trade-off parameter
    Controls compression versus retention; must be chosen or fitted to achieve the reported performance.
axioms (1)
  • domain assumption Fuzzy-Trace Theory provides a valid hierarchical structure for multimodal video memory
    The paper grounds the three-layer design directly in this psychological theory.
invented entities (1)
  • SIB-GRPO no independent evidence
    purpose: Optimizer for the semantic information bottleneck objective
    New training procedure introduced to balance memory compression and task performance.

pith-pipeline@v0.9.0 · 5564 in / 1369 out tokens · 46530 ms · 2026-05-15T18:05:46.734272+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · 12 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

  2. [2]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.

  3. [3]

    Videominer: Iteratively grounding key frames of hour-long videos via tree-based group relative policy optimization

    Xinye Cao, Hongcan Guo, Jiawen Qian, Guoshun Nan, Chao Wang, Yuqi Pan, Tianhao Hou, Xiaojuan Wang, and Yutong Gao. Videominer: Iteratively grounding key frames of hour-long videos via tree-based group relative policy optimization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 23773–23783, 2025.

  4. [4]

    VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

    Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, et al. VideoLLaMA 2: Advancing spatial-temporal modeling and audio understanding in video-LLMs. arXiv preprint arXiv:2406.07476, 2024.

  5. [5]

    Towards large language models with human-like episodic memory. Trends in Cognitive Sciences, 2025

    Cody V Dong, Qihong Lu, Kenneth A Norman, and Sebastian Michelmann. Towards large language models with human-like episodic memory. Trends in Cognitive Sciences, 2025.

  6. [6]

    Video-MME: The first-ever comprehensive evaluation benchmark of multi-modal LLMs in video analysis

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-MME: The first-ever comprehensive evaluation benchmark of multi-modal LLMs in video analysis. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 24108–24118, 2025.

  7. [7]

    Vita-1.5: Towards gpt-4o level real-time vision and speech interaction. arXiv preprint arXiv:2501.01957, 2025

    Chaoyou Fu, Haojia Lin, Xiong Wang, Yi-Fan Zhang, Yunhang Shen, Xiaoyu Liu, Haoyu Cao, Zuwei Long, Heting Gao, Ke Li, et al. Vita-1.5: Towards gpt-4o level real-time vision and speech interaction. arXiv preprint arXiv:2501.01957, 2025.

  8. [8]

    Inducing high energy-latency of large vision-language models with verbose images

    Kuofeng Gao, Yang Bai, Jindong Gu, Shu-Tao Xia, Philip Torr, Zhifeng Li, and Wei Liu. Inducing high energy-latency of large vision-language models with verbose images. In ICLR, 2024.

  9. [9]

    Ma-lmm: Memory-augmented large multimodal model for long-term video understanding

    Bo He, Hengduo Li, Young Kyun Jang, Menglin Jia, Xuefei Cao, Ashish Shah, Abhinav Shrivastava, and Ser-Nam Lim. Ma-lmm: Memory-augmented large multimodal model for long-term video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13504–13514, 2024.

  10. [10]

    View from the top: Hierarchies and reverse hierarchies in the visual system. Neuron, 36(5):791–804, 2002

    Shaul Hochstein and Merav Ahissar. View from the top: Hierarchies and reverse hierarchies in the visual system. Neuron, 36(5):791–804, 2002.

  11. [11]

    Licomemory: Lightweight and cognitive agentic memory for efficient long-term reasoning. arXiv preprint arXiv:2511.01448, 2025

    Zhengjun Huang, Zhoujin Tian, Qintian Guo, Fangyuan Zhang, Yingli Zhou, Di Jiang, Zeying Xie, and Xiaofang Zhou. Licomemory: Lightweight and cognitive agentic memory for efficient long-term reasoning. arXiv preprint arXiv:2511.01448, 2025.

  12. [12]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024.

  13. [13]

    Chat-univi: Unified visual representation empowers large language models with image and video understanding

    Peng Jin, Ryuichi Takanobu, Wancai Zhang, Xiaochun Cao, and Li Yuan. Chat-univi: Unified visual representation empowers large language models with image and video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13700–13710, 2024.

  14. [14]

    Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR, 2022.

  15. [15]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, pages 19730–19742. PMLR, 2023.

  16. [16]

    Llama-vid: An image is worth 2 tokens in large language models

    Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models. In European Conference on Computer Vision, pages 323–340. Springer, 2024.

  17. [17]

    Video-llava: Learning united visual representation by alignment before projection

    Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 5971–5984, 2024.

  18. [18]

    Visual instruction tuning. Advances in Neural Information Processing Systems, 36:34892–34916, 2023

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36:34892–34916, 2023.

  19. [19]

    Seeing, listening, remembering, and reasoning: A multimodal agent with long-term memory. arXiv preprint arXiv:2508.09736, 2025

    Lin Long, Yichen He, Wentao Ye, Yiyuan Pan, Yuan Lin, Hang Li, Junbo Zhao, and Wei Li. Seeing, listening, remembering, and reasoning: A multimodal agent with long-term memory. arXiv preprint arXiv:2508.09736, 2025.

  20. [20]

    Video-rag: Visually-aligned retrieval-augmented long video comprehension. arXiv preprint arXiv:2411.13093, 2024

    Yongdong Luo, Xiawu Zheng, Guilin Li, Shukang Yin, Haojia Lin, Chaoyou Fu, Jinfa Huang, Jiayi Ji, Fei Chao, Jiebo Luo, et al. Video-rag: Visually-aligned retrieval-augmented long video comprehension. arXiv preprint arXiv:2411.13093, 2024.

  21. [21]

    Video-chatgpt: Towards detailed video understanding via large vision and language models

    Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12585–12602, 2024.

  22. [22]

    M3-embedding: Multi-linguality, multi-functionality, multi-granularity text embeddings through self-knowledge distillation, 2024

    Multi-Linguality Multi-Functionality Multi-Granularity. M3-embedding: Multi-linguality, multi-functionality, multi-granularity text embeddings through self-knowledge distillation, 2024.

  23. [23]

    Memgpt: Towards llms as operating systems

    Charles Packer, Vivian Fang, Shishir G. Patil, Kevin Lin, Sarah Wooders, and Joseph E. Gonzalez. Memgpt: Towards llms as operating systems. 2023.

  24. [24]

    Hd-epic: A highly-detailed egocentric video dataset

    Toby Perrett, Ahmad Darkhalil, Saptarshi Sinha, Omar Emara, Sam Pollard, Kranti Kumar Parida, Kaiting Liu, Prajwal Gatti, Siddhant Bansal, Kevin Flanagan, et al. Hd-epic: A highly-detailed egocentric video dataset. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 23901–23913, 2025.

  25. [25]

    Streaming long video understanding with large language models. Advances in Neural Information Processing Systems, 37:119336–119360, 2024

    Rui Qian, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Shuangrui Ding, Dahua Lin, and Jiaqi Wang. Streaming long video understanding with large language models. Advances in Neural Information Processing Systems, 37:119336–119360, 2024.

  26. [26]

    Fuzzy-trace theory: An interim synthesis. Learning and Individual Differences, 7(1):1–75, 1995

    Valerie F Reyna and Charles J Brainerd. Fuzzy-trace theory: An interim synthesis. Learning and Individual Differences, 7(1):1–75, 1995.

  27. [27]

    Vgent: Graph-based retrieval-reasoning-augmented generation for long video understanding. arXiv preprint arXiv:2510.14032, 2025

    Xiaoqian Shen, Wenxuan Zhang, Jun Chen, and Mohamed Elhoseiny. Vgent: Graph-based retrieval-reasoning-augmented generation for long video understanding. arXiv preprint arXiv:2510.14032, 2025.

  28. [28]

    Moviechat: From dense token to sparse memory for long video understanding

    Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, et al. Moviechat: From dense token to sparse memory for long video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18221–18232, 2024.

  29. [29]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context.

  30. [30]

    The information bottleneck method

    Naftali Tishby, Fernando C Pereira, and William Bialek. The information bottleneck method. arXiv preprint physics/0004057, 2000.

  31. [31]

    Chatvideo: A tracklet-centric multimodal and versatile video understanding system. arXiv preprint arXiv:2304.14407, 2023

    Junke Wang, Dongdong Chen, Chong Luo, Xiyang Dai, Lu Yuan, Zuxuan Wu, and Yu-Gang Jiang. Chatvideo: A tracklet-centric multimodal and versatile video understanding system. arXiv preprint arXiv:2304.14407, 2023.

  32. [32]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.

  33. [33]

    Videoagent: Long-form video understanding with large language model as agent

    Xiaohan Wang, Yuhui Zhang, Orr Zohar, and Serena Yeung-Levy. Videoagent: Long-form video understanding with large language model as agent. In European Conference on Computer Vision, pages 58–

  34. [34]

    Longllava: Scaling multi-modal LLMs to 1000 images efficiently via a hybrid architecture. arXiv preprint arXiv:2409.02889, 2024

    Xidong Wang, Dingjie Song, Shunian Chen, Chen Zhang, and Benyou Wang. Longllava: Scaling multi-modal LLMs to 1000 images efficiently via a hybrid architecture. arXiv preprint arXiv:2409.02889, 2024.

  35. [35]

    Videotree: Adaptive tree-based video representation for llm reasoning on long videos

    Ziyang Wang, Shoubin Yu, Elias Stengel-Eskin, Jaehong Yoon, Feng Cheng, Gedas Bertasius, and Mohit Bansal. Videotree: Adaptive tree-based video representation for llm reasoning on long videos. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 3272–3283, 2025.

  36. [36]

    C-pack: Packed resources for general chinese embeddings

    Shitao Xiao, Zheng Liu, Peitian Zhang, Niklas Muennighoff, Defu Lian, and Jian-Yun Nie. C-pack: Packed resources for general chinese embeddings. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 641–649, 2024.

  37. [37]

    Large multimodal agents: A survey. arXiv preprint arXiv:2402.15116, 2024

    Junlin Xie, Zhihong Chen, Ruifei Zhang, Xiang Wan, and Guanbin Li. Large multimodal agents: A survey. arXiv preprint arXiv:2402.15116, 2024

  38. [38]

    A-MEM: Agentic Memory for LLM Agents

    Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-mem: Agentic memory for LLM agents. arXiv preprint arXiv:2502.12110, 2025.

  39. [39]

    Memory-R1: Enhancing Large Language Model Agents to Manage and Utilize Memories via Reinforcement Learning

    Sikuan Yan, Xiufeng Yang, Zuchao Huang, Ercong Nie, Zifeng Ding, Zonggen Li, Xiaowen Ma, Kristian Kersting, Jeff Z. Pan, Hinrich Schütze, et al. Memory-r1: Enhancing large language model agents to manage and utilize memories via reinforcement learning. arXiv preprint arXiv:2508.19828, 2025.

  40. [40]

    VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

    Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, et al. VideoLLaMA 3: Frontier multimodal foundation models for image and video understanding. arXiv preprint arXiv:2501.13106, 2025.

  41. [41]

    Flash-vstream: Memory-based real-time understanding for long video streams. arXiv preprint arXiv:2406.08085, 2024

    Haoji Zhang, Yiqin Wang, Yansong Tang, Yong Liu, Jiashi Feng, Jifeng Dai, and Xiaojie Jin. Flash-vstream: Memory-based real-time understanding for long video streams. arXiv preprint arXiv:2406.08085, 2024.

  42. [42]

    Long Context Transfer from Language to Vision

    Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, and Ziwei Liu. Long context transfer from language to vision. arXiv preprint arXiv:2406.16852, 2024.

  43. [43]

    LLaVA-Video: Video Instruction Tuning With Synthetic Data

    Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Video instruction tuning with synthetic data. arXiv preprint arXiv:2410.02713, 2024.

  44. [44]

    Memorybank: Enhancing large language models with long-term memory

    Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. Memorybank: Enhancing large language models with long-term memory. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19724–19731, 2024.

  45. [45]

    Mlvu: Benchmarking multi-task long video understanding

    Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Zhengyang Liang, Shitao Xiao, Minghao Qin, Xi Yang, Yongping Xiong, Bo Zhang, et al. Mlvu: Benchmarking multi-task long video understanding. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 13691–13701, 2025.
