pith. machine review for the scientific record.

arxiv: 2603.01455 · v3 · submitted 2026-03-02 · 💻 cs.CV · cs.AI · cs.CL · cs.IR · cs.MM

Recognition: no theorem link

From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 18:05 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL · cs.IR · cs.MM
keywords multimodal memory · long-horizon video understanding · Fuzzy-Trace Theory · semantic information bottleneck · video agents · memory distillation · pyramidal architecture · gist representation

The pith

A pyramidal memory architecture distills fine-grained video perceptions into compact semantic schemas for efficient long-horizon reasoning in multimodal agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Multimodal large language models handle short tasks well but falter on extended videos due to fixed context windows and memory that either piles up redundant visuals or strips away details through crude text summaries. The paper introduces MM-Mem to organize memory in three layers that progressively convert raw perceptual traces into abstract semantic structures, mirroring how human cognition separates verbatim details from gist. A Semantic Information Bottleneck objective, optimized through SIB-GRPO, controls the compression, while an entropy-driven retrieval strategy selects the appropriate memory level during inference. If the approach holds, agents could maintain coherent understanding across long video sequences without the latency of dense storage or the hallucinations of aggressive summarization.

Core claim

MM-Mem structures memory hierarchically into a Sensory Buffer for raw perceptual traces, an Episodic Stream for intermediate events, and a Symbolic Schema for high-level gist, using a Semantic Information Bottleneck objective and SIB-GRPO optimization to distill verbatim details into task-relevant semantics while preserving necessary information. An entropy-driven top-down retrieval strategy then accesses the appropriate memory level during inference.

What carries the argument

MM-Mem pyramidal architecture with Sensory Buffer, Episodic Stream, and Symbolic Schema layers, controlled by the Semantic Information Bottleneck objective and SIB-GRPO training, plus entropy-driven top-down retrieval.
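
A minimal sketch of how the three levels described above might be laid out, assuming only the layer names from the abstract; the field choices (frame embeddings, event nodes, schema entries) are illustrative, not the released implementation.

    from dataclasses import dataclass, field

    @dataclass
    class SensoryBuffer:        # verbatim: raw perceptual traces (e.g. frame embeddings)
        frames: list = field(default_factory=list)

    @dataclass
    class EpisodicStream:       # intermediate: event-level multimodal nodes
        events: list = field(default_factory=list)

    @dataclass
    class SymbolicSchema:       # gist: compact high-level semantic structures
        schemas: list = field(default_factory=list)

    @dataclass
    class MMMem:                # pyramidal memory: verbatim at the bottom, gist at the top
        buffer: SensoryBuffer = field(default_factory=SensoryBuffer)
        stream: EpisodicStream = field(default_factory=EpisodicStream)
        schema: SymbolicSchema = field(default_factory=SymbolicSchema)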

If this is right

  • Long-horizon video tasks become feasible without quadratic growth in context or latency from raw frame accumulation.
  • Memory usage shifts from dense visual storage to compact symbolic representations while retaining task utility.
  • Streaming video agents gain robust generalization across offline and online benchmarks through the same hierarchical organization.
  • Cognition-inspired memory hierarchies prove effective for balancing compression against retention in multimodal LLMs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same distillation pattern could be tested on long audio sequences or robot sensor streams where raw data is similarly voluminous.
  • Hybrid systems might combine this pyramidal structure with external knowledge bases to handle even longer horizons without retraining.
  • If the bottleneck objective generalizes, it offers a route to parameter-efficient continual learning for video agents that avoids catastrophic forgetting of early episodes.

Load-bearing premise

The hierarchical levels drawn from Fuzzy-Trace Theory plus the Semantic Information Bottleneck loss can be stably trained in multimodal models without discarding task-critical details that fall outside the training reward signal.
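
A way to make this premise concrete: the classical information bottleneck of Tishby et al. (reference 30 below) trades compression against retention as

    \min_{q(z \mid x)} \; I(X;Z) \; - \; \beta \, I(Z;Y)

where X is the verbatim perceptual trace, Z the distilled memory, Y the downstream task signal, and β the trade-off weight. Treating the paper's Semantic Information Bottleneck as a specialization of this form is an editorial assumption, not the paper's derivation; under it, the load-bearing question is whether a β tuned only against the task reward still preserves detail the reward never touches.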

What would settle it

An experiment on a long-video benchmark showing that MM-Mem discards key visual or temporal details that dense baselines retain, and that this loss prevents it from reaching state-of-the-art accuracy.

Figures

Figures reproduced from arXiv: 2603.01455 by Bin Chen, Hanshu Yao, Jinpeng Wang, Min Zhang, Niu Lian, Shu-Tao Xia, Yaowei Wang, Yuting Wang.

Figure 1
Figure 1: Existing memory paradigms (a-b), inspiration (c), and our insight (d). (a) Vision-centric methods incur redundancy and high latency due to dense visual memories. (b) Text-centric methods suffer from information loss during captioning, leading to hallucination and ambiguity. (c) The natural complementarity between vision and text neatly aligns with the distinction between verbatim and gist traces in Fuzzy-Trace Theory.
Figure 2
Figure 2: Overview of MM-Mem: MM-Mem unifies visual and textual memory through (left) a bottom-up memory construction process, which transforms raw sensory frames into abstract symbolic schemas, and (right) a top-down retrieval process that supports query-adaptive reasoning. The construction step chooses an operation from O = {ADD_NEW, MERGE, DISCARD}: ADD_NEW appends a node initialized from m_{t,i}, MERGE integrates m_{t,i} into the closest node e*, and DISCARD removes redundant items (a minimal routing sketch follows the figure list).
Figure 3
Figure 3: Visualization of ablation results and memory representations. Removing SIB-GRPO consistently degrades performance, with the largest drop on Long, suggesting its importance for consolidation under long temporal dependencies. Further removing the Pyramid (hierarchical) memory yields an additional decrease, again most pronounced on Long and Overall.
Figure 4
Figure 4: A qualitative example of MM-Mem's coarse-to-fine retrieval across memory layers. The pyramidal multimodal memory supports coarse-to-fine retrieval, letting the model move progressively from abstract semantic reasoning down to the Sensory Buffer, where concrete visual cues at a finer granularity can be retrieved to support precise verification.
Figure 5
Figure 5: HD-EPIC++ is an egocentric long-horizon kitchen video benchmark with highly detailed annotations, covering fine-grained action perception, temporal reasoning, 3D spatial understanding, object motion, gaze, and diverse VQA tasks (e.g., recipes and ingredients). The accompanying training configuration lists LoRA settings (lora_rank 64, lora_alpha 128, lora_dropout 0.05) and SIB-GRPO settings (epoch 3, batch_size 8, learning_rate 1e-5, beta 0.1, ppo_clip_epsilon 0.2, use_importance_sam…).
Figure 6
Figure 6: Answer Agent prompt template instructing the model to select the best option for a video-based multiple-choice question and respond only with the corresponding letter (A–D).
Figure 7
Figure 7: Prompts used by the VStream-QA evaluation agent (system and user prompts).
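
The caption of Figure 2 names the operation set O = {ADD_NEW, MERGE, DISCARD} for bottom-up memory construction. A minimal sketch of that routing, assuming a similarity metric and thresholds that the page does not specify:

    def update_episodic_stream(stream, m, sim, tau_discard=0.95, tau_merge=0.6):
        """Route a new memory item m using O = {ADD_NEW, MERGE, DISCARD}.

        stream: list of event nodes exposing .merge(m) (assumed interface).
        sim(e, m): similarity in [0, 1]; the metric and thresholds are illustrative.
        """
        if stream:
            best = max(stream, key=lambda e: sim(e, m))
            score = sim(best, m)
            if score >= tau_discard:
                return "DISCARD"       # near-duplicate: drop the redundant item
            if score >= tau_merge:
                best.merge(m)          # MERGE: integrate m_{t,i} into the closest node e*
                return "MERGE"
        stream.append(m)               # ADD_NEW: append a node initialized from m_{t,i}
        return "ADD_NEW"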
original abstract

While multimodal large language models have demonstrated impressive short-term reasoning, they struggle with long-horizon video understanding due to limited context windows and static memory mechanisms that fail to mirror human cognitive efficiency. Existing paradigms typically fall into two extremes: vision-centric methods that incur high latency and redundancy through dense visual accumulation, or text-centric approaches that suffer from detail loss and hallucination via aggressive captioning. To bridge this gap, we propose MM-Mem, a pyramidal multimodal memory architecture grounded in Fuzzy-Trace Theory. MM-Mem structures memory hierarchically into a Sensory Buffer, Episodic Stream, and Symbolic Schema, enabling the progressive distillation of fine-grained perceptual traces (verbatim) into high-level semantic schemas (gist). Furthermore, to govern the dynamic construction of memory, we derive a Semantic Information Bottleneck objective and introduce SIB-GRPO to optimize the trade-off between memory compression and task-relevant information retention. In inference, we design an entropy-driven top-down memory retrieval strategy. Extensive experiments across 4 benchmarks confirm that MM-Mem achieves state-of-the-art performance on both offline and streaming tasks, demonstrating robust generalization and validating the effectiveness of cognition-inspired memory organization. Code and associated configurations are publicly available at https://github.com/EliSpectre/MM-Mem.
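
One plausible reading of the "entropy-driven top-down memory retrieval strategy": answer at the most abstract level first and descend only while the answer distribution stays too uncertain. The threshold and the answer_distribution hook are assumptions, not the authors' procedure.

    import math

    def entropy(probs):
        """Shannon entropy of a categorical answer distribution."""
        return -sum(p * math.log(p) for p in probs if p > 0)

    def top_down_retrieve(memory, query, answer_distribution, tau=1.0):
        """Query memory levels from gist to verbatim, stopping once uncertainty is low.

        memory: an MM-Mem-like object with .schema, .stream, .buffer levels.
        answer_distribution(level, query): option probabilities from the agent's
        answer head (assumed hook); tau is an illustrative entropy threshold.
        """
        probs = None
        for level in (memory.schema, memory.stream, memory.buffer):
            probs = answer_distribution(level, query)
            if entropy(probs) <= tau:      # confident enough at this granularity
                return level, probs
        return memory.buffer, probs        # fall back to the finest verbatim level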

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 0 minor

Summary. The paper proposes MM-Mem, a pyramidal multimodal memory architecture for long-horizon video agents in multimodal LLMs, grounded in Fuzzy-Trace Theory. It organizes memory hierarchically into a Sensory Buffer (verbatim traces), Episodic Stream, and Symbolic Schema (gist schemas) to progressively distill fine-grained perceptual information. A Semantic Information Bottleneck (SIB) objective is derived to control compression versus retention, optimized via the introduced SIB-GRPO method, with an entropy-driven top-down retrieval strategy at inference. The central claim is that this cognition-inspired design achieves state-of-the-art performance on both offline and streaming tasks across four benchmarks while demonstrating robust generalization.

Significance. If the experimental results hold with proper quantitative validation, the work would offer a meaningful advance in efficient long-context video reasoning by replacing dense visual accumulation or lossy captioning with a hierarchical, theory-grounded memory that explicitly manages the verbatim-to-gist trade-off. Public code release strengthens reproducibility and potential impact on downstream video-agent applications.

major comments (3)
  1. Abstract: the claim that MM-Mem 'achieves state-of-the-art performance on both offline and streaming tasks' across four benchmarks is presented without any quantitative numbers, ablation tables, error bars, or baseline comparisons, rendering the central empirical claim unverifiable from the provided text.
  2. Semantic Information Bottleneck section: the SIB objective is described as derived, yet the trade-off parameter is listed among free parameters and appears to require data-driven fitting; this introduces circularity because the rate-distortion balance is not shown to be independent of the fitted quantity used in SIB-GRPO.
  3. SIB-GRPO optimization: the reward is tied only to downstream task performance, so the bottleneck can discard perceptual or episodic details that are task-critical but unrewarded; no explicit mechanism or analysis demonstrates that the pyramidal hierarchy preserves such information, which is load-bearing for the claim of stable compression without loss.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments. We address each major point below and will incorporate revisions to improve the clarity and verifiability of our claims.

point-by-point responses
  1. Referee: Abstract: the claim that MM-Mem 'achieves state-of-the-art performance on both offline and streaming tasks' across four benchmarks is presented without any quantitative numbers, ablation tables, error bars, or baseline comparisons, rendering the central empirical claim unverifiable from the provided text.

    Authors: We agree with this observation. The abstract in the current version summarizes the results qualitatively. In the revised manuscript, we will include specific quantitative results, such as the performance metrics on each benchmark (e.g., accuracy or success rate improvements), and direct references to the experimental tables. This will allow readers to verify the SOTA claim immediately from the abstract. revision: yes

  2. Referee: Semantic Information Bottleneck section: the SIB objective is described as derived, yet the trade-off parameter is listed among free parameters and appears to require data-driven fitting; this introduces circularity because the rate-distortion balance is not shown to be independent of the fitted quantity used in SIB-GRPO.

    Authors: The SIB objective is derived from the semantic information bottleneck principle, where the trade-off parameter λ is a hyperparameter that controls the balance between compression (rate) and retention (distortion). While λ is selected via validation to optimize performance, the functional form of the objective is independent of the data and the specific fitting process. We will revise the section to include the complete derivation, clarify that λ is a standard hyperparameter (similar to β in β-VAE), and add experiments showing the sensitivity to λ to demonstrate robustness. revision: partial

  3. Referee: SIB-GRPO optimization: the reward is tied only to downstream task performance, so the bottleneck can discard perceptual or episodic details that are task-critical but unrewarded; no explicit mechanism or analysis demonstrates that the pyramidal hierarchy preserves such information, which is load-bearing for the claim of stable compression without loss.

    Authors: This is a valid concern regarding the potential for unintended information loss. The pyramidal structure ensures that lower levels (Sensory Buffer and Episodic Stream) retain detailed information, with the SIB applied progressively to distill only at higher levels. The reward in SIB-GRPO encourages retention of task-relevant semantics, but to address this, we will include additional ablation studies in the revision that isolate the contribution of each memory level and show that removing lower levels degrades performance on tasks requiring fine-grained details. We will also provide examples illustrating preserved information across levels. revision: yes
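
The disagreement above is easiest to see in the reward itself. A hedged sketch of a group-relative (GRPO-style) advantage over rollouts whose reward mixes task success with a compression penalty; the decomposition and the weight lam are editorial assumptions drawn from the abstract, not the released code.

    def sib_grpo_advantages(task_rewards, memory_costs, lam=0.1):
        """Group-relative advantages for a group of memory-construction rollouts.

        task_rewards: per-rollout downstream reward (e.g. answer correctness).
        memory_costs: per-rollout compression cost (e.g. retained nodes or tokens).
        lam: assumed trade-off weight; anything task-critical but unrewarded by
        task_rewards is exactly what the referee worries gets compressed away.
        """
        rewards = [r - lam * c for r, c in zip(task_rewards, memory_costs)]
        mean = sum(rewards) / len(rewards)
        std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5 or 1.0
        return [(r - mean) / std for r in rewards]   # normalize within the group, as in GRPO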

Circularity Check

0 steps flagged

No significant circularity; the derivation is presented as independent of the Fuzzy-Trace Theory grounding.

full rationale

The abstract states that MM-Mem is grounded in Fuzzy-Trace Theory and that the Semantic Information Bottleneck objective is derived to govern memory construction, with SIB-GRPO introduced for optimization. No equations or self-citations are provided in the visible text that reduce the central claim to a fitted parameter renamed as prediction or to a self-referential definition. The pyramidal hierarchy and entropy-driven retrieval are described as design choices validated by experiments on 4 benchmarks, without evidence that any load-bearing step collapses to its own inputs by construction. This is the most common honest outcome for papers whose core architecture is externally motivated and whose performance claims rest on empirical results rather than tautological re-derivation.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The architecture rests on the applicability of Fuzzy-Trace Theory to video LLMs and on the existence of a tunable trade-off parameter inside the Semantic Information Bottleneck; both are introduced without independent empirical grounding in the abstract.

free parameters (1)
  • SIB trade-off parameter
    Controls compression versus retention; must be chosen or fitted to achieve the reported performance.
axioms (1)
  • domain assumption Fuzzy-Trace Theory provides a valid hierarchical structure for multimodal video memory
    The paper grounds the three-layer design directly in this psychological theory.
invented entities (1)
  • SIB-GRPO no independent evidence
    purpose: Optimizer for the semantic information bottleneck objective
    New training procedure introduced to balance memory compression and task performance.

pith-pipeline@v0.9.0 · 5564 in / 1369 out tokens · 46530 ms · 2026-05-15T18:05:46.734272+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · 12 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

  2. [2]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.

  3. [3]

    Videominer: Iteratively grounding key frames of hour-long videos via tree-based group relative policy optimization

    Xinye Cao, Hongcan Guo, Jiawen Qian, Guoshun Nan, Chao Wang, Yuqi Pan, Tianhao Hou, Xiaojuan Wang, and Yutong Gao. Videominer: Iteratively grounding key frames of hour-long videos via tree-based group relative policy optimization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 23773–23783, 2025.

  4. [4]

    VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

    Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, et al. VideoLLaMA 2: Advancing spatial-temporal modeling and audio understanding in video-LLMs. arXiv preprint arXiv:2406.07476, 2024.

  5. [5]

    Towards large language models with human-like episodic memory. Trends in Cognitive Sciences, 2025

    Cody V Dong, Qihong Lu, Kenneth A Norman, and Sebastian Michelmann. Towards large language models with human-like episodic memory. Trends in Cognitive Sciences, 2025.

  6. [6]

    Video-MME: The first-ever comprehensive evaluation benchmark of multi-modal LLMs in video analysis

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-MME: The first-ever comprehensive evaluation benchmark of multi-modal LLMs in video analysis. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 24108–24118, 2025.

  7. [7]

    Vita-1.5: Towards gpt-4o level real-time vision and speech interaction. arXiv preprint arXiv:2501.01957, 2025

    Chaoyou Fu, Haojia Lin, Xiong Wang, Yi-Fan Zhang, Yunhang Shen, Xiaoyu Liu, Haoyu Cao, Zuwei Long, Heting Gao, Ke Li, et al. Vita-1.5: Towards gpt-4o level real-time vision and speech interaction. arXiv preprint arXiv:2501.01957, 2025.

  8. [8]

    Inducing high energy-latency of large vision-language models with verbose images

    Kuofeng Gao, Yang Bai, Jindong Gu, Shu-Tao Xia, Philip Torr, Zhifeng Li, and Wei Liu. Inducing high energy-latency of large vision-language models with verbose images. In ICLR, 2024.

  9. [9]

    Ma-lmm: Memory-augmented large multimodal model for long-term video understanding

    Bo He, Hengduo Li, Young Kyun Jang, Menglin Jia, Xuefei Cao, Ashish Shah, Abhinav Shrivastava, and Ser-Nam Lim. Ma-lmm: Memory-augmented large multimodal model for long-term video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13504–13514, 2024.

  10. [10]

    View from the top: Hierarchies and reverse hierarchies in the visual system. Neuron, 36(5):791–804, 2002

    Shaul Hochstein and Merav Ahissar. View from the top: Hierarchies and reverse hierarchies in the visual system. Neuron, 36(5):791–804, 2002.

  11. [11]

    Licomemory: Lightweight and cognitive agentic memory for efficient long-term reasoning. arXiv preprint arXiv:2511.01448, 2025

    Zhengjun Huang, Zhoujin Tian, Qintian Guo, Fangyuan Zhang, Yingli Zhou, Di Jiang, Zeying Xie, and Xiaofang Zhou. Licomemory: Lightweight and cognitive agentic memory for efficient long-term reasoning. arXiv preprint arXiv:2511.01448, 2025.

  12. [12]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024.

  13. [13]

    Chat-univi: Unified visual representation empowers large language models with image and video understanding

    Peng Jin, Ryuichi Takanobu, Wancai Zhang, Xiaochun Cao, and Li Yuan. Chat-univi: Unified visual representation empowers large language models with image and video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13700–13710, 2024.

  14. [14]

    Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR, 2022.

  15. [15]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, pages 19730–19742. PMLR, 2023.

  16. [16]

    Llama-vid: An image is worth 2 tokens in large language models

    Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models. In European Conference on Computer Vision, pages 323–340. Springer, 2024.

  17. [17]

    Video-llava: Learning united visual representation by alignment before projection

    Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 5971–5984, 2024.

  18. [18]

    Visual instruction tuning. Advances in Neural Information Processing Systems, 36:34892–34916, 2023

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36:34892–34916, 2023.

  19. [19]

    Seeing, listening, remembering, and reasoning: A multimodal agent with long-term memory. arXiv preprint arXiv:2508.09736, 2025

    Lin Long, Yichen He, Wentao Ye, Yiyuan Pan, Yuan Lin, Hang Li, Junbo Zhao, and Wei Li. Seeing, listening, remembering, and reasoning: A multimodal agent with long-term memory. arXiv preprint arXiv:2508.09736, 2025.

  20. [20]

    Video-rag: Visually-aligned retrieval-augmented long video comprehension. arXiv preprint arXiv:2411.13093, 2024

    Yongdong Luo, Xiawu Zheng, Guilin Li, Shukang Yin, Haojia Lin, Chaoyou Fu, Jinfa Huang, Jiayi Ji, Fei Chao, Jiebo Luo, et al. Video-rag: Visually-aligned retrieval-augmented long video comprehension. arXiv preprint arXiv:2411.13093, 2024.

  21. [21]

    Video-chatgpt: Towards detailed video understanding via large vision and language models

    Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12585–12602, 2024.

  22. [22]

    M3-embedding: Multi-linguality, multi-functionality, multi-granularity text embeddings through self-knowledge distillation, 2024

    Multi-Linguality Multi-Functionality Multi-Granularity. M3-embedding: Multi-linguality, multi-functionality, multi-granularity text embeddings through self-knowledge distillation, 2024.

  23. [23]

    Memgpt: Towards llms as operating systems

    Charles Packer, Vivian Fang, Shishir G. Patil, Kevin Lin, Sarah Wooders, and Joseph E. Gonzalez. Memgpt: Towards llms as operating systems. 2023.

  24. [24]

    Hd-epic: A highly-detailed egocentric video dataset

    Toby Perrett, Ahmad Darkhalil, Saptarshi Sinha, Omar Emara, Sam Pollard, Kranti Kumar Parida, Kaiting Liu, Prajwal Gatti, Siddhant Bansal, Kevin Flanagan, et al. Hd-epic: A highly-detailed egocentric video dataset. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 23901–23913, 2025.

  25. [25]

    Streaming long video understanding with large language models. Advances in Neural Information Processing Systems, 37:119336–119360, 2024

    Rui Qian, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Shuangrui Ding, Dahua Lin, and Jiaqi Wang. Streaming long video understanding with large language models. Advances in Neural Information Processing Systems, 37:119336–119360, 2024.

  26. [26]

    Fuzzy-trace theory: An interim synthesis. Learning and Individual Differences, 7(1):1–75, 1995

    Valerie F Reyna and Charles J Brainerd. Fuzzy-trace theory: An interim synthesis. Learning and Individual Differences, 7(1):1–75, 1995.

  27. [27]

    Vgent: Graph-based retrieval-reasoning-augmented generation for long video understanding. arXiv preprint arXiv:2510.14032, 2025

    Xiaoqian Shen, Wenxuan Zhang, Jun Chen, and Mohamed Elhoseiny. Vgent: Graph-based retrieval-reasoning-augmented generation for long video understanding. arXiv preprint arXiv:2510.14032, 2025.

  28. [28]

    Moviechat: From dense token to sparse memory for long video understanding

    Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, et al. Moviechat: From dense token to sparse memory for long video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18221–18232, 2024.

  29. [29]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context.

  30. [30]

    The information bottleneck method

    Naftali Tishby, Fernando C Pereira, and William Bialek. The information bottleneck method. arXiv preprint physics/0004057, 2000.

  31. [31]

    Chatvideo: A tracklet-centric multimodal and versatile video understanding system. arXiv preprint arXiv:2304.14407, 2023

    Junke Wang, Dongdong Chen, Chong Luo, Xiyang Dai, Lu Yuan, Zuxuan Wu, and Yu-Gang Jiang. Chatvideo: A tracklet-centric multimodal and versatile video understanding system. arXiv preprint arXiv:2304.14407, 2023.

  32. [32]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.

  33. [33]

    Videoagent: Long-form video understanding with large language model as agent

    Xiaohan Wang, Yuhui Zhang, Orr Zohar, and Serena Yeung-Levy. Videoagent: Long-form video understanding with large language model as agent. In European Conference on Computer Vision, pages 58–

  34. [34]

    Longllava: Scaling multi-modal LLMs to 1000 images efficiently via a hybrid architecture. arXiv preprint arXiv:2409.02889, 2024

    Xidong Wang, Dingjie Song, Shunian Chen, Chen Zhang, and Benyou Wang. Longllava: Scaling multi-modal LLMs to 1000 images efficiently via a hybrid architecture. arXiv preprint arXiv:2409.02889, 2024.

  35. [35]

    Videotree: Adaptive tree-based video representation for llm reasoning on long videos

    Ziyang Wang, Shoubin Yu, Elias Stengel-Eskin, Jaehong Yoon, Feng Cheng, Gedas Bertasius, and Mohit Bansal. Videotree: Adaptive tree-based video representation for llm reasoning on long videos. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 3272–3283, 2025.

  36. [36]

    C-pack: Packed resources for general chinese embeddings

    Shitao Xiao, Zheng Liu, Peitian Zhang, Niklas Muennighoff, Defu Lian, and Jian-Yun Nie. C-pack: Packed resources for general chinese embeddings. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 641–649, 2024.

  37. [37]

    Large multimodal agents: A survey. arXiv preprint arXiv:2402.15116, 2024

    Junlin Xie, Zhihong Chen, Ruifei Zhang, Xiang Wan, and Guanbin Li. Large multimodal agents: A survey. arXiv preprint arXiv:2402.15116, 2024

  38. [38]

    A-MEM: Agentic Memory for LLM Agents

    Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-mem: Agentic memory for LLM agents. arXiv preprint arXiv:2502.12110, 2025.

  39. [39]

    Memory-R1: Enhancing Large Language Model Agents to Manage and Utilize Memories via Reinforcement Learning

    Sikuan Yan, Xiufeng Yang, Zuchao Huang, Ercong Nie, Zifeng Ding, Zonggen Li, Xiaowen Ma, Kristian Kersting, Jeff Z. Pan, Hinrich Schütze, et al. Memory-r1: Enhancing large language model agents to manage and utilize memories via reinforcement learning. arXiv preprint arXiv:2508.19828, 2025.

  40. [40]

    VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

    Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, et al. VideoLLaMA 3: Frontier multimodal foundation models for image and video understanding. arXiv preprint arXiv:2501.13106, 2025.

  41. [41]

    Flash-vstream: Memory-based real-time understanding for long video streams. arXiv preprint arXiv:2406.08085, 2024

    Haoji Zhang, Yiqin Wang, Yansong Tang, Yong Liu, Jiashi Feng, Jifeng Dai, and Xiaojie Jin. Flash-vstream: Memory-based real-time understanding for long video streams. arXiv preprint arXiv:2406.08085, 2024.

  42. [42]

    Long Context Transfer from Language to Vision

    Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, and Ziwei Liu. Long context transfer from language to vision. arXiv preprint arXiv:2406.16852, 2024.

  43. [43]

    LLaVA-Video: Video Instruction Tuning With Synthetic Data

    Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Video instruction tuning with synthetic data. arXiv preprint arXiv:2410.02713, 2024.

  44. [44]

    Memorybank: Enhancing large language models with long-term memory

    Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. Memorybank: Enhancing large language models with long-term memory. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19724–19731, 2024.

  45. [45]

    Mlvu: Benchmarking multi-task long video understanding

    Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Zhengyang Liang, Shitao Xiao, Minghao Qin, Xi Yang, Yongping Xiong, Bo Zhang, et al. Mlvu: Benchmarking multi-task long video understanding. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 13691–13701, 2025.
