From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents
Pith reviewed 2026-05-15 18:05 UTC · model grok-4.3
The pith
A pyramidal memory architecture distills fine-grained video perceptions into compact semantic schemas for efficient long-horizon reasoning in multimodal agents.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MM-Mem structures memory hierarchically into a Sensory Buffer for raw perceptual traces, an Episodic Stream for intermediate events, and a Symbolic Schema for high-level gist, using a Semantic Information Bottleneck objective and SIB-GRPO optimization to distill verbatim details into task-relevant semantics while preserving necessary information. An entropy-driven top-down retrieval strategy then accesses the appropriate memory level during inference.
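The retrieval step can be sketched mechanically: assuming the agent exposes an answer distribution conditioned on each memory level, an entropy gate decides whether the coarse gist suffices or finer memory must be consulted. This is an illustrative sketch only; the level names and the `answer_dist_at` hook are stand-ins, not the paper's API.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a categorical answer distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def top_down_retrieve(levels, answer_dist_at, threshold=0.5):
    """Hypothetical entropy-driven top-down retrieval.

    `levels` orders memory from coarse (Symbolic Schema) to fine
    (Sensory Buffer); `answer_dist_at(level)` stands in for the model's
    answer distribution when conditioned on that level. Descend to
    finer memory only while the answer remains uncertain."""
    for level in levels:
        dist = answer_dist_at(level)
        if entropy(dist) <= threshold:
            return level, dist      # confident enough: stop at this level
    return levels[-1], dist         # fall back to the finest level
```

Under this sketch, a question answerable from the gist never touches the Sensory Buffer, which is where the claimed latency savings would come from.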
What carries the argument
MM-Mem pyramidal architecture with Sensory Buffer, Episodic Stream, and Symbolic Schema layers, controlled by the Semantic Information Bottleneck objective and SIB-GRPO training, plus entropy-driven top-down retrieval.
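The three-level distillation can be sketched as a data structure, assuming for illustration that fixed-size windows trigger each consolidation step; the paper's actual triggers and distillation operators are learned, and the class and method names here are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class PyramidalMemory:
    """Illustrative sketch of a three-level verbatim-to-gist hierarchy.

    Level names follow the paper; the window-based consolidation and
    string summaries are stand-ins for learned distillation."""
    sensory: list = field(default_factory=list)   # raw per-frame traces (verbatim)
    episodic: list = field(default_factory=list)  # event-level summaries
    schema: list = field(default_factory=list)    # high-level gist

    def observe(self, frame_trace, window=4):
        self.sensory.append(frame_trace)
        # Every `window` frames, distill the sensory window into one event.
        if len(self.sensory) % window == 0:
            event = " | ".join(self.sensory[-window:])
            self.episodic.append(f"event[{event}]")
            # Every `window` events, abstract them into one schema entry.
            if len(self.episodic) % window == 0:
                self.schema.append(f"gist of last {window} events")
```

Note the geometric compression: with a window of 4, sixteen raw frames collapse into four events and a single schema entry, which is the shape of storage growth the core claim depends on.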
If this is right
- Long-horizon video tasks become feasible without quadratic growth in context or latency from raw frame accumulation.
- Memory usage shifts from dense visual storage to compact symbolic representations while retaining task utility.
- Streaming video agents gain robust generalization across offline and online benchmarks through the same hierarchical organization.
- Cognition-inspired memory hierarchies prove effective for balancing compression against retention in multimodal LLMs.
Where Pith is reading between the lines
- The same distillation pattern could be tested on long audio sequences or robot sensor streams where raw data is similarly voluminous.
- Hybrid systems might combine this pyramidal structure with external knowledge bases to handle even longer horizons without retraining.
- If the bottleneck objective generalizes, it offers a route to parameter-efficient continual learning for video agents that avoids catastrophic forgetting of early episodes.
Load-bearing premise
The hierarchical levels drawn from Fuzzy-Trace Theory plus the Semantic Information Bottleneck loss can be stably trained in multimodal models without discarding task-critical details that fall outside the training reward signal.
What would settle it
A long-video benchmark on which MM-Mem discards key visual or temporal details that dense baselines retain, causing it to fall short of state-of-the-art accuracy, would falsify the claim that the bottleneck preserves task-critical information.
Original abstract
While multimodal large language models have demonstrated impressive short-term reasoning, they struggle with long-horizon video understanding due to limited context windows and static memory mechanisms that fail to mirror human cognitive efficiency. Existing paradigms typically fall into two extremes: vision-centric methods that incur high latency and redundancy through dense visual accumulation, or text-centric approaches that suffer from detail loss and hallucination via aggressive captioning. To bridge this gap, we propose MM-Mem, a pyramidal multimodal memory architecture grounded in Fuzzy-Trace Theory. MM-Mem structures memory hierarchically into a Sensory Buffer, Episodic Stream, and Symbolic Schema, enabling the progressive distillation of fine-grained perceptual traces (verbatim) into high-level semantic schemas (gist). Furthermore, to govern the dynamic construction of memory, we derive a Semantic Information Bottleneck objective and introduce SIB-GRPO to optimize the trade-off between memory compression and task-relevant information retention. In inference, we design an entropy-driven top-down memory retrieval strategy. Extensive experiments across 4 benchmarks confirm that MM-Mem achieves state-of-the-art performance on both offline and streaming tasks, demonstrating robust generalization and validating the effectiveness of cognition-inspired memory organization. Code and associated configurations are publicly available at https://github.com/EliSpectre/MM-Mem.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes MM-Mem, a pyramidal multimodal memory architecture for long-horizon video agents in multimodal LLMs, grounded in Fuzzy-Trace Theory. It organizes memory hierarchically into a Sensory Buffer (verbatim traces), Episodic Stream, and Symbolic Schema (gist schemas) to progressively distill fine-grained perceptual information. A Semantic Information Bottleneck (SIB) objective is derived to control compression versus retention, optimized via the introduced SIB-GRPO method, with an entropy-driven top-down retrieval strategy at inference. The central claim is that this cognition-inspired design achieves state-of-the-art performance on both offline and streaming tasks across four benchmarks while demonstrating robust generalization.
Significance. If the experimental results hold with proper quantitative validation, the work would offer a meaningful advance in efficient long-context video reasoning by replacing dense visual accumulation or lossy captioning with a hierarchical, theory-grounded memory that explicitly manages the verbatim-to-gist trade-off. Public code release strengthens reproducibility and potential impact on downstream video-agent applications.
major comments (3)
- Abstract: the claim that MM-Mem 'achieves state-of-the-art performance on both offline and streaming tasks' across four benchmarks is presented without any quantitative numbers, ablation tables, error bars, or baseline comparisons, rendering the central empirical claim unverifiable from the provided text.
- Semantic Information Bottleneck section: the SIB objective is described as derived, yet the trade-off parameter is listed among free parameters and appears to require data-driven fitting; this introduces circularity because the rate-distortion balance is not shown to be independent of the fitted quantity used in SIB-GRPO.
- SIB-GRPO optimization: the reward is tied only to downstream task performance, so the bottleneck can discard perceptual or episodic details that are task-critical but unrewarded; no explicit mechanism or analysis demonstrates that the pyramidal hierarchy preserves such information, which is load-bearing for the claim of stable compression without loss.
Simulated Author's Rebuttal
We thank the referee for their constructive comments. We address each major point below and will incorporate revisions to improve the clarity and verifiability of our claims.
Point-by-point responses
-
Referee: Abstract: the claim that MM-Mem 'achieves state-of-the-art performance on both offline and streaming tasks' across four benchmarks is presented without any quantitative numbers, ablation tables, error bars, or baseline comparisons, rendering the central empirical claim unverifiable from the provided text.
Authors: We agree with this observation. The abstract in the current version summarizes the results qualitatively. In the revised manuscript, we will include specific quantitative results, such as the performance metrics on each benchmark (e.g., accuracy or success rate improvements), and direct references to the experimental tables. This will allow readers to verify the SOTA claim immediately from the abstract. revision: yes
-
Referee: Semantic Information Bottleneck section: the SIB objective is described as derived, yet the trade-off parameter is listed among free parameters and appears to require data-driven fitting; this introduces circularity because the rate-distortion balance is not shown to be independent of the fitted quantity used in SIB-GRPO.
Authors: The SIB objective is derived from the semantic information bottleneck principle, where the trade-off parameter λ is a hyperparameter that controls the balance between compression (rate) and retention (distortion). While λ is selected via validation to optimize performance, the functional form of the objective is independent of the data and the specific fitting process. We will revise the section to include the complete derivation, clarify that λ is a standard hyperparameter (similar to β in β-VAE), and add experiments showing the sensitivity to λ to demonstrate robustness. revision: partial
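For reference, the standard Information Bottleneck Lagrangian (Tishby et al., which the paper cites) has the form below; the paper's SIB objective presumably adapts it to the memory variables, but its exact form is not visible in this excerpt, so take this as the baseline the rebuttal's λ-as-hyperparameter argument appeals to.

```latex
% Compress input X into memory representation Z while retaining
% information about the task variable Y; \lambda sets the trade-off
% between rate I(X;Z) and task-relevant retention I(Z;Y).
\min_{p(z \mid x)} \; \mathcal{L}_{\mathrm{IB}}
  \;=\; I(X; Z) \;-\; \lambda \, I(Z; Y)
```

Under this reading, the referee's circularity worry reduces to whether λ is tuned on data that also generates the reported results, which the promised sensitivity experiments would address.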
-
Referee: SIB-GRPO optimization: the reward is tied only to downstream task performance, so the bottleneck can discard perceptual or episodic details that are task-critical but unrewarded; no explicit mechanism or analysis demonstrates that the pyramidal hierarchy preserves such information, which is load-bearing for the claim of stable compression without loss.
Authors: This is a valid concern regarding the potential for unintended information loss. The pyramidal structure ensures that lower levels (Sensory Buffer and Episodic Stream) retain detailed information, with the SIB applied progressively to distill only at higher levels. The reward in SIB-GRPO encourages retention of task-relevant semantics, but to address this, we will include additional ablation studies in the revision that isolate the contribution of each memory level and show that removing lower levels degrades performance on tasks requiring fine-grained details. We will also provide examples illustrating preserved information across levels. revision: yes
Circularity Check
No significant circularity detected; the derivation is presented as independent of the Fuzzy-Trace Theory grounding.
full rationale
The abstract states that MM-Mem is grounded in Fuzzy-Trace Theory and that the Semantic Information Bottleneck objective is derived to govern memory construction, with SIB-GRPO introduced for optimization. No equations or self-citations are provided in the visible text that reduce the central claim to a fitted parameter renamed as prediction or to a self-referential definition. The pyramidal hierarchy and entropy-driven retrieval are described as design choices validated by experiments on 4 benchmarks, without evidence that any load-bearing step collapses to its own inputs by construction. This is the most common honest outcome for papers whose core architecture is externally motivated and whose performance claims rest on empirical results rather than tautological re-derivation.
Axiom & Free-Parameter Ledger
free parameters (1)
- SIB trade-off parameter
axioms (1)
- Domain assumption: Fuzzy-Trace Theory provides a valid hierarchical structure for multimodal video memory.
invented entities (1)
- SIB-GRPO (no independent evidence)