Recognition: 2 theorem links · Lean Theorem
Agentic Learner with Grow-and-Refine Multimodal Semantic Memory
Pith reviewed 2026-05-17 04:23 UTC · model grok-4.3
The pith
A dual-stream memory system lets multimodal models accumulate and refine integrated visual and logical knowledge from past experiences without repeating mistakes or forgetting.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ViLoMem constructs compact, schema-based memory through a dual-stream architecture that separately encodes visual distraction patterns and logical reasoning errors. Following a grow-and-refine principle, the system incrementally accumulates successful and failed experiences into stable multimodal semantic knowledge. This avoids catastrophic forgetting and produces generalizable strategies, resulting in higher pass@1 accuracy and fewer repeated visual and logical errors across six multimodal benchmarks. Ablations show that the explicit separation of distraction and hallucination patterns is necessary for these gains.
What carries the argument
The ViLoMem dual-stream memory framework, which separately encodes visual distraction patterns and logical reasoning errors and updates them via incremental grow-and-refine cycles.
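Pith's rendering includes none of the paper's code, so the following is a minimal sketch of what a dual-stream, schema-based memory of this kind could look like, assuming each stream is a keyed store of compact schemas and retrieval is a toy term-overlap lookup. The class and method names (Schema, DualStreamMemory, grow_or_refine) are illustrative, not ViLoMem's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class Schema:
    """One compact memory entry: a distilled pattern plus guidance, not a raw trajectory."""
    pattern: str    # e.g. "distracted by a salient but irrelevant chart region"
    guidance: str   # e.g. "re-read axis labels before comparing bar heights"
    successes: int = 0
    failures: int = 0

@dataclass
class DualStreamMemory:
    """Two separate stores: visual distraction patterns vs. logical reasoning errors."""
    visual: dict = field(default_factory=dict)
    logical: dict = field(default_factory=dict)

    def _store(self, stream: str) -> dict:
        return self.visual if stream == "visual" else self.logical

    def grow_or_refine(self, stream: str, key: str, pattern: str,
                       guidance: str, succeeded: bool) -> None:
        """Grow a new schema for an unseen pattern, or refine an existing one in place."""
        store = self._store(stream)
        entry = store.get(key)
        if entry is None:
            entry = Schema(pattern, guidance)   # grow: admit a new pattern
            store[key] = entry
        else:
            entry.guidance = guidance           # refine: update the stored strategy
        if succeeded:
            entry.successes += 1
        else:
            entry.failures += 1

    def retrieve(self, stream: str, query_terms: set, k: int = 3) -> list:
        """Toy retrieval: rank schemas by word overlap between the query and the pattern."""
        ranked = sorted(self._store(stream).values(),
                        key=lambda s: len(query_terms & set(s.pattern.split())),
                        reverse=True)
        return ranked[:k]
```

Under this reading, the agent would retrieve the top-k schemas from each stream before a new problem, prepend their guidance to the prompt, and route any new failure into the appropriate stream afterwards.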
If this is right
- Higher pass@1 accuracy on six multimodal benchmarks.
- Substantial reduction in repeated visual and logical errors.
- Preservation of stable, generalizable strategies across tasks.
- Support for lifelong and cross-domain agentic learning without catastrophic forgetting.
Where Pith is reading between the lines
- The same separation of visual and logical error streams could be tested in single-modality settings to isolate whether the multimodal integration is what drives stability.
- Longer sequences of tasks might reveal whether grow-and-refine eventually requires explicit compression rules to keep memory size bounded.
- The framework suggests that agentic systems in other modalities could benefit from storing failure modes explicitly rather than only successful trajectories.
Load-bearing premise
Separately encoding visual distraction patterns and logical reasoning errors in dual streams, then updating them incrementally, will produce stable multimodal semantic memory that avoids forgetting and generalizes across domains.
What would settle it
Apply the system to a sequence of multimodal tasks in a held-out domain and measure whether pass@1 accuracy fails to rise or repeated visual and logical errors fail to decline relative to a no-memory baseline.
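A hedged sketch of how that settling experiment could be scored, assuming "pass@1" means a single attempt per task and "repeated errors" means error-type tags recurring later in the task sequence; these metric definitions are a plausible reading, not the paper's exact formulas.

```python
from typing import Callable

def pass_at_1(solve: Callable[[dict], str], tasks: list) -> float:
    """Fraction of tasks answered correctly on the first (and only) attempt."""
    return sum(solve(t) == t["answer"] for t in tasks) / len(tasks)

def repeated_error_rate(error_logs: list) -> float:
    """Share of logged errors whose error-type tag already appeared in an earlier task."""
    seen = set()
    repeats = total = 0
    for errors in error_logs:   # one set of error-type tags per task, in task order
        repeats += sum(1 for tag in errors if tag in seen)
        total += len(errors)
        seen |= errors
    return repeats / total if total else 0.0

# The settling test: run the same held-out task sequence with and without memory
# and compare both metrics; no rise in pass@1 and no drop in repeated errors
# relative to the no-memory run would count against the load-bearing premise.
```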
original abstract
MLLMs exhibit strong reasoning on isolated queries, yet they operate de novo -- solving each problem independently and often repeating the same mistakes. Existing memory-augmented agents mainly store past trajectories for reuse. However, trajectory-based memory suffers from brevity bias, gradually losing essential domain knowledge. More critically, even in truly multimodal problem-solving settings, it records only a single-modality trace of past behavior, failing to preserve how visual attention and logical reasoning jointly contributed to the solution. This is fundamentally misaligned with human cognition: semantic memory is both multimodal and integrated, preserving visual and abstract knowledge through coordinated but distinct representational streams. We thus introduce ViLoMem, a dual-stream memory framework that constructs compact, schema-based memory. It separately encodes visual distraction patterns and logical reasoning errors, enabling MLLMs to learn from their successful and failed experiences. Following a grow-and-refine principle, the system incrementally accumulates and updates multimodal semantic knowledge -- preserving stable, generalizable strategies while avoiding catastrophic forgetting. Across six multimodal benchmarks, ViLoMem consistently improves pass@1 accuracy and substantially reduces repeated visual and logical errors. Ablations confirm the necessity of dual-stream memory with explicit distraction-hallucination separation, demonstrating the value of error-aware multimodal memory for lifelong and cross-domain agentic learning. Our project page is available at https://weihao-bo.github.io/ViLoMeo-page.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces ViLoMem, a dual-stream memory framework for multimodal large language models (MLLMs) that separately encodes visual distraction patterns and logical reasoning errors into distinct streams. It employs a grow-and-refine principle to incrementally accumulate and update compact, schema-based multimodal semantic memory, aiming to learn from both successful and failed experiences while avoiding catastrophic forgetting and reducing brevity bias found in trajectory-based approaches. The paper claims consistent pass@1 accuracy improvements and substantial reductions in repeated visual and logical errors across six multimodal benchmarks, with ablations confirming the value of explicit distraction-hallucination separation for lifelong and cross-domain agentic learning.
Significance. If the empirical claims are substantiated with detailed quantitative results, statistical tests, and verification of the error-separation assumption, this could represent a meaningful advance in agentic multimodal learning by providing an integrated yet partitioned semantic memory mechanism that better aligns with human cognition. The grow-and-refine update strategy addresses key shortcomings of existing memory-augmented agents, potentially enabling more stable generalization across domains without the loss of essential visual-logical coordination.
major comments (2)
- [Abstract] Abstract: The central empirical claim states that ViLoMem 'consistently improves pass@1 accuracy' and 'substantially reduces repeated visual and logical errors' across six benchmarks, yet no specific numerical deltas, baseline comparisons, error bars, or statistical significance tests are reported. This absence leaves the magnitude and reliability of the improvements unverifiable and weakens support for the dual-stream necessity.
- [Abstract] Abstract (ablations on dual-stream): The claim that ablations confirm 'the necessity of dual-stream memory with explicit distraction-hallucination separation' does not include quantitative metrics on error-type classification purity, routing accuracy for hybrid visual-logical errors, or failure modes when mixed errors occur. Since many MLLM failures involve hybrid cases (e.g., incorrect visual attention producing flawed logical inferences), this omission directly challenges whether the separate streams can reliably prevent duplication or context loss during grow-and-refine updates.
minor comments (1)
- [Abstract] The project page URL contains a likely typo ('ViLoMeo-page' vs. 'ViLoMem'); ensure consistency with the paper title and framework name.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments on our manuscript. The feedback highlights important aspects of how we present our empirical claims and ablations, and we have revised the manuscript to address these points directly. Below we respond to each major comment.
point-by-point responses
-
Referee: [Abstract] Abstract: The central empirical claim states that ViLoMem 'consistently improves pass@1 accuracy' and 'substantially reduces repeated visual and logical errors' across six benchmarks, yet no specific numerical deltas, baseline comparisons, error bars, or statistical significance tests are reported. This absence leaves the magnitude and reliability of the improvements unverifiable and weakens support for the dual-stream necessity.
Authors: We agree that the abstract would benefit from greater specificity to allow readers to immediately assess the scale of the reported gains. The full experimental section already contains detailed tables reporting per-benchmark pass@1 scores, average improvements over strong baselines (including both trajectory-based and single-stream memory methods), standard deviations across runs, and paired statistical significance tests. In the revised version we will condense the key quantitative highlights into the abstract (e.g., average pass@1 lift of X% and repeated-error reductions of Y–Z% across the six benchmarks) while still respecting length constraints. This change makes the empirical support for the dual-stream design more transparent without altering any results. revision: yes
-
Referee: [Abstract] Abstract (ablations on dual-stream): The claim that ablations confirm 'the necessity of dual-stream memory with explicit distraction-hallucination separation' does not include quantitative metrics on error-type classification purity, routing accuracy for hybrid visual-logical errors, or failure modes when mixed errors occur. Since many MLLM failures involve hybrid cases (e.g., incorrect visual attention producing flawed logical inferences), this omission directly challenges whether the separate streams can reliably prevent duplication or context loss during grow-and-refine updates.
Authors: The referee correctly notes that hybrid visual-logical errors are prevalent and that explicit quantification of separation quality would strengthen the ablation claims. Our existing ablations already demonstrate that the dual-stream model outperforms both a merged single-stream variant and a no-memory baseline on repeated-error metrics, indicating that explicit separation is beneficial. However, we did not report classification purity, routing accuracy on hybrid examples, or a dedicated failure-mode analysis for mixed errors. We will add a new ablation subsection that (1) measures precision and recall of the error-type classifier, (2) evaluates routing accuracy on a curated set of hybrid-error cases, and (3) examines whether grow-and-refine updates introduce duplication or context loss in those cases. These additions will directly address the reliability of the partitioned streams. revision: yes
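As a concrete illustration of what the promised ablation could report, the sketch below computes classifier precision and recall plus routing accuracy on hybrid cases from labeled examples. The field names and the definition of a correctly routed hybrid error are assumptions, not the authors' protocol.

```python
def routing_metrics(examples: list) -> dict:
    """Precision/recall of a visual-vs-logical error router, plus hybrid-case accuracy.

    Each example is assumed to carry:
      'gold'      - set of true error types: {'visual'}, {'logical'}, or both (hybrid)
      'predicted' - set of streams the router actually wrote the error into
    """
    tp = fp = fn = 0
    hybrid_total = hybrid_correct = 0
    for ex in examples:
        gold, pred = ex["gold"], ex["predicted"]
        tp += len(gold & pred)
        fp += len(pred - gold)
        fn += len(gold - pred)
        if len(gold) == 2:                   # hybrid visual+logical failure
            hybrid_total += 1
            hybrid_correct += gold == pred   # routed to both streams, nothing extra
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    hybrid_acc = hybrid_correct / hybrid_total if hybrid_total else 0.0
    return {"precision": precision, "recall": recall, "hybrid_accuracy": hybrid_acc}
```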
Circularity Check
No circularity: empirical architecture validated on external benchmarks
full rationale
The paper describes ViLoMem, a dual-stream grow-and-refine memory framework that separately encodes visual distraction patterns and logical reasoning errors for MLLM agents. All central claims rest on pass@1 accuracy gains and error reductions measured across six independent multimodal benchmarks plus ablations, with no equations, fitted parameters, or derivations presented that reduce by construction to the method's own inputs. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work are invoked to justify the architecture; the design choices are presented as motivated by cognitive alignment and then tested externally. The work is therefore self-contained against independent evaluation rather than internally circular.
Axiom & Free-Parameter Ledger
free parameters (1)
- Memory update thresholds and schema compaction rules (see the sketch after this ledger)
axioms (2)
- domain assumption: Semantic memory benefits from separate but coordinated visual and abstract streams
- domain assumption: Storing past errors enables MLLMs to avoid repeating them in future multimodal tasks
invented entities (1)
- ViLoMem dual-stream memory framework (no independent evidence)
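Since the ledger leaves the update thresholds and compaction rules unspecified, here is a minimal sketch of how a bounded grow-and-refine step could be parameterized; the threshold names, default values, and usage-count scoring are placeholders rather than ViLoMem's actual rules.

```python
def update_and_compact(memory: dict, key: str, schema: dict,
                       min_uses: int = 2, max_entries: int = 200) -> dict:
    """Hypothetical grow-and-refine step with a bounded memory budget.

    min_uses and max_entries stand in for the unspecified update thresholds
    and compaction rules; their values here are illustrative only.
    """
    if key in memory:                                   # refine: overwrite guidance, keep counts
        memory[key]["guidance"] = schema["guidance"]
        memory[key]["uses"] = memory[key].get("uses", 0) + 1
    else:                                               # grow: admit a new schema entry
        memory[key] = {**schema, "uses": 1}
    if len(memory) > max_entries:                       # compact: drop rarely used entries first
        kept = {k: v for k, v in memory.items() if v["uses"] >= min_uses} or memory
        survivors = sorted(kept.items(), key=lambda kv: kv[1]["uses"], reverse=True)
        memory = dict(survivors[:max_entries])
    return memory
```

Such a rule would also speak to the open question above of whether grow-and-refine needs explicit compression to keep memory size bounded over long task sequences.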
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (unclear)
unclear: relation between the paper passage and the cited Recognition theorem.
ViLoMem, a dual-stream memory framework that separately encodes visual distraction patterns and logical reasoning errors... Following a grow-and-refine principle, the system incrementally accumulates and updates multimodal semantic knowledge
-
IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking (unclear)
unclear: relation between the paper passage and the cited Recognition theorem.
Ablations confirm the necessity of dual-stream memory with explicit distraction-hallucination separation
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Rishabh Agarwal, Avi Singh, Lei Zhang, Bernd Bohnet, Luis Rosias, Stephanie Chan, Biao Zhang, Ankesh Anand, Zaheer Abbas, Azade Nova, et al. Many-shot in-context learning. Advances in Neural Information Processing Systems, 37:76930–76966, 2024.
- [2] Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, et al. GEPA: Reflective prompt evolution can outperform reinforcement learning. arXiv preprint arXiv:2507.19457, 2025.
- [3] Keivan Alizadeh, Seyed Iman Mirzadeh, Dmitry Belenko, S Khatamifard, Minsik Cho, Carlo C Del Mundo, Mohammad Rastegari, and Mehrdad Farajtabar. LLM in a flash: Efficient large language model inference with limited memory. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12562–12584, 2024.
- [4] Ali Ayub, Chrystopher L Nehaniv, and Kerstin Dautenhahn. Interactive continual learning architecture for long-term personalization of home service robots. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 11289–11296. IEEE, 2024.
- [5] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.
- [6] Yuxuan Cai, Yipeng Hao, Jie Zhou, et al. Building self-evolving agents via experience-driven lifelong learning: A framework and benchmark. arXiv preprint arXiv:2508.19005, 2025.
- [7] Lin Chen, Jisong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. MMStar: Are we on the right way for evaluating large vision-language models? arXiv preprint arXiv:2403.20330, 2024.
- [8] Qizhou Chen, Taolin Zhang, Xiaofeng He, Dongyang Li, Chengyu Wang, Longtao Huang, et al. Lifelong knowledge editing for LLMs with retrieval-augmented continuous prompt learning. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 13565–13580, 2024.
- [9] Alex Clarke and Lorraine K Tyler. Object-specific semantic coding in human perirhinal cortex. Journal of Neuroscience, 34(14):4766–4775, 2014.
- [10] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.
- [11] Yue Fan, Xiaojian Ma, Rujie Wu, Yuntao Du, Jiaqi Li, Zhi Gao, and Qing Li. VideoAgent: A memory-augmented multimodal agent for video understanding. In European Conference on Computer Vision, pages 75–92. Springer, 2024.
- [12] Jizhan Fang, Xinle Deng, Haoming Xu, Ziyan Jiang, Yuqi Tang, Ziwen Xu, Shumin Deng, Yunzhi Yao, Mengru Wang, Shuofei Qiao, et al. LightMem: Lightweight and efficient memory-augmented generation. arXiv preprint arXiv:2510.18866, 2025.
- [13] Jinyuan Fang, Yanwen Peng, Xi Zhang, Yingxu Wang, Xinhao Yi, Guibin Zhang, Yi Xu, Bin Wu, Siwei Liu, Zihao Li, et al. A comprehensive survey of self-evolving AI agents: A new paradigm bridging foundation models and lifelong agentic systems. arXiv preprint arXiv:2508.07407, 2025.
- [14] Huan-ang Gao, Jiayi Geng, Wenyue Hua, Mengkang Hu, Xinzhe Juan, Hongzhang Liu, Shilong Liu, Jiahao Qiu, Xuan Qi, Yiran Wu, et al. A survey of self-evolving agents: On path to artificial super intelligence. arXiv preprint arXiv:2507.21046, 2025.
- [15] Shuzheng Gao, Chaozheng Wang, Cuiyun Gao, Xiaoqian Jiao, Chun Yong Chong, Shan Gao, and Michael Lyu. The prompt alchemist: Automated LLM-tailored prompt optimization for test case generation. arXiv preprint arXiv:2501.01329, 2025.
- [16] Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, et al. HallusionBench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. arXiv preprint arXiv:2310.14566, 2023.
- [17] Philipp Kuhnke, Curtiss A Chapman, Vincent KM Cheung, Sabrina Turker, Astrid Graessner, Sandra Martin, Kathleen A Williams, and Gesa Hartwigsen. The role of the angular gyrus in semantic cognition: a synthesis of five functional neuroimaging studies. Brain Structure and Function, 228(1):273–291, 2023.
- [18] Matthew A Lambon Ralph, Karen Sage, Roy W Jones, and Emily J Mayberry. Coherent concepts are computed in the anterior temporal lobes. Proceedings of the National Academy of Sciences, 107(6):2717–2722, 2010.
- [19] Michael M Lombardo and Robert W Eichinger. The Career Architect Development Planner. Lominger, Minneapolis, 1st edition, 1996.
- [21] Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. MathVista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255, 2023.
- [22] OpenCompass Contributors. VLMEvalKit: Open-source evaluation toolkit for large vision-language models, 2024.
- [23] César Santos, Fumio Machida, and Ermeson Andrade. Experimental investigation of memory-related software aging in LLM systems. Journal of Systems and Software, page 112653, 2025.
- [24] Rulin Shao, Jacqueline He, Akari Asai, Weijia Shi, Tim Dettmers, Sewon Min, Luke Zettlemoyer, and Pang Wei Koh. Scaling retrieval-based language models with a trillion-token datastore. Advances in Neural Information Processing Systems, 37:91260–91299, 2024.
- [25] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36:8634–8652, 2023.
- [26] Hao Sun et al. MathGlance: A benchmark for math at a glance understanding. arXiv preprint, 2025. Placeholder: shows the visual perception bottleneck in mathematical reasoning; update with full citation when available.
- [27] Mirac Suzgun, Mert Yuksekgonul, Federico Bianchi, Dan Jurafsky, and James Zou. Dynamic cheatsheet: Test-time learning with adaptive memory. arXiv preprint arXiv:2504.07952, 2025.
- [28] Zhen Tan, Jun Yan, I-Hung Hsu, Rujun Han, Zifeng Wang, Long Le, Yiwen Song, Yanfei Chen, Hamid Palangi, George Lee, et al. In prospect and retrospect: Reflective memory management for long-term personalized dialogue agents. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8416–843..., 2025.
- [29] V Team, Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, et al. GLM-4.5V and GLM-4.1V-Thinking: Towards versatile multimodal reasoning with scalable reinforcement learning. arXiv preprint arXiv:2507.01006, 2025.
- [30] Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes wide shut? Exploring the visual shortcomings of multimodal LLMs. arXiv preprint arXiv:2401.06209, 2024. Identifies nine visual patterns where MLLMs systematically fail; shows CLIP encoder limitations cascade to reasoning failures.
- [31] Boshi Wang, Weijian Xu, Yunsheng Li, Mei Gao, Yujia Xie, Huan Sun, and Dongdong Chen. Improving code localization with repository memory. arXiv preprint arXiv:2510.01003, 2025.
- [32] Fei Wang, Xingchen Wan, Ruoxi Sun, Jiefeng Chen, and Sercan O Arik. Astute RAG: Overcoming imperfect retrieval augmentation and knowledge conflicts for large language models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 30553–30571, 2025.
- [33] Ke Wang, Junting Ren, Weikang Yuan, Sicong Wang, Zihao Yang, Wentao Ma, and Wanli Ouyang. MATH-Vision: A challenging mathematical reasoning benchmark requiring visual understanding. arXiv preprint arXiv:2409.13925, 2024.
- [34] Xiaohan Wang et al. MEG evidence that modality-independent conceptual representations contain semantic and visual features. Journal of Neuroscience, 44(28), 2024. Evidence that the ATL acts as a hub integrating sensory-motor features into coherent conceptual representations.
- [35] Rong Wu, Xiaoman Wang, Jianbiao Mei, Pinlong Cai, Daocheng Fu, Cheng Yang, Licheng Wen, Xuemeng Yang, Yufan Shen, Yuxin Wang, et al. EvolveR: Self-evolving LLM agents through an experience-driven lifecycle. arXiv preprint arXiv:2510.16079, 2025.
- [36] Shengqiong Wu, Hao Fei, Liangming Pan, William Yang Wang, Shuicheng Yan, and Tat-Seng Chua. Combating multimodal LLM hallucination via bottom-up holistic reasoning. arXiv preprint arXiv:2412.11124, 2024. Shows insufficient visual comprehension causes hallucinations; identifies object, attribute, and relationship perception errors.
- [37] Yingsheng Wu, Yuxuan Gu, Xiaocheng Feng, Weihong Zhong, Dongliang Xu, Qing Yang, Hongtao Liu, and Bing Qin. Extending context window of large language models from a distributional perspective. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 7288–7301, 2024.
- [38] Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, et al. DeepSeek-VL2: Mixture-of-experts vision-language models for advanced multimodal understanding. arXiv preprint arXiv:2412.10302, 2024.
- [39] xAI Team. RealWorldQA: A benchmark for real-world visual understanding, 2024. Real-world spatial understanding benchmark with 765 images.
- [40] Wujiang Xu, Kai Mei, Hang Gao, Juntao Tan, Zujie Liang, and Yongfeng Zhang. A-MEM: Agentic memory for LLM agents. arXiv preprint arXiv:2502.12110, 2025.
- [41] Yutao Yang, Jie Zhou, Junsong Li, Qianjun Pan, Bihao Zhan, Qin Chen, Xipeng Qiu, and Liang He. Reinforced interactive continual learning via real-time noisy human feedback. arXiv preprint arXiv:2505.09925, 2025.
- [42] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations, 2023.
- [43] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. arXiv preprint arXiv:2311.16502, 2023.
- [44] Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Zhi Huang, Carlos Guestrin, and James Zou. TextGrad: Automatic "differentiation" via text. arXiv preprint arXiv:2406.07496, 2024.
- [45] Author Names Zhang. Primitive visual perception for multimodal reasoning. arXiv preprint, 2025. Placeholder: shows 72–78% of math reasoning failures stem from perception errors exceeding logic errors; update with full citation.
- [46] Guibin Zhang, Hejia Geng, Xiaohang Yu, Zhenfei Yin, Zaibin Zhang, Zelin Tan, Heng Zhou, Zhongzhi Li, Xiangyuan Xue, Yijiang Li, et al. The landscape of agentic reinforcement learning for LLMs: A survey. arXiv preprint arXiv:2509.02547, 2025.
- [47] Jiarui Zhang, Mahyar Khayatkhoei, Prateek Chhikara, and Filip Ilievski. MLLMs know where to look: Training-free perception of small visual details with multimodal LLMs. In The Thirteenth International Conference on Learning Representations, 2025.
- [48] Kai Zhang, Xiangchao Chen, Bo Liu, Tianci Xue, Zeyi Liao, Zhihan Liu, Xiyao Wang, Yuting Ning, Zhaorun Chen, Xiaohan Fu, et al. Agent learning via early experience. arXiv preprint arXiv:2510.08558, 2025.
- [49] Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsidhar Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji, Hanchen Li, et al. Agentic context engineering: Evolving contexts for self-improving language models. arXiv preprint arXiv:2510.04618, 2025.
- [50] Yichi Zhang, Zhuo Chen, Lingbing Guo, Yajing Xu, Min Zhang, Wen Zhang, and Huajun Chen. Abstractive visual understanding of multi-modal structured knowledge: A new perspective for MLLM evaluation. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 12323–12332, 2025.
- [51] Zeyu Zhang, Quanyu Dai, Xiaohe Bo, Chen Ma, Rui Li, Xu Chen, Jieming Zhu, Zhenhua Dong, and Ji-Rong Wen. A survey on the memory mechanism of large language model-based agents. ACM Transactions on Information Systems, 43(6):1–47, 2025.
- [52] Zijia Zhao, Yuqi Huo, Tongtian Yue, Longteng Guo, Haoyu Lu, Bingning Wang, Weipeng Chen, and Jing Liu. Efficient motion-aware video MLLM. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 24159–24168, 2025.
- [53] Chenyue Zhou, Mingxuan Wang, Yanbiao Ma, Chenxu Wu, Wanyi Chen, et al. From perception to cognition: A survey of vision-language interactive reasoning in multimodal large language models. arXiv preprint arXiv:2509.25373.
- [54] Comprehensive survey on perception-cognition disconnect; shows static visual processing causes decoupling between answers and visual facts.
- [55] Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025.