MemOCR: Layout-Aware Visual Memory for Efficient Long-Horizon Reasoning
Pith reviewed 2026-05-21 15:10 UTC · model grok-4.3
The pith
MemOCR renders structured memory as layout-rich images so agents can visually prioritize key evidence while compressing the rest under tight budgets.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MemOCR maintains a structured rich-text memory and renders it into an image that the agent consults for memory access, visually prioritizing crucial evidence while aggressively compressing auxiliary details; the system is trained with reinforcement learning under budget-aware objectives that expose the agent to diverse compression levels.
What carries the argument
Layout-aware visual memory created by rendering structured rich-text (headings, highlights) into an image for direct visual consultation by the agent.
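To make the idea concrete, here is a minimal sketch of how a structured memory could be serialized as rich text before rendering, so that crucial evidence gets a larger visual footprint than auxiliary detail. The `MemoryItem` class, the three-level `priority` scale, and the mapping from priority to headings, bold text, and plain text are all illustrative assumptions, not the paper's actual format.

```python
from dataclasses import dataclass


@dataclass
class MemoryItem:
    text: str
    priority: int  # hypothetical scale: 2 = crucial, 1 = supporting, 0 = auxiliary


def to_rich_text(items):
    """Serialize memory items as Markdown, most important first."""
    lines = []
    # sorted() is stable, so items with equal priority keep their original order.
    for item in sorted(items, key=lambda it: -it.priority):
        if item.priority >= 2:
            lines.append(f"## {item.text}")   # heading: largest visual footprint
        elif item.priority == 1:
            lines.append(f"**{item.text}**")  # highlight: bold body text
        else:
            lines.append(item.text)           # plain text: cheapest to shrink
    return "\n\n".join(lines)
```

A renderer could then draw headings large and plain text small, which is one way layout could stand in for a uniform per-token cost.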
If this is right
- Outperforms strong text-based baselines across long-context multi-hop and single-hop question-answering benchmarks.
- Achieves more effective context utilization when memory budgets are pushed to extreme levels.
- Allocates memory space with adaptive information density through visual layout rather than uniform token cost.
- Remains robust when trained under budget-aware reinforcement learning that varies compression levels.
Where Pith is reading between the lines
- The same visual-rendering step could be tested on planning or tool-use agents that also accumulate long histories.
- Hybrid text-plus-image memory might reduce the need for ever-larger context windows in general agent architectures.
- If visual prioritization works, it suggests that future memory systems should treat layout and salience as first-class resources rather than after-the-fact compression.
Load-bearing premise
That turning structured memory into an image will let the agent reliably pick out crucial evidence and drop auxiliary details without losing what it needs, no matter how small the budget becomes.
What would settle it
On the same long-context QA benchmarks, measure whether accuracy and evidence-use scores drop below strong text baselines once the memory budget is forced below a fixed low threshold such as 512 tokens.
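The check described above can be expressed as a small comparison over per-budget accuracies. This is only a sketch: the function name, the dictionary layout (budget in tokens mapped to accuracy), and the default threshold of 512 are assumptions taken from the wording of the test, not from the paper's evaluation code.

```python
def budgets_where_baseline_wins(memocr_acc, text_acc, threshold=512):
    """Return the budgets below `threshold` at which the text baseline beats MemOCR.

    Both arguments map a memory budget (in tokens) to an accuracy score
    measured on the same benchmark. An empty result means the claim
    survives at every low budget that both systems were evaluated on.
    """
    shared = sorted(set(memocr_acc) & set(text_acc))
    return [b for b in shared if b < threshold and memocr_acc[b] < text_acc[b]]
```

Any non-empty return value would identify the budgets at which the load-bearing premise fails.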
Original abstract
Long-horizon agentic reasoning necessitates effectively compressing growing interaction histories into a limited context window. Most existing memory systems serialize history as text, where token-level cost is uniform and scales linearly with length, often spending scarce budget on low-value details. To this end, we introduce MemOCR, a multimodal memory agent that improves long-horizon reasoning under tight context budgets by allocating memory space with adaptive information density through visual layout. Concretely, MemOCR maintains a structured rich-text memory (e.g., headings, highlights) and renders it into an image that the agent consults for memory access, visually prioritizing crucial evidence while aggressively compressing auxiliary details. To ensure robustness across varying memory budgets, we train MemOCR with reinforcement learning under budget-aware objectives that expose the agent to diverse compression levels. Across long-context multi-hop and single-hop question-answering benchmarks, MemOCR outperforms strong text-based baselines and achieves more effective context utilization under extreme budgets.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces MemOCR, a multimodal memory agent for long-horizon reasoning under tight context budgets. It maintains structured rich-text memory (headings, highlights), renders it to an image for visual consultation by the agent, and trains the system via reinforcement learning with budget-aware objectives to enable adaptive compression. The central claim is that this visual-layout approach allows prioritizing crucial evidence while aggressively compressing auxiliary details, yielding better performance than text-based baselines on long-context multi-hop and single-hop QA benchmarks.
Significance. If the core mechanism holds, the work offers a promising direction for efficient memory management in agentic systems by shifting from uniform text token costs to layout-driven visual density. The budget-aware RL training is a positive design choice for robustness. The approach could influence multimodal agent architectures if the visual prioritization proves reliable.
major comments (2)
- [§3] §3 (Method): The rendering of structured rich-text into images and the mechanism by which the vision encoder/policy exploits layout cues (headings, highlights) for selective prioritization rather than uniform processing are not described in sufficient detail. No information is given on image resolution, scaling with budget, or how the RL objective specifically enforces layout-based attention. This is load-bearing for the central claim, as failure here would collapse the advantage over text baselines.
- [§4] §4 (Experiments): The reported outperformance on QA benchmarks lacks ablations that isolate the visual layout component from other factors such as the RL objective or memory structure. Without these, it is unclear whether gains stem from the claimed visual prioritization or from other implementation choices.
minor comments (2)
- [Abstract] The abstract and introduction would benefit from explicit quantification of 'extreme budgets' (e.g., specific token or pixel limits) to ground the claims.
- [Figures] Figure captions describing rendered memory examples could more clearly annotate how layout elements are preserved or compressed at different budgets.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and have prepared revisions to the manuscript to improve clarity and experimental rigor.
-
Referee: [§3] §3 (Method): The rendering of structured rich-text into images and the mechanism by which the vision encoder/policy exploits layout cues (headings, highlights) for selective prioritization rather than uniform processing are not described in sufficient detail. No information is given on image resolution, scaling with budget, or how the RL objective specifically enforces layout-based attention. This is load-bearing for the central claim, as failure here would collapse the advantage over text baselines.
Authors: We agree that the current description in Section 3 is insufficiently detailed on these points. In the revised manuscript we will expand the method section to specify the rich-text to image rendering procedure (including layout preservation for headings and highlights), the image resolution and its scaling with available budget, the vision encoder's handling of visual density cues, and the precise formulation of the budget-aware RL objective that encourages layout-driven prioritization. These additions will directly support the central claim.
Revision: yes
-
Referee: [§4] §4 (Experiments): The reported outperformance on QA benchmarks lacks ablations that isolate the visual layout component from other factors such as the RL objective or memory structure. Without these, it is unclear whether gains stem from the claimed visual prioritization or from other implementation choices.
Authors: We acknowledge the validity of this concern. While the existing comparisons to text baselines demonstrate overall gains, we will add targeted ablations in the revised experiments section. These will include a text-only memory variant trained with the same RL procedure, removal of specific layout elements (headings/highlights), and controlled variations of the RL objective to isolate the contribution of visual layout prioritization.
Revision: yes
Circularity Check
No significant circularity; empirical claims rest on independent RL training and benchmark evaluation
The paper presents MemOCR as a new architecture that maintains structured rich-text memory, renders it to images, and trains the agent via reinforcement learning under budget-aware objectives to enable visual prioritization of evidence. Performance gains are reported as direct empirical outcomes on long-context multi-hop and single-hop QA benchmarks against text baselines. No equations, fitted parameters, or predictions are shown to reduce by construction to the inputs; the layout-exploitation assumption is tested rather than defined into existence, and no load-bearing self-citations or uniqueness theorems are invoked in the provided description. The derivation chain is therefore self-contained against external benchmarks.
Forward citations
Cited by 6 Pith papers
-
Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning
Skill1 trains one policy to jointly evolve skill query generation, re-ranking, task solving, and distillation from a single task-success signal, with low-frequency trends crediting selection and high-frequency variati...
-
ScrapMem: A Bio-inspired Framework for On-device Personalized Agent Memory via Optical Forgetting
ScrapMem introduces optical forgetting to compress multimodal memories for LLM agents on edge devices, cutting storage by up to 93% while reaching 51.0% Joint@10 and 70.3% Recall@10 on ATM-Bench.
-
POINTS-Seeker: Towards Training a Multimodal Agentic Search Model from Scratch
POINTS-Seeker-8B is an 8B multimodal model trained from scratch for agentic search that uses seeding and visual-space history folding to outperform prior models on six visual reasoning benchmarks.
-
MemReread: Enhancing Agentic Long-Context Reasoning via Memory-Guided Rereading
MemReread improves agent long-context reasoning by triggering rereading on insufficient final memory to recover discarded indirect facts, outperforming baselines at linear complexity.
-
Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning
Skill1 co-evolves skill selection, utilization, and distillation inside a single policy using only task-outcome reward, with low-frequency trends crediting selection and high-frequency variation crediting distillation...
-
Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning
Skill1 trains a single RL policy to co-evolve skill selection, utilization, and distillation in language model agents from one task-outcome reward, using low-frequency trends to credit selection and high-frequency var...
Reference graph
Works this paper leans on
-
[1]
Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923,
-
[2]
Constitutional AI: Harmlessness from AI Feedback
Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., et al. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073,
-
[3]
Realmem: Benchmarking llms in real-world memory-driven interaction
Bian, H., Yao, Z., Hu, S., Xu, Z., Zhang, S., Guo, Y., Yang, Z., Han, X., Wang, H., and Chen, R. Realmem: Benchmarking llms in real-world memory-driven interaction. arXiv preprint arXiv:2601.06966,
-
[4]
Glyph: Scaling context windows via visual-text compression
Cheng, J., Liu, Y., Zhang, X., Fei, Y., Hong, W., Lyu, R., Wang, W., Su, Z., Gu, X., Liu, X., et al. Glyph: Scaling context windows via visual-text compression. arXiv preprint arXiv:2510.17800,
-
[5]
Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory
Chhikara, P., Khant, D., Aryan, S., Singh, T., and Yadav, D. Mem0: Building production-ready ai agents with scalable long-term memory. arXiv preprint arXiv:2504.19413,
-
[6]
DeepSeek-AI, Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., Zhang, X., Yu, X., Wu, Y., Wu, Z. F., Gou, Z., Shao, Z., Li, Z., Gao, Z., Liu, A., Xue, B., Wang, B., Wu, B., Feng, B., Lu, C., Zhao, C., Deng, C., Zhang, C., Ruan, C., Dai, D., Chen, D., Ji, D., Li, E., Lin, F., Dai, F., Luo, F., Hao, G., Chen, G.,...
-
[7]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
doi: 10.48550/ARXIV.2501.12948. Du, M., Xu, B., Zhu, C., Wang, X., and Mao, Z. Deepresearch bench: A comprehensive benchmark for deep research agents. arXiv preprint arXiv:2506.11763, 2025a. Du, Y., Huang, W., Zheng, D., Wang, Z., Montella, S., Lapata, M., Wong, K.-F., and Pan, J. Z. Rethinking memory in ai: Taxonomy, operations, topics, and future dir...
-
[9]
LightMem: Lightweight and Efficient Memory-Augmented Generation
URL https://doi.org/10.48550/arXiv.2510.18866. Feng, L., Yang, F., Chen, F., Cheng, X., Xu, H., Wan, Z., Yan, M., and An, B. Agentocr: Reimagining agent history via optical self-compression. arXiv preprint arXiv:2601.04786,
-
[10]
RULER: What's the Real Context Size of Your Long-Context Language Models?
Hsieh, C.-P., Sun, S., Kriman, S., Acharya, S., Rekesh, D., Jia, F., Zhang, Y., and Ginsburg, B. Ruler: What’s the real context size of your long-context language models? arXiv preprint arXiv:2404.06654,
-
[11]
Hu, S., Wei, Y., Ran, J., Yao, Z., and Zou, L. Does memory need graphs? a unified framework and empirical analysis for long-term dialog memory. arXiv preprint arXiv:2601.01280,
-
[12]
Memory in the Age of AI Agents
Hu, Y., Liu, S., Yue, Y., Zhang, G., Liu, B., Zhu, F., Lin, J., Guo, H., Dou, S., Xi, Z., et al. Memory in the age of ai agents. arXiv preprint arXiv:2512.13564,
-
[13]
Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning
Jin, B., Zeng, H., Yue, Z., Wang, D., Zamani, H., and Han, J. Search-r1: Training llms to reason and leverage search engines with reinforcement learning. arXiv preprint arXiv:2503.09516,
-
[14]
MemOS: A Memory OS for AI System
Li, Z., Song, S., Xi, C., Wang, H., Tang, C., Niu, S., Chen, D., Yang, J., Li, C., Yu, Q., et al. Memos: A memory os for ai system. arXiv preprint arXiv:2507.03724,
-
[15]
Large Language Model Agent: A Survey on Methodology, Applications and Challenges
Luo, J., Zhang, W., Yuan, Y., Zhao, Y., Yang, J., Gu, Y., Wu, B., Chen, B., Qiao, Z., Long, Q., et al. Large language model agent: A survey on methodology, applications and challenges. arXiv preprint arXiv:2503.21460,
-
[16]
Matarazzo, A. and Torlone, R. A survey on large language models with some insights on their capabilities and limitations. arXiv preprint arXiv:2501.04040,
-
[17]
Modarressi, A., Deilamsalehy, H., Dernoncourt, F., Bui, T., Rossi, R. A., Yoon, S., and Schütze, H. Nolima: Long-context evaluation beyond literal matching. arXiv preprint arXiv:2502.05167,
-
[18]
High-Dimensional Continuous Control Using Generalized Advantage Estimation
Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438,
-
[19]
Proximal Policy Optimization Algorithms
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347,
-
[20]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300,
-
[21]
Sheng, L., Zhang, Y., Ma, W., Shi, Y., Huang, T., Wang, X., Zhang, A., Shen, K., and Chua, T.-S. When to memorize and when to stop: Gated recurrent memory for long-context reasoning. arXiv preprint arXiv:2602.10560,
-
[22]
Shi, Y., Chen, Y., Wang, S., Li, S., Cai, H., Gu, Q., Wang, X., and Zhang, A. Look back to reason forward: Revisitable memory for long-context llm agents. arXiv preprint arXiv:2509.23040, 2025a. Shi, Y., Li, S., Wu, C., Liu, Z., Fang, J., Cai, H., Zhang, A., and Wang, X. Search and refine during think: Facilitating knowledge refinement for improved re...
-
[23]
R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning
Song, H., Jiang, J., Min, Y., Chen, J., Chen, Z., Zhao, W. X., Fang, L., and Wen, J.-R. R1-searcher: Incentivizing the search capability in llms via reinforcement learning. arXiv preprint arXiv:2503.05592,
-
[24]
Mem-{\alpha}: Learning Memory Construction via Reinforcement Learning
Wang, Y., Takanobu, R., Liang, Z., Mao, Y., Hu, Y., McAuley, J., and Wu, X. Mem-{\alpha}: Learning memory construction via reinforcement learning. arXiv preprint arXiv:2509.25911,
-
[25]
Wang, Y., Jing, Y., Liu, S., Guan, H., Tu, R.-c., Wang, C., Huang, J., and Tao, D. Vtc-r1: Vision-text compression for efficient long-context reasoning. arXiv preprint arXiv:2601.22069,
-
[26]
DeepSeek-OCR: Contexts Optical Compression
Wei, H., Sun, Y., and Li, Y. Deepseek-ocr: Contexts optical compression. arXiv preprint arXiv:2510.18234,
-
[27]
Xing, L., Wang, A. J., Yan, R., Shu, X., and Tang, J. Vision-centric token compression in large language model. arXiv preprint arXiv:2502.00791,
-
[28]
Xue, Z., Zheng, L., Liu, Q., Li, Y., Zheng, X., Ma, Z., and An, B. Simpletir: End-to-end reinforcement learning for multi-turn tool-integrated reasoning. arXiv preprint arXiv:2509.02479,
-
[29]
Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115,
-
[30]
Yang, A., Yu, B., Li, C., Liu, D., Huang, F., Huang, H., Jiang, J., Tu, J., Zhang, J., Zhou, J., et al. Qwen2.5-1m technical report. arXiv preprint arXiv:2501.15383,
-
[31]
Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W., Salakhutdinov, R., and Manning, C. D. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2369–2380,
-
[32]
MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent
Yu, H., Chen, T., Feng, J., Chen, J., Dai, W., Yu, Q., Zhang, Y.-Q., Ma, W.-Y., Liu, J., Wang, M., et al. Memagent: Reshaping long-context llm with multi-conv rl-based memory agent. arXiv preprint arXiv:2507.02259,
-
[33]
Deep research: A survey of autonomous research agents
Zhang, W., Li, X., Zhang, Y., Jia, P., Wang, Y., Guo, H., Liu, Y., and Zhao, X. Deep research: A survey of autonomous research agents. arXiv preprint arXiv:2508.12752, 2025a. Zhang, Z., Dai, Q., Bo, X., Ma, C., Li, R., Chen, X., Zhu, J., Dong, Z., and Wen, J.-R. A survey on the memory mechanism of large language model-based agents. ACM Transactions on In...
-
[34]
or distilled reward models (Bai et al., 2022), the field has gradually shifted toward rule-based feedback, demonstrating great potential in enhancing model capabilities. Key algorithmic contributions include proximal policy optimization (Schulman et al., 2017), based on generalized advantage estimation (Schulman et al., 2015), and GRPO (Shao et al., 2024)...
-
[35]
enables multi-turn searching by leveraging raw history throughout the reasoning process. The second paradigm adopts textual summary memory, where long-context information is compressed into concise text forms rather than retaining the full raw history (Duverger et al., 2024; Yu et al., 2025; Wang et al., 2025; Bian et al., 2026; Hu et al., 2026; Shi et al...
-
[36]
trains the agent to manage complex hierarchical memory systems, i.e., to extract, store, and update memory corpora of varying sizes and importance. OCR for Context Compression. Optical Character Recognition (OCR) (Arlazarov et al., 2022; Smith,
-
[37]
is a well-established technology that is widely utilized for extracting textual content embedded in image format. In recent research advances, OCR has been explored as an innovative vision-text compression paradigm (Wei et al., 2025; Xing et al., 2025; Cheng et al., 2025). Unlike the conventional practice of directly inputting long context into LLMs, thi...
-
[38]
Given an input Markdown string, the module (i) normalizes the text by stripping leading/trailing whitespace and surrounding backticks, (ii) converts Markdown to HTML using the Python markdown library, (iii) wraps the generated HTML in a fixed, inlined CSS template, and (iv) renders the HTML in a headless Chromium page and returns a screenshot image. Bud...
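Steps (i) and (iii) of the pipeline described above can be sketched with the standard library alone. The function names and the CSS values below are hypothetical placeholders, not the paper's template. Steps (ii) and (iv) depend on third-party tools (a Markdown converter and a headless browser) and are only marked in comments.

```python
# Hypothetical stylesheet: large headings, small body text. Not the paper's template.
CSS = "body { font-family: sans-serif; } h2 { font-size: 24px; } p { font-size: 12px; }"


def normalize_markdown(text: str) -> str:
    """Step (i): strip outer whitespace and any surrounding backticks."""
    # Note: a language tag after an opening fence (e.g. "markdown") is not removed.
    return text.strip().strip("`").strip()


def wrap_html(body_html: str, css: str = CSS) -> str:
    """Step (iii): embed already-converted HTML in a fixed page with inlined CSS."""
    return (
        "<!DOCTYPE html><html><head><meta charset='utf-8'>"
        f"<style>{css}</style></head><body>{body_html}</body></html>"
    )


# Step (ii) would convert the normalized Markdown to HTML before wrap_html is called.
# Step (iv) would load the wrapped page in a headless browser and take a screenshot.
```

Keeping the CSS fixed and inlined means the rendered image depends only on the memory content, which makes the output reproducible across runs.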
-
[39]
We use sub-word exact match (SEM) as accuracy and report the mean scores over three independent runs with different random seeds. A statistical significance analysis is conducted in Appendix D.1. Unless otherwise noted, we use stochastic decoding (temperature = 0.7, top-p = 0.95) for b...
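One common reading of sub-word exact match is that a prediction counts as correct when a normalized gold answer appears inside the normalized prediction. The sketch below follows that reading. The normalization steps (lowercasing, dropping punctuation and English articles) and the function names are assumptions, and the paper's exact definition may differ.

```python
import re
import string
from statistics import mean


def normalize(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def sub_em(prediction, golds):
    """True if any non-empty normalized gold answer is contained in the prediction."""
    pred = normalize(prediction)
    return any(normalize(g) and normalize(g) in pred for g in golds)


def mean_accuracy(per_seed_scores):
    """Average accuracy over independent runs (three seeds in the text above)."""
    return mean(per_seed_scores)
```

The containment test is more lenient than strict exact match, which matters when a model wraps the answer in a full sentence.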
-
[40]
to match the model size of other baselines. The chunk size |C_t| is set to 1,000 following the authors’ setup. • Mem0 (Chhikara et al., 2025): we run reproduction following the official documentation
-
[41]
Who has a wider scope of profession...?
The results indicate that the image rendering process is lightweight, consuming only 1 second per 68 samples and a 0.175 extra latency. D.3. Additional Ablation Studies. Motivation. Table 2 shows that removing RL from our 7B setting causes substantial degradation, especially under strict memory budgets. A natural question is whether simply scaling the ba...