MemOCR: Layout-Aware Visual Memory for Efficient Long-Horizon Reasoning

An Zhang; Hui Su; Qi Gu; Shugui Liu; Wenyu Mao; Xiang Wang; Xunliang Cai; Yaorui Shi; Yuxin Chen; Yu Yang

arxiv: 2601.21468 · v5 · pith:6UO6OYAKnew · submitted 2026-01-29 · 💻 cs.AI

Yaorui Shi, Shugui Liu, Yu Yang, Wenyu Mao, Yuxin Chen, Qi Gu, Hui Su, Xunliang Cai, Xiang Wang, An Zhang

Pith reviewed 2026-05-21 15:10 UTC · model grok-4.3

classification 💻 cs.AI

keywords memory compression · long-horizon reasoning · multimodal agents · visual layout · context budget · reinforcement learning · question answering · agentic reasoning

The pith

MemOCR renders structured memory as layout-rich images so agents can visually prioritize key evidence while compressing the rest under tight budgets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that serializing long interaction histories as plain text wastes limited context on low-value details because every token costs the same. Instead, MemOCR keeps a rich-text memory with headings and highlights, turns it into an image, and lets the agent read that image to decide what matters most. This visual step is meant to create adaptive density: important facts stay readable while everything else shrinks. The claim matters for any agent that must reason over many steps when the context window cannot grow, because better compression should produce higher accuracy on multi-hop and single-hop QA tasks even at extreme budget levels. The method is trained with reinforcement learning that exposes the agent to many different compression targets so the behavior stays stable.

Core claim

MemOCR maintains a structured rich-text memory and renders it into an image that the agent consults for memory access, visually prioritizing crucial evidence while aggressively compressing auxiliary details; the system is trained with reinforcement learning under budget-aware objectives that expose the agent to diverse compression levels.

What carries the argument

Layout-aware visual memory created by rendering structured rich-text (headings, highlights) into an image for direct visual consultation by the agent.
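
To make the idea of "visual priority via formatting" concrete, here is a minimal sketch of how a rich-text memory could be drafted before rendering. The `MemoryItem` type, the three-level priority scale, and the mapping from priority to Markdown markup are all illustrative assumptions; the paper's actual drafting policy is learned by the agent and is not specified in this summary.

```python
from dataclasses import dataclass


@dataclass
class MemoryItem:
    text: str
    priority: int  # hypothetical scale: 2 = crucial, 1 = supporting, 0 = auxiliary


def draft_memory(items: list[MemoryItem]) -> str:
    """Serialize items to Markdown so that higher priority gets more visual weight."""
    lines = []
    # sorted() is stable, so items with equal priority keep their insertion order.
    for item in sorted(items, key=lambda it: -it.priority):
        if item.priority >= 2:
            lines.append(f"# {item.text}")    # heading: largest rendered text
        elif item.priority == 1:
            lines.append(f"**{item.text}**")  # bold: highlighted body text
        else:
            lines.append(item.text)           # plain: first to become unreadable when shrunk
    return "\n\n".join(lines)
```

Under this assumed mapping, crucial evidence ends up in headings, which should remain legible longest as the rendered image is downsampled.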

If this is right

Outperforms strong text-based baselines across long-context multi-hop and single-hop question-answering benchmarks.
Achieves more effective context utilization when memory budgets are pushed to extreme levels.
Allocates memory space with adaptive information density through visual layout rather than uniform token cost.
Remains robust when trained under budget-aware reinforcement learning that varies compression levels.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same visual-rendering step could be tested on planning or tool-use agents that also accumulate long histories.
Hybrid text-plus-image memory might reduce the need for ever-larger context windows in general agent architectures.
If visual prioritization works, it suggests that future memory systems should treat layout and salience as first-class resources rather than after-the-fact compression.

Load-bearing premise

That turning structured memory into an image will let the agent reliably pick out crucial evidence and drop auxiliary details without losing what it needs, no matter how small the budget becomes.

What would settle it

On the same long-context QA benchmarks, measure whether accuracy and evidence-use scores drop below strong text baselines once the memory budget is forced below a fixed low threshold such as 512 tokens.
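
The check described above can be sketched as a small comparison routine. This is only an illustration: the function names, the dictionary layout (budget in tokens mapped to accuracy), and the use of a strict "below threshold" cut are assumptions, and no numbers from the paper are used.

```python
def relative_drop(acc_full: float, acc_budget: float) -> float:
    """Fraction of full-budget accuracy that is lost at a tighter budget."""
    if acc_full <= 0:
        raise ValueError("full-budget accuracy must be positive")
    return (acc_full - acc_budget) / acc_full


def budgets_where_claim_fails(
    visual: dict[int, float],
    text_baseline: dict[int, float],
    threshold: int = 512,
) -> list[int]:
    """Budgets below `threshold` at which the visual-memory agent scores under the text baseline."""
    return sorted(
        b
        for b in visual
        if b < threshold and b in text_baseline and visual[b] < text_baseline[b]
    )
```

An empty result over a well-chosen set of low budgets would be consistent with the paper's claim; any non-empty result would point to the budgets where it breaks.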

Figures

Figures reproduced from arXiv: 2601.21468 by An Zhang, Hui Su, Qi Gu, Shugui Liu, Wenyu Mao, Xiang Wang, Xunliang Cai, Yaorui Shi, Yuxin Chen, Yu Yang.

**Figure 1.** Comparison of memory paradigms. (a) Raw History Memory fetches relevant history passages but suffers from noise and redundancy. (b) Textual Summary Memory allows the agent to summarize the history but suffers from uniform information density, where auxiliary details (gray) consume as much token space as crucial information (green). (c) Visual Memory (Ours) allocates memory budget via visual layout to achie…

**Figure 2.** Framework of MemOCR. (a) Memory Drafting (Text Domain): The LLM agent incrementally updates a rich-text memory based on new incoming chunks, assigning visual priority via formatting and structure. (b) Memory Reading (Vision Domain): The rich text is rendered into a 2D memory image, which serves as the agent’s sole working context for answering queries. (c) Budget-Aware Training Objectives: We train the age…

**Figure 3.** Design of the budget-aware training objectives. (1) Standard QA uses the unmodified question and memory for global correctness. (2) QA w/ Augmented Memory requires the visibility of crucial evidence even when the visual memory is heavily compressed. (3) QA w/ Augmented Question ensures detailed information is clearly identified with sufficient tokens. The low-budget, high-detail setting (gray area) is exc…

**Figure 4.** Comparison of accuracy and relative performance drop across varying memory budgets (RQ2). MemOCR degrades more gracefully than textual baselines as budgets tighten. Without visual layout, MemOCR’s low-budget robustness drops significantly, which suggests that adaptive information density facilitates more efficient memory budget utilization. This additional drop indicates that MemOCR’s robustness primaril…

**Figure 7.** Case study at an extreme memory budget (16 tokens). (Left) The textual baseline fails due to hard truncation of the context. (Middle) MemOCR without layout control fails because uniform text becomes unreadable after down-sampling. (Right) MemOCR preserves the crucial evidence “Gene MacLellan” through adaptive layout, enabling correct reasoning even at low resolution.

**Figure 8.** Failure Mode Analysis under Resource Constraints (16-token budget). (Top) In comparative reasoning (i.e., to choose among two candidates), while the layout successfully highlights entity headers, the body text containing crucial attributes is compressed into unreadable noise during downsampling. (Bottom) When the rich-text memory length exceeds the visual canvas capacity, the forced font scaling drops belo…

Abstract

Long-horizon agentic reasoning necessitates effectively compressing growing interaction histories into a limited context window. Most existing memory systems serialize history as text, where token-level cost is uniform and scales linearly with length, often spending scarce budget on low-value details. To this end, we introduce MemOCR, a multimodal memory agent that improves long-horizon reasoning under tight context budgets by allocating memory space with adaptive information density through visual layout. Concretely, MemOCR maintains a structured rich-text memory (e.g., headings, highlights) and renders it into an image that the agent consults for memory access, visually prioritizing crucial evidence while aggressively compressing auxiliary details. To ensure robustness across varying memory budgets, we train MemOCR with reinforcement learning under budget-aware objectives that expose the agent to diverse compression levels. Across long-context multi-hop and single-hop question-answering benchmarks, MemOCR outperforms strong text-based baselines and achieves more effective context utilization under extreme budgets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MemOCR renders structured memory as images to let agents prioritize evidence visually under tight budgets, but the gains look more like better structuring than a clear layout win.

The paper's main move is to keep agent history as rich-text with headings and highlights, render that into an image, and let the multimodal model consult the image instead of raw text tokens. It adds RL training that varies the memory budget so the agent learns to compress across different limits. On long-context QA benchmarks it beats text baselines in both multi-hop and single-hop settings and uses context more efficiently at extreme compression levels.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces MemOCR, a multimodal memory agent for long-horizon reasoning under tight context budgets. It maintains structured rich-text memory (headings, highlights), renders it to an image for visual consultation by the agent, and trains the system via reinforcement learning with budget-aware objectives to enable adaptive compression. The central claim is that this visual-layout approach allows prioritizing crucial evidence while aggressively compressing auxiliary details, yielding better performance than text-based baselines on long-context multi-hop and single-hop QA benchmarks.

Significance. If the core mechanism holds, the work offers a promising direction for efficient memory management in agentic systems by shifting from uniform text token costs to layout-driven visual density. The budget-aware RL training is a positive design choice for robustness. The approach could influence multimodal agent architectures if the visual prioritization proves reliable.

major comments (2)

§3 (Method): The rendering of structured rich-text into images and the mechanism by which the vision encoder/policy exploits layout cues (headings, highlights) for selective prioritization rather than uniform processing are not described in sufficient detail. No information is given on image resolution, scaling with budget, or how the RL objective specifically enforces layout-based attention. This is load-bearing for the central claim, as failure here would collapse the advantage over text baselines.
§4 (Experiments): The reported outperformance on QA benchmarks lacks ablations that isolate the visual layout component from other factors such as the RL objective or memory structure. Without these, it is unclear whether gains stem from the claimed visual prioritization or from other implementation choices.

minor comments (2)

[Abstract] The abstract and introduction would benefit from explicit quantification of 'extreme budgets' (e.g., specific token or pixel limits) to ground the claims.
[Figures] Figure captions describing rendered memory examples could more clearly annotate how layout elements are preserved or compressed at different budgets.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have prepared revisions to the manuscript to improve clarity and experimental rigor.

Referee: §3 (Method): The rendering of structured rich-text into images and the mechanism by which the vision encoder/policy exploits layout cues (headings, highlights) for selective prioritization rather than uniform processing are not described in sufficient detail. No information is given on image resolution, scaling with budget, or how the RL objective specifically enforces layout-based attention. This is load-bearing for the central claim, as failure here would collapse the advantage over text baselines.

Authors: We agree that the current description in Section 3 is insufficiently detailed on these points. In the revised manuscript we will expand the method section to specify the rich-text to image rendering procedure (including layout preservation for headings and highlights), the image resolution and its scaling with available budget, the vision encoder's handling of visual density cues, and the precise formulation of the budget-aware RL objective that encourages layout-driven prioritization. These additions will directly support the central claim.

Revision: yes

Referee: §4 (Experiments): The reported outperformance on QA benchmarks lacks ablations that isolate the visual layout component from other factors such as the RL objective or memory structure. Without these, it is unclear whether gains stem from the claimed visual prioritization or from other implementation choices.

Authors: We acknowledge the validity of this concern. While the existing comparisons to text baselines demonstrate overall gains, we will add targeted ablations in the revised experiments section. These will include a text-only memory variant trained with the same RL procedure, removal of specific layout elements (headings/highlights), and controlled variations of the RL objective to isolate the contribution of visual layout prioritization.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on independent RL training and benchmark evaluation

The paper presents MemOCR as a new architecture that maintains structured rich-text memory, renders it to images, and trains the agent via reinforcement learning under budget-aware objectives to enable visual prioritization of evidence. Performance gains are reported as direct empirical outcomes on long-context multi-hop and single-hop QA benchmarks against text baselines. No equations, fitted parameters, or predictions are shown to reduce by construction to the inputs; the layout-exploitation assumption is tested rather than defined into existence, and no load-bearing self-citations or uniqueness theorems are invoked in the provided description. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; all such elements remain unknown.

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning
cs.AI 2026-05 unverdicted novelty 6.0

Skill1 trains one policy to jointly evolve skill query generation, re-ranking, task solving, and distillation from a single task-success signal, with low-frequency trends crediting selection and high-frequency variati...

ScrapMem: A Bio-inspired Framework for On-device Personalized Agent Memory via Optical Forgetting
cs.AI 2026-05 unverdicted novelty 6.0

ScrapMem introduces optical forgetting to compress multimodal memories for LLM agents on edge devices, cutting storage by up to 93% while reaching 51.0% Joint@10 and 70.3% Recall@10 on ATM-Bench.

POINTS-Seeker: Towards Training a Multimodal Agentic Search Model from Scratch
cs.CV 2026-04 unverdicted novelty 6.0

POINTS-Seeker-8B is an 8B multimodal model trained from scratch for agentic search that uses seeding and visual-space history folding to outperform prior models on six visual reasoning benchmarks.

MemReread: Enhancing Agentic Long-Context Reasoning via Memory-Guided Rereading
cs.CL 2026-05 unverdicted novelty 5.0

MemReread improves agent long-context reasoning by triggering rereading on insufficient final memory to recover discarded indirect facts, outperforming baselines at linear complexity.

Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning
cs.AI 2026-05 unverdicted novelty 5.0

Skill1 co-evolves skill selection, utilization, and distillation inside a single policy using only task-outcome reward, with low-frequency trends crediting selection and high-frequency variation crediting distillation...

Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning
cs.AI 2026-05 unverdicted novelty 5.0

Skill1 trains a single RL policy to co-evolve skill selection, utilization, and distillation in language model agents from one task-outcome reward, using low-frequency trends to credit selection and high-frequency var...

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · cited by 4 Pith papers · 20 internal anchors

[1]

Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923,
[2]

Constitutional AI: Harmlessness from AI Feedback

Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., et al. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073,
[3]

Realmem: Benchmarking LLMs in real-world memory-driven interaction

Bian, H., Yao, Z., Hu, S., Xu, Z., Zhang, S., Guo, Y., Yang, Z., Han, X., Wang, H., and Chen, R. Realmem: Benchmarking LLMs in real-world memory-driven interaction. arXiv preprint arXiv:2601.06966,
[4]

Glyph: Scaling context windows via visual-text compression

Cheng, J., Liu, Y., Zhang, X., Fei, Y., Hong, W., Lyu, R., Wang, W., Su, Z., Gu, X., Liu, X., et al. Glyph: Scaling context windows via visual-text compression. arXiv preprint arXiv:2510.17800,
[5]

Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

Chhikara, P., Khant, D., Aryan, S., Singh, T., and Yadav, D. Mem0: Building production-ready AI agents with scalable long-term memory. arXiv preprint arXiv:2504.19413,
[6]

DeepSeek-AI, Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., Zhang, X., Yu, X., Wu, Y., Wu, Z. F., Gou, Z., Shao, Z., Li, Z., Gao, Z., Liu, A., Xue, B., Wang, B., Wu, B., Feng, B., Lu, C., Zhao, C., Deng, C., Zhang, C., Ruan, C., Dai, D., Chen, D., Ji, D., Li, E., Lin, F., Dai, F., Luo, F., Hao, G., Chen, G.,...
[7]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

doi: 10.48550/ARXIV.2501.12948. Du, M., Xu, B., Zhu, C., Wang, X., and Mao, Z. Deepresearch bench: A comprehensive benchmark for deep research agents. arXiv preprint arXiv:2506.11763, 2025a. Du, Y., Huang, W., Zheng, D., Wang, Z., Montella, S., Lapata, M., Wong, K.-F., and Pan, J. Z. Rethinking memory in ai: Taxonomy, operations, topics, and future dir...
[9]

LightMem: Lightweight and Efficient Memory-Augmented Generation

URL https://doi.org/10.48550/arXiv.2510.18866. Feng, L., Yang, F., Chen, F., Cheng, X., Xu, H., Wan, Z., Yan, M., and An, B. Agentocr: Reimagining agent history via optical self-compression. arXiv preprint arXiv:2601.04786,
[10]

RULER: What's the Real Context Size of Your Long-Context Language Models?

Hsieh, C.-P., Sun, S., Kriman, S., Acharya, S., Rekesh, D., Jia, F., Zhang, Y., and Ginsburg, B. Ruler: What’s the real context size of your long-context language models? arXiv preprint arXiv:2404.06654,
[11]

Does memory need graphs? A unified framework and empirical analysis for long-term dialog memory

Hu, S., Wei, Y., Ran, J., Yao, Z., and Zou, L. Does memory need graphs? A unified framework and empirical analysis for long-term dialog memory. arXiv preprint arXiv:2601.01280,
[12]

Memory in the Age of AI Agents

Hu, Y., Liu, S., Yue, Y., Zhang, G., Liu, B., Zhu, F., Lin, J., Guo, H., Dou, S., Xi, Z., et al. Memory in the age of AI agents. arXiv preprint arXiv:2512.13564,
[13]

Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

Jin, B., Zeng, H., Yue, Z., Wang, D., Zamani, H., and Han, J. Search-r1: Training LLMs to reason and leverage search engines with reinforcement learning. arXiv preprint arXiv:2503.09516,
[14]

MemOS: A Memory OS for AI System

Li, Z., Song, S., Xi, C., Wang, H., Tang, C., Niu, S., Chen, D., Yang, J., Li, C., Yu, Q., et al. Memos: A memory OS for AI system. arXiv preprint arXiv:2507.03724,
[15]

Large Language Model Agent: A Survey on Methodology, Applications and Challenges

Luo, J., Zhang, W., Yuan, Y., Zhao, Y., Yang, J., Gu, Y., Wu, B., Chen, B., Qiao, Z., Long, Q., et al. Large language model agent: A survey on methodology, applications and challenges. arXiv preprint arXiv:2503.21460,
[16]

A survey on large language models with some insights on their capabilities and limitations

Matarazzo, A. and Torlone, R. A survey on large language models with some insights on their capabilities and limitations. arXiv preprint arXiv:2501.04040,
[17]

Nolima: Long-context evaluation beyond literal matching

Modarressi, A., Deilamsalehy, H., Dernoncourt, F., Bui, T., Rossi, R. A., Yoon, S., and Schütze, H. Nolima: Long-context evaluation beyond literal matching. arXiv preprint arXiv:2502.05167,
[18]

High-Dimensional Continuous Control Using Generalized Advantage Estimation

Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438,
[19]

Proximal Policy Optimization Algorithms

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347,

[20]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300,
[21]

When to memorize and when to stop: Gated recurrent memory for long-context reasoning

Sheng, L., Zhang, Y., Ma, W., Shi, Y., Huang, T., Wang, X., Zhang, A., Shen, K., and Chua, T.-S. When to memorize and when to stop: Gated recurrent memory for long-context reasoning. arXiv preprint arXiv:2602.10560,
[22]

Look back to reason forward: Revisitable memory for long-context LLM agents

Shi, Y., Chen, Y., Wang, S., Li, S., Cai, H., Gu, Q., Wang, X., and Zhang, A. Look back to reason forward: Revisitable memory for long-context LLM agents. arXiv preprint arXiv:2509.23040, 2025a. Shi, Y., Li, S., Wu, C., Liu, Z., Fang, J., Cai, H., Zhang, A., and Wang, X. Search and refine during think: Facilitating knowledge refinement for improved re...
[23]

R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning

Song, H., Jiang, J., Min, Y., Chen, J., Chen, Z., Zhao, W. X., Fang, L., and Wen, J.-R. R1-searcher: Incentivizing the search capability in LLMs via reinforcement learning. arXiv preprint arXiv:2503.05592,
[24]

Mem-α: Learning Memory Construction via Reinforcement Learning

Wang, Y., Takanobu, R., Liang, Z., Mao, Y., Hu, Y., McAuley, J., and Wu, X. Mem-α: Learning memory construction via reinforcement learning. arXiv preprint arXiv:2509.25911,
[25]

Vtc-r1: Vision-text compression for efficient long-context reasoning

Wang, Y., Jing, Y., Liu, S., Guan, H., Tu, R.-c., Wang, C., Huang, J., and Tao, D. Vtc-r1: Vision-text compression for efficient long-context reasoning. arXiv preprint arXiv:2601.22069,
[26]

DeepSeek-OCR: Contexts Optical Compression

Wei, H., Sun, Y., and Li, Y. Deepseek-ocr: Contexts optical compression. arXiv preprint arXiv:2510.18234,
[27]

Vision-centric token compression in large language model

Xing, L., Wang, A. J., Yan, R., Shu, X., and Tang, J. Vision-centric token compression in large language model. arXiv preprint arXiv:2502.00791,
[28]

Simpletir: End-to-end reinforcement learning for multi-turn tool-integrated reasoning

Xue, Z., Zheng, L., Liu, Q., Li, Y., Zheng, X., Ma, Z., and An, B. Simpletir: End-to-end reinforcement learning for multi-turn tool-integrated reasoning. arXiv preprint arXiv:2509.02479,
[29]

Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115,
[30]

Yang, A., Yu, B., Li, C., Liu, D., Huang, F., Huang, H., Jiang, J., Tu, J., Zhang, J., Zhou, J., et al. Qwen2.5-1M technical report. arXiv preprint arXiv:2501.15383,
[31]

Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W., Salakhutdinov, R., and Manning, C. D. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2369–2380,
[32]

MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent

Yu, H., Chen, T., Feng, J., Chen, J., Dai, W., Yu, Q., Zhang, Y.-Q., Ma, W.-Y., Liu, J., Wang, M., et al. Memagent: Reshaping long-context LLM with multi-conv RL-based memory agent. arXiv preprint arXiv:2507.02259,
[33]

Deep research: A survey of autonomous research agents

Zhang, W., Li, X., Zhang, Y., Jia, P., Wang, Y., Guo, H., Liu, Y., and Zhao, X. Deep research: A survey of autonomous research agents. arXiv preprint arXiv:2508.12752, 2025a. Zhang, Z., Dai, Q., Bo, X., Ma, C., Li, R., Chen, X., Zhu, J., Dong, Z., and Wen, J.-R. A survey on the memory mechanism of large language model-based agents. ACM Transactions on In...
[34]

or distilled reward models (Bai et al., 2022), the field has gradually shifted toward rule-based feedback, demonstrating great potential in enhancing model capabilities. Key algorithmic contributions include proximal policy optimization (Schulman et al., 2017), based on generalized advantage estimation (Schulman et al., 2015), and GRPO (Shao et al., 2024)...

[35]

enables multi-turn searching by leveraging raw history throughout the reasoning process. The second paradigm adopts textual summary memory, where long-context information is compressed into concise text forms rather than retaining the full raw history (Duverger et al., 2024; Yu et al., 2025; Wang et al., 2025; Bian et al., 2026; Hu et al., 2026; Shi et al...

[36]

OCR for Context Compression. Optical Character Recognition (OCR) (Arlazarov et al., 2022; Smith,

trains the agent to manage complex hierarchical memory systems, i.e., to extract, store, and update memory corpora of varying sizes and importance. OCR for Context Compression. Optical Character Recognition (OCR) (Arlazarov et al., 2022; Smith,
[37]

crucial evidence

is a well-established technology that is widely utilized for extracting textual content embedded in image format. In recent research advances, OCR has been explored as an innovative vision-text compression paradigm (Wei et al., 2025; Xing et al., 2025; Cheng et al., 2025). Unlike the conventional practice of directly inputting long context into LLMs, thi...
[38]

Budget-to-resolution mapping. Given a memory budget B (in visual tokens), we resize the rendered memory image to a target resolution such that the vision encoder produces ⩽ B tokens

Given an input Markdown string, the module (i) normalizes the text by stripping leading/trailing whitespace and surrounding backticks, (ii) converts Markdown to HTML using the Python markdown library, (iii) wraps the generated HTML in a fixed, inlined CSS template, and (iv) renders the HTML in a headless Chromium page and returns a screenshot image. Bud...
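
The normalization step (i) and the budget-to-resolution mapping in the fragment above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, each visual token is assumed to cover a fixed 28×28 pixel square (the excerpt does not give the encoder's patch geometry), and the image is assumed never to be upscaled.

```python
import math


def normalize_markdown(text: str) -> str:
    """Step (i): strip leading/trailing whitespace and surrounding backticks."""
    return text.strip().strip("`").strip()


def target_resolution(width: int, height: int, budget: int, token_px: int = 28) -> tuple[int, int]:
    """Return (new_width, new_height) whose token grid has at most `budget` cells.

    token_px is an ASSUMED side length, in pixels, of the square covered by one visual token.
    """
    if budget < 1 or width < 1 or height < 1:
        raise ValueError("budget, width and height must be positive")
    # Scale that maps the image area onto `budget` token squares, capped so we never upscale.
    scale = min(1.0, math.sqrt(budget * token_px * token_px / (width * height)))
    cols = max(1, math.floor(width * scale / token_px))
    rows = max(1, math.floor(height * scale / token_px))
    # Clamping to one row or column can overshoot for extreme aspect ratios; trim the longer side.
    while cols * rows > budget:
        if cols >= rows:
            cols -= 1
        else:
            rows -= 1
    return cols * token_px, rows * token_px
```

Flooring both grid dimensions keeps the token count at or under the budget while roughly preserving the aspect ratio; a real encoder may impose additional minimum or maximum pixel limits that this sketch ignores.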
[39]

A statistical significance analysis is conducted in Appendix D.1

We use sub-word exact match (SEM) as accuracy and report the mean scores over three independent runs with different random seeds. A statistical significance analysis is conducted in Appendix D.1. Unless otherwise noted, we use stochastic decoding (temperature=0.7, top-p=0.95) for b...
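
The excerpt above names the metric but does not define it. The sketch below assumes one common reading of sub-word exact match: a prediction counts as correct when the normalized gold answer appears as a contiguous word span inside the normalized prediction. The normalization rules (lowercasing, ASCII punctuation removal) and all function names are assumptions.

```python
import string
from statistics import mean


def _normalize(s: str) -> list[str]:
    """Lowercase, drop ASCII punctuation, and split on whitespace."""
    return s.lower().translate(str.maketrans("", "", string.punctuation)).split()


def sub_em(prediction: str, gold: str) -> bool:
    """True if the gold answer occurs as a contiguous word span in the prediction (assumed SEM)."""
    p, g = _normalize(prediction), _normalize(gold)
    if not g:
        return False
    n = len(g)
    return any(p[i:i + n] == g for i in range(len(p) - n + 1))


def mean_accuracy(per_run_hits: list[list[bool]]) -> float:
    """Average per-run accuracy, e.g. over three seeds as described in the excerpt."""
    return mean(sum(run) / len(run) for run in per_run_hits)
```

Matching on whole words rather than raw substrings avoids crediting a prediction that merely contains the answer inside a longer, unrelated word.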
[40]

The chunk size |C_t| is set to 1,000 following the authors’ setup

to match the model size of other baselines. The chunk size |C_t| is set to 1,000 following the authors’ setup. • Mem0 (Chhikara et al., 2025): we run reproduction following the official documentation
[41]

Who has a wider scope of profession...?

The results indicate the image rendering process is lightweight, consuming only 1 second per 68 samples and a 0.175 extra latency. D.3. Additional Ablation Studies Motivation. Table 2 shows that removing RL from our 7B setting causes substantial degradation, especially under strict memory budgets. A natural question is whether simply scaling the ba...

[1] [1]

Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Constitutional AI: Harmlessness from AI Feedback

Bai, Y ., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKin- non, C., et al. Constitutional ai: Harmlessness from ai feedback.arXiv preprint arXiv:2212.08073,

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

Realmem: Bench- marking llms in real-world memory-driven interaction

Bian, H., Yao, Z., Hu, S., Xu, Z., Zhang, S., Guo, Y ., Yang, Z., Han, X., Wang, H., and Chen, R. Realmem: Bench- marking llms in real-world memory-driven interaction. arXiv preprint arXiv:2601.06966,

work page arXiv

[4] [4]

Glyph: Scal- ing context windows via visual-text compression.arXiv preprint arXiv:2510.17800,

Cheng, J., Liu, Y ., Zhang, X., Fei, Y ., Hong, W., Lyu, R., Wang, W., Su, Z., Gu, X., Liu, X., et al. Glyph: Scal- ing context windows via visual-text compression.arXiv preprint arXiv:2510.17800,

work page arXiv

[5] [5]

Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

Chhikara, P., Khant, D., Aryan, S., Singh, T., and Yadav, D. Mem0: Building production-ready ai agents with scalable long-term memory.arXiv preprint arXiv:2504.19413,

work page internal anchor Pith review Pith/arXiv arXiv

[6] DeepSeek-AI, Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., Zhang, X., Yu, X., Wu, Y., Wu, Z. F., Gou, Z., Shao, Z., Li, Z., Gao, Z., Liu, A., Xue, B., Wang, B., Wu, B., Feng, B., Lu, C., Zhao, C., Deng, C., Zhang, C., Ruan, C., Dai, D., Chen, D., Ji, D., Li, E., Lin, F., Dai, F., Luo, F., Hao, G., Chen, G.,... DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. doi: 10.48550/ARXIV.2501.12948.

[7] Du, M., Xu, B., Zhu, C., Wang, X., and Mao, Z. DeepResearch Bench: A comprehensive benchmark for deep research agents. arXiv preprint arXiv:2506.11763, 2025a.

Du, Y., Huang, W., Zheng, D., Wang, Z., Montella, S., Lapata, M., Wong, K.-F., and Pan, J. Z. Rethinking memory in AI: Taxonomy, operations, topics, and future dir...

[8] LightMem: Lightweight and efficient memory-augmented generation. URL https://doi.org/10.48550/arXiv.2510.18866.

Feng, L., Yang, F., Chen, F., Cheng, X., Xu, H., Wan, Z., Yan, M., and An, B. AgentOCR: Reimagining agent history via optical self-compression. arXiv preprint arXiv:2601.04786.

[9] Hsieh, C.-P., Sun, S., Kriman, S., Acharya, S., Rekesh, D., Jia, F., Zhang, Y., and Ginsburg, B. RULER: What's the real context size of your long-context language models? arXiv preprint arXiv:2404.06654.

[10] Hu, S., Wei, Y., Ran, J., Yao, Z., and Zou, L. Does memory need graphs? A unified framework and empirical analysis for long-term dialog memory. arXiv preprint arXiv:2601.01280.

[11] Hu, Y., Liu, S., Yue, Y., Zhang, G., Liu, B., Zhu, F., Lin, J., Guo, H., Dou, S., Xi, Z., et al. Memory in the age of AI agents. arXiv preprint arXiv:2512.13564.

[12] Jin, B., Zeng, H., Yue, Z., Wang, D., Zamani, H., and Han, J. Search-R1: Training LLMs to reason and leverage search engines with reinforcement learning. arXiv preprint arXiv:2503.09516.

[13] Li, Z., Song, S., Xi, C., Wang, H., Tang, C., Niu, S., Chen, D., Yang, J., Li, C., Yu, Q., et al. MemOS: A memory OS for AI system. arXiv preprint arXiv:2507.03724.

[14] Luo, J., Zhang, W., Yuan, Y., Zhao, Y., Yang, J., Gu, Y., Wu, B., Chen, B., Qiao, Z., Long, Q., et al. Large language model agent: A survey on methodology, applications and challenges. arXiv preprint arXiv:2503.21460.

[15] Matarazzo, A. and Torlone, R. A survey on large language models with some insights on their capabilities and limitations. arXiv preprint arXiv:2501.04040.

[16] Modarressi, A., Deilamsalehy, H., Dernoncourt, F., Bui, T., Rossi, R. A., Yoon, S., and Schütze, H. NoLiMa: Long-context evaluation beyond literal matching. arXiv preprint arXiv:2502.05167.

[17] Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438.

[18] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

[19] Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.

[20] Sheng, L., Zhang, Y., Ma, W., Shi, Y., Huang, T., Wang, X., Zhang, A., Shen, K., and Chua, T.-S. When to memorize and when to stop: Gated recurrent memory for long-context reasoning. arXiv preprint arXiv:2602.10560.

[21] Shi, Y., Chen, Y., Wang, S., Li, S., Cai, H., Gu, Q., Wang, X., and Zhang, A. Look back to reason forward: Revisitable memory for long-context LLM agents. arXiv preprint arXiv:2509.23040, 2025a.

Shi, Y., Li, S., Wu, C., Liu, Z., Fang, J., Cai, H., Zhang, A., and Wang, X. Search and refine during think: Facilitating knowledge refinement for improved re...

[22] Song, H., Jiang, J., Min, Y., Chen, J., Chen, Z., Zhao, W. X., Fang, L., and Wen, J.-R. R1-Searcher: Incentivizing the search capability in LLMs via reinforcement learning. arXiv preprint arXiv:2503.05592.

[23] Wang, Y., Takanobu, R., Liang, Z., Mao, Y., Hu, Y., McAuley, J., and Wu, X. Mem-α: Learning memory construction via reinforcement learning. arXiv preprint arXiv:2509.25911.

[24] Wang, Y., Jing, Y., Liu, S., Guan, H., Tu, R.-c., Wang, C., Huang, J., and Tao, D. VTC-R1: Vision-text compression for efficient long-context reasoning. arXiv preprint arXiv:2601.22069.

[25] Wei, H., Sun, Y., and Li, Y. DeepSeek-OCR: Contexts optical compression. arXiv preprint arXiv:2510.18234.

[26] Xing, L., Wang, A. J., Yan, R., Shu, X., and Tang, J. Vision-centric token compression in large language model. arXiv preprint arXiv:2502.00791.

[27] Xue, Z., Zheng, L., Liu, Q., Li, Y., Zheng, X., Ma, Z., and An, B. SimpleTIR: End-to-end reinforcement learning for multi-turn tool-integrated reasoning. arXiv preprint arXiv:2509.02479.

[28] Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115.

[29] Yang, A., Yu, B., Li, C., Liu, D., Huang, F., Huang, H., Jiang, J., Tu, J., Zhang, J., Zhou, J., et al. Qwen2.5-1M technical report. arXiv preprint arXiv:2501.15383.

[30] Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W., Salakhutdinov, R., and Manning, C. D. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2369–2380.

[31] Yu, H., Chen, T., Feng, J., Chen, J., Dai, W., Yu, Q., Zhang, Y.-Q., Ma, W.-Y., Liu, J., Wang, M., et al. MemAgent: Reshaping long-context LLM with multi-conv RL-based memory agent. arXiv preprint arXiv:2507.02259.

[32] Zhang, W., Li, X., Zhang, Y., Jia, P., Wang, Y., Guo, H., Liu, Y., and Zhao, X. Deep research: A survey of autonomous research agents. arXiv preprint arXiv:2508.12752, 2025a.

Zhang, Z., Dai, Q., Bo, X., Ma, C., Li, R., Chen, X., Zhu, J., Dong, Z., and Wen, J.-R. A survey on the memory mechanism of large language model-based agents. ACM Transactions on In...

or distilled reward models (Bai et al., 2022), the field has gradually shifted toward rule-based feedback, demonstrating great potential in enhancing model capabilities. Key algorithmic contributions include proximal policy optimization (Schulman et al., 2017), based on generalized advantage estimation (Schulman et al., 2015), and GRPO (Shao et al., 2024)...

enables multi-turn searching by leveraging raw history throughout the reasoning process. The second paradigm adopts textual summary memory, where long-context information is compressed into concise text forms rather than retaining the full raw history (Duverger et al., 2024; Yu et al., 2025; Wang et al., 2025; Bian et al., 2026; Hu et al., 2026; Shi et al...

trains the agent to manage complex hierarchical memory systems, i.e., to extract, store, and update memory corpora of varying sizes and importance.

OCR for Context Compression. Optical Character Recognition (OCR) (Arlazarov et al., 2022; Smith,

is a well-established technology that is widely used for extracting textual content embedded in images. In recent research, OCR has been explored as an innovative vision-text compression paradigm (Wei et al., 2025; Xing et al., 2025; Cheng et al., 2025). Unlike the conventional practice of directly inputting long context into LLMs, thi...

Given an input Markdown string, the module (i) normalizes the text by stripping leading/trailing whitespace and surrounding backticks, (ii) converts Markdown to HTML using the Python markdown library, (iii) wraps the generated HTML in a fixed, inlined CSS template, and (iv) renders the HTML in a headless Chromium page and returns a screenshot image.

Budget-to-resolution mapping. Given a memory budget B (in visual tokens), we resize the rendered memory image to a target resolution such that the vision encoder produces ⩽ B tokens...
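The rendering steps and the budget-to-resolution mapping described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`normalize_markdown`, `wrap_html`, `budget_to_resolution`), the CSS string, and the 28-pixel block size per visual token are all hypothetical. Step (ii) would call the `markdown` package and step (iv) a headless browser; both are left out so the sketch stays within the standard library.

```python
import math

# Hypothetical inlined CSS; the paper's actual template is not given here.
DEFAULT_CSS = "body { font-family: sans-serif; margin: 8px; }"


def normalize_markdown(text):
    """Step (i): strip outer whitespace and surrounding backticks."""
    return text.strip().strip("`").strip()


def wrap_html(body_html, css=DEFAULT_CSS):
    """Step (iii): wrap converted HTML in a fixed, inlined CSS template."""
    return (
        "<html><head><style>" + css + "</style></head>"
        "<body>" + body_html + "</body></html>"
    )


def budget_to_resolution(width, height, budget, px_per_token=28):
    """Return (new_width, new_height) whose token grid has <= budget cells.

    Assumption (not from the paper): the vision encoder emits one token
    per px_per_token x px_per_token pixel block.
    """
    if budget < 1:
        raise ValueError("budget must be at least 1 visual token")
    # Uniform scale that makes the token area match the budget; never upscale.
    scale = min(1.0, math.sqrt(budget * px_per_token**2 / (width * height)))
    # Snap down to whole token blocks, keeping at least one block per side.
    cols = max(1, math.floor(width * scale / px_per_token))
    rows = max(1, math.floor(height * scale / px_per_token))
    # The clamp to 1 can overshoot for extreme aspect ratios; trim the longer side.
    while cols * rows > budget:
        if cols >= rows:
            cols -= 1
        else:
            rows -= 1
    return cols * px_per_token, rows * px_per_token


def visual_tokens(width, height, px_per_token=28):
    """Token count of an image under the same block-size assumption."""
    return (width // px_per_token) * (height // px_per_token)
```

Snapping both sides down to whole blocks keeps the token count at or below the budget while roughly preserving the aspect ratio; a real vision encoder may round differently, so the block size should be checked against the model in use.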

We use sub-word exact match (SEM) as accuracy and report the mean scores over three independent runs with different random seeds. A statistical significance analysis is conducted in Appendix D.1. Unless otherwise noted, we use stochastic decoding (temperature = 0.7, top-p = 0.95) for b...
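The evaluation protocol above can be sketched in a few lines. The exact definition of SEM is not given in this passage, so the sketch assumes a common convention: a prediction counts as correct when a normalized gold answer appears as a substring of the normalized prediction. The function names `normalize`, `sub_em`, and `mean_accuracy_over_runs` are hypothetical.

```python
import re
import string
from statistics import mean


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def sub_em(prediction, gold_answers):
    """Assumed SEM: 1.0 if any normalized gold answer occurs in the prediction."""
    pred = normalize(prediction)
    golds = [normalize(g) for g in gold_answers]
    # Skip gold answers that normalize to an empty string, which would match anything.
    return float(any(g and g in pred for g in golds))


def mean_accuracy_over_runs(runs):
    """Average per-run accuracy, where each run is a list of per-sample scores."""
    return mean(mean(scores) for scores in runs)
```

Averaging per-run accuracies rather than pooling all samples matches the statement that scores are reported as the mean over three independent runs.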

to match the model size of other baselines. The chunk size |C_t| is set to 1,000 following the authors' setup.

• Mem0 (Chhikara et al., 2025): we run the reproduction following the official documentation
