pith. machine review for the scientific record.

arxiv: 2604.04373 · v1 · submitted 2026-04-06 · 💻 cs.AI · cs.LG

Recognition: 3 Lean theorem links

Decocted Experience Improves Test-Time Inference in LLM Agents

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 19:45 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords LLM agents · test-time inference · decocted experience · context construction · experience augmentation · reasoning tasks · agentic tasks

The pith

Decocted experience, by distilling past interactions into coherent context, improves LLM agent performance at test time without extra compute.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines ways to boost LLM agents on complex tasks by improving the input context they receive rather than scaling raw inference compute like longer chains or more samples. It demonstrates that simply accumulating experience is not enough; the experience must be processed into a decocted form that extracts core insights, organizes them logically, and allows retrieval of relevant parts. This context construction approach is analyzed for how it derives guidance from experience, how results improve as experience grows, what makes context effective, and which data structures help most. The findings are checked on math reasoning, web browsing, and software engineering tasks. A reader would care because it points to a lower-cost path for better agent behavior using only existing models and smarter use of history.
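The accumulate, distill, organize, retrieve loop described above can be sketched in a few lines. This is a hypothetical illustration, not the authors' implementation: `Lesson`, `ExperienceMemory`, and the heuristic inside `add` (which stands in for an LLM lesson-distillation call) are invented for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Lesson:
    task_type: str   # class of problems the lesson applies to
    strategy: str    # distilled step-by-step strategy
    pitfalls: str    # common mistakes to avoid


class ExperienceMemory:
    """Accumulate raw trajectories, decoct them into lessons, retrieve by task."""

    def __init__(self):
        self.lessons: list[Lesson] = []

    def add(self, trajectory: dict) -> None:
        # Decoct: extract the essence of a raw trajectory into a compact lesson.
        # In the paper this is an LLM "lesson distillation" call; a trivial
        # heuristic stands in here so the sketch stays self-contained.
        self.lessons.append(Lesson(
            task_type=trajectory["task_type"],
            strategy=" -> ".join(trajectory["actions"]),
            pitfalls=trajectory.get("failure_note", "none"),
        ))

    def retrieve(self, task_type: str, k: int = 3) -> list[Lesson]:
        # Salient retrieval: surface only lessons relevant to the task at hand.
        return [l for l in self.lessons if l.task_type == task_type][:k]


memory = ExperienceMemory()
memory.add({"task_type": "web_shop",
            "actions": ["SEARCH 'red mug'", "CLICK item", "BUY"],
            "failure_note": "filter by price before clicking"})
context = memory.retrieve("web_shop", k=1)
```

The point of the sketch is the separation of concerns the paper argues for: raw trajectories never enter the prompt directly; only their decocted form does.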

Core claim

Effective context construction for experience-augmented LLM agents critically depends on decocted experience: extracting essence from experience, organizing it coherently, and retrieving salient information to build effective context. The paper studies how to derive context from experience, how performance scales with accumulated experience, what characterizes good context, and which data structures best support context construction, validating the approach across reasoning and agentic tasks including math reasoning, web browsing, and software engineering.

What carries the argument

Decocted experience, the process of extracting the essence from accumulated experience, organizing it coherently, and retrieving salient information to construct effective context for guiding the model's reasoning.

If this is right

  • LLM agents achieve higher performance when context is built from decocted experience instead of raw accumulated logs.
  • Performance improves as more experience is accumulated provided it is decocted into coherent, retrievable form.
  • Data structures that enable coherent organization and salient retrieval are required for effective context construction.
  • The benefits appear on mathematical reasoning, web browsing, and software engineering tasks.
  • Context construction serves as a complementary scaling axis to increased test-time computation.
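The retrieval bullet above can be made concrete with a minimal sketch of salient Top-K retrieval over a flat lesson memory. Bag-of-words cosine similarity stands in for the learned embeddings a real system would use; `embed`, `cosine`, and `top_k` are hypothetical names, not the paper's API.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    # Hypothetical stand-in for a learned embedding: bag-of-words counts.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def top_k(query: str, lessons: list[str], k: int) -> list[str]:
    # Salient retrieval: rank lessons by similarity to the current task.
    q = embed(query)
    ranked = sorted(lessons, key=lambda l: cosine(q, embed(l)), reverse=True)
    return ranked[:k]


lessons = [
    "web shopping filter by price before adding to cart",
    "math induction verify the base case first",
    "debugging reproduce the failure before patching",
]
best = top_k("shopping cart price filter", lessons, k=1)  # web-shopping lesson ranks first
```

Figure 3's finding maps onto this sketch directly: because each lesson is short, a small K stays within budget while still covering the salient experience.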

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Agent designs may benefit from dedicated modules that automatically distill raw interaction history before storage.
  • The emphasis on distillation over volume suggests similar processing could help in-context learning with long histories.
  • Retrieval-augmented systems could see gains by applying decocting to their external knowledge stores before use.

Load-bearing premise

The observed performance gains come specifically from the decocted form of the experience rather than from total experience volume, retrieval method, or task-specific prompt engineering.

What would settle it

An experiment that keeps experience volume, retrieval method, and prompts fixed but replaces decocted experience with raw unprocessed logs and finds no performance difference on math reasoning, web browsing, or software engineering tasks.
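Such a settling experiment can be phrased as a small harness that holds the token budget, retrieval, and prompt fixed while swapping only the context form. A sketch under stated assumptions: the `agent` callable and the word-count token proxy are placeholders, not the paper's setup.

```python
def run_ablation(agent, tasks, raw_logs, decocted_lessons, budget_tokens=512):
    """Hold volume, retrieval, and prompt fixed; vary only the context form.

    `agent` is a hypothetical callable: agent(task, context) -> bool success.
    """
    def truncate(items, budget):
        # Enforce a matched token budget (word count as a crude token proxy).
        out, used = [], 0
        for it in items:
            cost = len(it.split())
            if used + cost > budget:
                break
            out.append(it)
            used += cost
        return out

    results = {}
    for name, pool in [("raw", raw_logs), ("decocted", decocted_lessons)]:
        ctx = truncate(pool, budget_tokens)  # same budget for both arms
        results[name] = sum(agent(t, ctx) for t in tasks) / len(tasks)
    return results


# Toy demonstration: a stub agent that succeeds when its task appears in context.
def stub_agent(task, context):
    return any(task in c for c in context)


scores = run_ablation(
    stub_agent,
    tasks=["price filter"],
    raw_logs=["long noisy trajectory " * 40, "price filter mentioned late"],
    decocted_lessons=["price filter before checkout"],
    budget_tokens=8,
)
```

In the toy run the raw log blows the budget and contributes nothing, while the decocted lesson fits; the paper's claim is falsified if, under such matched budgets, the two arms tie.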

Figures

Figures reproduced from arXiv: 2604.04373 by Gregory Wornell, J. Jon Ryu, Kaiwen Zha, Maohao Shen, Prasanna Sattigeri, Siru Ouyang, Suhas Diggavi, Zexue He, Zhang-Wei Hong.

Figure 1
Figure 1: Experience-Augmented Agent. The agent accumulates experience from past interactions, decocts it into effective context for improved inference at test time, i.e., distilling lessons from experience, organizing the experience memory, and finally retrieving salient information from it. A more natural source of high-quality context is experience, e.g., through interaction with environments, an agent accumulat… view at source ↗
Figure 2
Figure 2: Raw Experience vs. Distilled Lesson as Context. Both context construction approaches significantly outperform the vanilla agent without context. Raw experience is slightly stronger for mathematical reasoning, while distilled lessons yield better performance in agentic tasks (WebShop & SWE), where trajectory-level observations are noisier and distillation helps. Lesson Distillation Enables More Favorable Co… view at source ↗
Figure 3
Figure 3: Scaling Behavior. Agent’s performance as a function of input context length when increasing K in Top-K retrieval. Distilled lessons achieve better performance with fewer input tokens and remain more robust as context grows. In contrast, raw experience can degrade when the prompt becomes excessively long and noisy. Overall, lesson distillation acts as an effective context compression mechanism that extracts… view at source ↗
Figure 4
Figure 4: Experience Scaling Behavior via Memory Consolidation. We evaluate experience memory consolidation across varying memory sizes. The red dashed line indicates full memory baseline performance. Memory consolidation achieves a sweet spot at intermediate sizes. … between the informativeness of context and effectiveness of inference. Here, we provide a quantitative yet intuitive relationship between them via infor… view at source ↗
Figure 5
Figure 5: Empirical validation of Proposition 4.1. Figure (a)’s strong linear correlation confirms that Ĥ(Y | X = x, C = c) tightly predicts the expected output length up to a constant. Figure (b) shows that relevant context (retrieved lessons) yields higher information gain than random context. … view at source ↗
Figure 6
Figure 6: Correlation between Context Quality and Performance Improvement. (a) The relationship between the context quality score and the improvement measured in ∆avg m(x). The positive Pearson correlation indicates that higher-quality contexts tend to yield larger performance gains. (b) Pearson correlation as the coefficient λ varies. The correlation peaks around λ = 0.6, suggesting that balancing relevance and div… view at source ↗
Figure 7
Figure 7: Visualization of Concept Tree (WebShop). Hierarchical concept tree constructed from 1,135 experience records on WebShop task. The tree organizes standard lesson-based memory entries into 8 broad topics and 44 concrete concept groups via hierarchical clustering. Each leaf node contains a collection of lessons, with leaf segment widths proportional to the number of records. … b children, where the branching fa… view at source ↗
Figure 8
Figure 8: (a) shows that the concept tree leads to improved inference performance … view at source ↗
Figure 9
Figure 9: Visualization of Concept Tree (Math Reasoning & SWE). (a) Hierarchical concept tree constructed from 13,381 experience records on the math reasoning task. The tree organizes memories into 17 broad topics and 256 concrete concept groups via two levels of embedding-based clustering, with segment widths proportional to the number of records in each cluster. For visual clarity, only the four largest leaf clust… view at source ↗
Figure 10
Figure 10: Hierarchical Concept Tree vs. Lesson Retrieval Performance comparing the hierarchical concept tree against Top-K lesson retrieval and vanilla baselines. … retrieval over a flat lesson memory is often sufficient to identify useful examples. In contrast, on SWE, the concept tree still delivers a noticeable improvement over the baseline, consistent with our observation on the WebShop task … view at source ↗
Figure 11
Figure 11: Analysis Results on GPT-OSS-20B (WebShop). The overall trends are consistent with the main results in Section 3. (a) Distilled lessons provide more effective context than raw experience. (b) Memory consolidation shows a sweet spot at intermediate memory sizes. view at source ↗
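Figures 7 and 9 describe a two-level concept tree built by embedding-based hierarchical clustering. A minimal sketch of the data structure, assuming cluster assignments are already available: `topic_of` and `concept_of` are hypothetical labeling functions standing in for those assignments, not the paper's clustering pipeline.

```python
from collections import defaultdict


def build_concept_tree(records, topic_of, concept_of):
    # Two-level tree: topic -> concept -> lessons. The paper builds this via
    # embedding-based hierarchical clustering; here simple labeling functions
    # stand in for the cluster assignments.
    tree = defaultdict(lambda: defaultdict(list))
    for rec in records:
        tree[topic_of(rec)][concept_of(rec)].append(rec)
    return tree


def retrieve_from_tree(tree, topic, concept, k=3):
    # Route a query down the hierarchy instead of scanning a flat memory.
    return tree.get(topic, {}).get(concept, [])[:k]


records = [
    "web: filter by price", "web: check size options",
    "math: induction base case", "math: modular arithmetic residues",
]
tree = build_concept_tree(
    records,
    topic_of=lambda r: r.split(":")[0],
    concept_of=lambda r: r.split()[1],
)
```

The design choice the figures highlight is that routing (topic, then concept) narrows the candidate set before any similarity scoring, which matters most when the flat memory is large and noisy, as on SWE.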
read the original abstract

There is growing interest in improving LLMs without updating model parameters. One well-established direction is test-time scaling, where increased inference-time computation (e.g., longer reasoning, sampling, or search) is used to improve performance. However, for complex reasoning and agentic tasks, naively scaling test-time compute can substantially increase cost and still lead to wasted budget on suboptimal exploration. In this paper, we explore \emph{context} as a complementary scaling axis for improving LLM performance, and systematically study how to construct better inputs that guide reasoning through \emph{experience}. We show that effective context construction critically depends on \emph{decocted experience}. We present a detailed analysis of experience-augmented agents, studying how to derive context from experience, how performance scales with accumulated experience, what characterizes good context, and which data structures best support context construction. We identify \emph{decocted experience} as a key mechanism for effective context construction: extracting essence from experience, organizing it coherently, and retrieving salient information to build effective context. We validate our findings across reasoning and agentic tasks, including math reasoning, web browsing, and software engineering.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that effective context construction for LLM agents depends critically on 'decocted experience'—extracting essence from experience, organizing it coherently, and retrieving salient information. The authors analyze experience-augmented agents by studying context derivation from experience, how performance scales with accumulated experience, what characterizes good context, and which data structures best support context construction. They position context as a complementary scaling axis to test-time compute and validate the findings across math reasoning, web browsing, and software engineering tasks.

Significance. If the empirical results hold after addressing controls, the work could be significant for LLM agent research by identifying a mechanism for leveraging experience to improve inference efficiency without parameter updates or naive compute scaling. The multi-domain validation and focus on context construction mechanisms provide a foundation for more principled agent design and could influence future work on experience management in AI systems.

major comments (2)
  1. Abstract: The abstract asserts validation across math reasoning, web browsing, and software engineering tasks but supplies no details on baselines, metrics, statistical tests, or experimental controls, preventing assessment of whether the data supports the central claim.
  2. Experiments section: The central claim requires that performance gains are specifically attributable to the decocted form of experience (essence extraction + coherent organization + salient retrieval) rather than confounds such as total experience volume (token count), retrieval method, or task-specific prompt engineering. No ablations or matched baselines isolating these factors are described.
minor comments (2)
  1. Introduction: The neologism 'decocted experience' would benefit from an explicit early definition or etymological note to aid reader comprehension.
  2. Related work: Consider citing prior literature on context compression in LLMs and experience replay mechanisms in agentic RL to better situate the contribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important areas for strengthening the clarity and rigor of our claims. We address each major comment below and have prepared revisions to the manuscript.

read point-by-point responses
  1. Referee: Abstract: The abstract asserts validation across math reasoning, web browsing, and software engineering tasks but supplies no details on baselines, metrics, statistical tests, or experimental controls, preventing assessment of whether the data supports the central claim.

    Authors: We agree that the abstract would benefit from greater specificity to help readers evaluate the claims at a glance. In the revised version, we have expanded the abstract to reference the main baselines (standard ReAct-style agents and raw experience accumulation), primary metrics (task success rate and inference efficiency), and note that results are reported as averages over multiple runs with statistical details provided in the Experiments section. Space constraints limit full enumeration of controls in the abstract, but these are now explicitly summarized there. revision: yes

  2. Referee: Experiments section: The central claim requires that performance gains are specifically attributable to the decocted form of experience (essence extraction + coherent organization + salient retrieval) rather than confounds such as total experience volume (token count), retrieval method, or task-specific prompt engineering. No ablations or matched baselines isolating these factors are described.

    Authors: We acknowledge this point and the need for tighter isolation of the decocted experience mechanism. While the original experiments include comparisons to raw experience and alternative context formats, they did not fully control for matched token budgets or systematically vary retrieval methods across domains. We have conducted additional ablation studies that (i) truncate experience to equal token lengths, (ii) compare semantic, keyword, and random retrieval under fixed budgets, and (iii) hold prompt engineering constant. These results, which support the specific contribution of decocting, will be added to the Experiments section with corresponding tables and analysis in the revision. revision: yes
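Ablation (ii) above, comparing retrieval methods under a fixed budget, can be sketched as interchangeable policies scored over the same lesson pool. `keyword_retrieve` and `random_retrieve` are hypothetical stand-ins; a real semantic retriever would use learned embeddings.

```python
import random


def keyword_retrieve(query, pool, k):
    # Rank lessons by word overlap with the query (keyword baseline).
    qwords = set(query.split())
    return sorted(pool, key=lambda p: -len(qwords & set(p.split())))[:k]


def random_retrieve(query, pool, k, seed=0):
    # Seeded random baseline, so comparisons are reproducible.
    return random.Random(seed).sample(pool, min(k, len(pool)))


pool = [
    "filter results by price before clicking",
    "always verify the induction base case",
    "reproduce the bug before patching",
]
query = "price filter clicking"
kw = keyword_retrieve(query, pool, k=1)
rnd = random_retrieve(query, pool, k=1)
```

Holding k (and hence the token budget) fixed while swapping the policy isolates the retrieval method, which is exactly the control the referee asks for.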

Circularity Check

0 steps flagged

No significant circularity: empirical analysis without derivations or self-referential reductions

full rationale

The paper is an empirical study of experience-augmented LLM agents that identifies 'decocted experience' via analysis of how context is derived from experience, how performance scales, and validation across math, web, and SE tasks. No equations, parameters, or derivation chains appear that reduce by construction to inputs; claims rest on experimental observations rather than self-definition, fitted predictions, or load-bearing self-citations. The work is evaluated against external benchmarks, with no circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The central claim rests on the empirical observation that decocted experience improves context construction. No free parameters, mathematical axioms, or additional invented entities beyond the introduced concept are specified in the abstract.

invented entities (1)
  • decocted experience no independent evidence
    purpose: Extracting essence from experience, organizing it coherently, and retrieving salient information to build effective context for LLM agents
    Introduced in the abstract as the key mechanism enabling better context construction; no independent evidence or external validation is provided beyond the paper's own experiments.

pith-pipeline@v0.9.0 · 5536 in / 1219 out tokens · 48030 ms · 2026-05-10T19:45:42.306352+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SkillOS: Learning Skill Curation for Self-Evolving Agents

cs.AI · 2026-05 · unverdicted · novelty 6.0

    SkillOS is an RL recipe that learns to curate reusable skills for self-evolving LLM agents, outperforming memory-free and memory-based baselines while generalizing across executors and domains.

Reference graph

Works this paper leans on

15 extracted references · 5 canonical work pages · cited by 1 Pith paper · 3 internal anchors

  1. [1]

    Parth Sarthi, Salman Abdullah, Aditi Tuli, Shubh Khanna, Anna Goldie, and Christopher D

15 Alireza Rezazadeh, Zichao Li, Wei Wei, and Yujia Bao. From isolated conversations to hierarchical schemas: Dynamic tree memory representation for LLMs. arXiv preprint arXiv:2410.14052, 2024. 15 Pranab Sahoo, Ayush Kumar Singh, Sriparna Saha, Vinija Jain, Samrat Mondal, and Aman Chadha. A systematic survey of prompt engineering in large language models: ...

  2. [2]

    MIRIX: Multi-Agent Memory System for LLM-Based Agents

2, 16 Yu Wang and Xi Chen. MIRIX: Multi-agent memory system for LLM-based agents. arXiv preprint arXiv:2507.07957, 2025. 15 Yu Wang, Ryuichi Takanobu, Zhiqi Liang, Yuzhen Mao, Yuanzhe Hu, Julian McAuley, and Xiaojian Wu. Mem-α: Learning memory construction via reinforcement learning. arXiv preprint arXiv:2509.25911, 2025. 15 Zora Zhiruo Wang, Jiayuan Mao, D...

  3. [3]

    Memory-R1: Enhancing Large Language Model Agents to Manage and Utilize Memories via Reinforcement Learning

2, 15, 16 Yangzhen Wu, Zhiqing Sun, Shanda Li, Sean Welleck, and Yiming Yang. Inference scaling laws: An empirical analysis of compute-optimal inference for problem-solving with language models. In Advances in Neural Information Processing Systems, 2024. 1 Zhiheng Xi, Yiwen Ding, Wenxiang Chen, Boyang Hong, Honglin Guo, Junzhe Wang, Dingwen Yang, Chenyang ...

  4. [4]

    MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent

17 Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2369–2380, 2018. 15 Shunyu Yao, Howard Chen, John Yang, and Karthi...

  5. [5]

MemGen: Weaving generative latent memory for self-evolving agents. arXiv preprint arXiv:2509.24704, 2025

1 Guibin Zhang, Muxin Fu, and Shuicheng Yan. MemGen: Weaving generative latent memory for self-evolving agents. arXiv preprint arXiv:2509.24704, 2025a. 15 Kai Zhang, Xiangchao Chen, Bo Liu, et al. Agent learning via early experience. arXiv preprint arXiv:2510.08558, 2025b. 16 Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsidhar K...

  6. [6]

A more recent line of work seeks to train LLMs to learn memory management

builds a multi-agent controller that routes across specialized memory types such as episodic, semantic, and procedural memory. A more recent line of work seeks to train LLMs to learn memory management. MemGen (Zhang et al., 2025a) interleaves reasoning with generated latent memory, whereas Mem-α (Wang et al., 2025), MemAgent (Yu et al., 2025a), and Memory-R...

  7. [7]

    Task Description: A one-two sentence summary of the type of problems this strategy applies to

  8. [8]

    Strategy: A step-by-step detailed problem-solving strategy that could consist of various different ways to tackle similar problems

  9. [9]

    # Important: The strategy should be extremely detailed, covering multiple different ways to solve the problem

    Pitfalls: Common mistakes or misconceptions to avoid when solving this type of problems (if applicable). # Important: The strategy should be extremely detailed, covering multiple different ways to solve the problem. Prompt Template: Lesson Distillation (WebShop) You are a shopping strategy synthesizer analyzing past web shopping attempts. Review the shopp...

  10. [10]

    Task Description: A one-two sentence summary of the type of shopping task this strategy applies to

  11. [11]

    Action Workflow: A step-by-step detailed problem-solving strategy to complete the purchase, including step-level actions and the summarization of the observations at each step

  12. [12]

Pitfalls: Common mistakes or pitfalls to avoid when solving this type of shopping task (if applicable). # Important: The workflow should include clear action steps (e.g., CLICK, TYPE, SELECT) and observations at each step. Prompt Template: Lesson Distillation (SWE) You are a debugging strat...

  13. [13]

    Task Description: One-two sentence summary of the issue type and code base area

  14. [14]

    Strategy: Step-by-step detailed agent workflow structured as an action sequence (followed by successful attempts)

  15. [15]

    number theory

    Pitfalls: Common mistakes to avoid. If a failed attempt is available, explain the specific wrong turn and how to avoid it. # Important: The workflow should include clear action steps and observations at each step. Prompt Template: Experience-based Inference You are an expert at solving complex reasoning problems, while leveraging few-shot experience. You ...