
arxiv: 2604.07487 · v1 · submitted 2026-04-08 · 💻 cs.AI

CLEAR: Context Augmentation from Contrastive Learning of Experience via Agentic Reflection

Pith reviewed 2026-05-10 18:02 UTC · model grok-4.3

classification 💻 cs.AI
keywords context augmentation · contrastive analysis · agentic reflection · language model agents · reinforcement learning · generative context · task execution

The pith

Agents learn to generate task-specific context from contrastive analysis of their own past execution trajectories instead of retrieving prior summaries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language model agents typically rely on retrieval of context generated from past tasks, which then requires the agent to adapt that context to new situations and adds reasoning load. CLEAR introduces a generative alternative in which a reflection agent first contrasts successful and unsuccessful trajectories for each task to produce targeted summaries. These summaries become supervised fine-tuning data for a Context Augmentation Model that is later refined by reinforcement learning whose reward comes directly from the execution agent's task performance. The trained model therefore produces context tailored to the current task rather than reused from memory. Experiments on AppWorld and WebShop confirm higher task completion rates and rewards than strong retrieval baselines.

Core claim

CLEAR first employs a reflection agent to perform contrastive analysis over past execution trajectories and summarize useful context for each observed task. These summaries are then used as supervised fine-tuning data to train a context augmentation model (CAM). CAM is further optimized using reinforcement learning, where the reward signal is obtained by running the task execution agent. By learning to generate task-specific knowledge rather than retrieve knowledge from the past, CAM produces context that is better tailored to the current task.
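
The data flow in this claim can be made concrete with a minimal sketch. It covers only the data-collection stage and the reward used in the RL stage; the fine-tuning and policy-update steps are omitted. Every name here (`Trajectory`, `build_sft_data`, `rl_reward`, the `execute` and `reflect` callables, and the rollout count `m=4`) is a hypothetical placeholder chosen for illustration, not the authors' released API.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical types for this sketch only (not the authors' API).
Trajectory = Tuple[str, float]                      # (trace text, reward)
Execute = Callable[[str], Trajectory]               # task execution agent
Reflect = Callable[[str, List[Trajectory]], str]    # reflection agent


def build_sft_data(tasks: List[str], execute: Execute, reflect: Reflect,
                   m: int = 4) -> List[Dict[str, str]]:
    """Roll out each task m times and let the reflection agent contrast
    the resulting trajectories into one piece of task-specific context."""
    dataset = []
    for task in tasks:
        replays = [execute(task) for _ in range(m)]
        dataset.append({"prompt": task, "completion": reflect(task, replays)})
    return dataset


def rl_reward(task: str, context: str, execute: Execute) -> float:
    """Reward for a generated context: run the execution agent on the task
    with the context attached and return the reward it earns."""
    _, reward = execute(f"{task}\n\n{context}")
    return reward
```

The resulting prompt/completion pairs would feed supervised fine-tuning of the CAM, and `rl_reward` stands in for the signal that the RL stage then optimizes.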

What carries the argument

The Context Augmentation Model (CAM), trained first on summaries from contrastive reflection of past trajectories and then optimized by reinforcement learning driven by task-execution rewards.

If this is right

  • Task completion rate on AppWorld rises from 72.62 percent to 81.15 percent.
  • Average reward on a WebShop subset rises from 0.68 to 0.74.
  • The execution agent receives context already adapted to the current task and therefore spends less of its reasoning on adaptation.
  • Context provision no longer depends on retrieval from a growing store of past experiences.
  • The overall agent pipeline improves consistently across the evaluated benchmarks.
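
To put the two headline numbers on a common scale, the short calculation below converts each reported before/after pair into an absolute and a relative change. The helper name `gain` is ours; the four input values are the ones quoted above.

```python
from typing import Tuple


def gain(before: float, after: float) -> Tuple[float, float]:
    """Return (absolute change, relative change in percent)."""
    absolute = after - before
    return absolute, 100.0 * absolute / before


appworld_abs, appworld_rel = gain(72.62, 81.15)   # task completion rate, percent
webshop_abs, webshop_rel = gain(0.68, 0.74)       # average reward

print(f"AppWorld: +{appworld_abs:.2f} points ({appworld_rel:.1f}% relative)")
print(f"WebShop:  +{webshop_abs:.2f} reward ({webshop_rel:.1f}% relative)")
```

The AppWorld gain is 8.53 percentage points (about 11.7% relative) and the WebShop gain is 0.06 (about 8.8% relative), so the two improvements are of similar relative size.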

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same reflection-plus-generation loop could reduce the size of memory stores agents must maintain for long-horizon work.
  • If reflection quality can be maintained on novel task distributions, the approach may extend to domains outside the current benchmarks.
  • Iterative improvement of the reflection component could steadily raise the quality of training data for the context model.
  • The separation of reflection, generation, and execution stages may allow independent scaling of each piece in larger agent systems.

Load-bearing premise

The reflection agent's contrastive summaries must supply high-quality, unbiased task knowledge that serves as reliable training data without introducing errors that the reinforcement-learning stage then amplifies.

What would settle it

On a new set of tasks, measuring whether the CLEAR agent completes fewer tasks or earns lower rewards than an otherwise identical retrieval baseline would directly test whether generated context improves performance.
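
One way such a comparison could be scored, assuming both agents are run on the same tasks and each task yields a binary success flag, is an exact McNemar test on the tasks where the two agents disagree. This is our suggestion rather than a procedure from the paper, and it fits completion-style metrics such as AppWorld's; a continuous reward such as WebShop's would need a different paired test.

```python
from math import comb
from typing import Sequence


def exact_mcnemar(clear_success: Sequence[bool],
                  baseline_success: Sequence[bool]) -> float:
    """Two-sided exact McNemar p-value for paired per-task outcomes."""
    if len(clear_success) != len(baseline_success):
        raise ValueError("both agents must be scored on the same tasks")
    pairs = list(zip(clear_success, baseline_success))
    only_clear = sum(1 for c, b in pairs if c and not b)
    only_base = sum(1 for c, b in pairs if b and not c)
    n = only_clear + only_base      # tasks where the agents disagree
    if n == 0:
        return 1.0                  # no disagreement, no evidence either way
    k = min(only_clear, only_base)
    # Binomial(n, 0.5) tail probability, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)
```

Tasks that both agents solve or both fail carry no information about which agent is better, which is why only the discordant pairs enter the statistic.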

Figures

Figures reproduced from arXiv: 2604.07487 by Guande Wu, Han Ding, Huan Song, Linbo Liu, Lin Lee Cheong, Panpan Xu, Qiang Zhou, Yawei Wang, Yuzhe Lu, Zhichao Xu.

Figure 1: CLEAR training framework design. First, we execute each task q_i ∼ D_train m times and collect groups of replays Γ_i for q_i. We employ reflection agents π^R to generate an instance-level instruction c_i for each q_i, collected into D_SFT. Then, we initialize CAM from an open-source LLM and perform SFT on D_SFT. Finally, we further perform RL on the trained CAM, which leverages the reward signal from π^E for pol…
Figure 2: During inference, a new task q_new is sampled from D_test and is passed into π^C_θ to generate c_new ∼ π^C_θ(q_new). The auxiliary context c_new is appended to q_new and the execution agent π^E starts with q_new ⊕ c_new. To achieve this, we introduce a reflection agent π^R that performs contrastive analysis over the replay buffer Γ for data generation. Its objective is to extract high-value insights that explain …
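
The inference step in the Figure 2 caption is small enough to sketch directly. Here `cam` and `execute` are hypothetical callables standing in for the trained context model and the execution agent, and the blank-line separator is our assumption about how the concatenation ⊕ might be rendered.

```python
from typing import Callable, TypeVar

Result = TypeVar("Result")


def augment_and_run(task: str,
                    cam: Callable[[str], str],
                    execute: Callable[[str], Result],
                    separator: str = "\n\n") -> Result:
    """Generate context for a new task, append it, and run the agent."""
    context = cam(task)                       # c_new sampled from the CAM
    prompt = f"{task}{separator}{context}"    # q_new followed by c_new
    return execute(prompt)
```

Nothing is retrieved at this point: the only input to the context model is the new task itself.
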
Original abstract

Large language model agents rely on effective model context to obtain task-relevant information for decision-making. Many existing context engineering approaches rely primarily on context generated from past experience and on retrieval mechanisms that reuse this context. However, retrieved context from past tasks must be adapted by the execution agent to fit new situations, placing additional reasoning burden on the underlying LLM. To address this limitation, we propose a generative context augmentation framework using Contrastive Learning of Experience via Agentic Reflection (CLEAR). CLEAR first employs a reflection agent to perform contrastive analysis over past execution trajectories and summarize useful context for each observed task. These summaries are then used as supervised fine-tuning data to train a context augmentation model (CAM). We then further optimize CAM using reinforcement learning, where the reward signal is obtained by running the task execution agent. By learning to generate task-specific knowledge rather than retrieve knowledge from the past, CAM produces context that is better tailored to the current task. We conduct comprehensive evaluations on the AppWorld and WebShop benchmarks. Experimental results show that CLEAR consistently outperforms strong baselines. It improves task completion rate from 72.62% to 81.15% on the AppWorld test set and average reward from 0.68 to 0.74 on a subset of WebShop, compared with the baseline agent. Our code is publicly available at https://github.com/awslabs/CLEAR.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes CLEAR, a generative context augmentation framework for LLM-based agents. A reflection agent performs contrastive analysis over past execution trajectories to produce task-specific summaries; these serve as SFT data for a Context Augmentation Model (CAM). CAM is then further optimized via policy-gradient RL whose reward is the downstream task-execution success metric. The central claim is that generating tailored context (rather than retrieving past context) reduces reasoning burden on the execution agent and yields better performance. Experiments report task-completion gains from 72.62% to 81.15% on AppWorld test and average-reward gains from 0.68 to 0.74 on a WebShop subset versus baselines.

Significance. If the central claim holds after addressing the RL-fidelity concerns, the work would provide a concrete, reproducible method for shifting context engineering from retrieval to generation, with measurable gains on two standard agent benchmarks and publicly released code. This could influence subsequent agent architectures that rely on dynamic context.

major comments (3)
  1. §4.3 (RL stage of CAM training): The reward is defined solely as the execution agent's task-success metric with no auxiliary term that penalizes divergence from the reflection-agent distribution or measures context fidelity. This leaves open the possibility that policy-gradient updates reinforce fabricated or incomplete context that happens to increase short-term execution reward, directly undermining the claim that CAM produces 'better tailored' context.
  2. Experimental results (AppWorld and WebShop sections): The reported improvements (72.62% → 81.15% completion; 0.68 → 0.74 reward) are presented without ablations separating SFT-only CAM from RL-optimized CAM, without context-quality metrics (e.g., factual accuracy or completeness audits of generated summaries), and without statistical significance tests or variance across runs. These omissions make it impossible to determine whether the gains stem from genuine task-tailoring or from reward hacking.
  3. §3.2 (Contrastive reflection): The reflection agent's contrastive summaries are treated as high-quality, unbiased training targets for SFT, yet no human or automated validation of summary faithfulness to the original trajectories is reported. If these summaries already contain systematic biases or omissions, both SFT and subsequent RL will propagate them.
minor comments (2)
  1. [Method] Notation for the CAM policy and the reflection agent is introduced without a consolidated table of symbols, making it harder to track which components are frozen during RL.
  2. [Abstract / Experiments] The abstract states 'comprehensive evaluations' but supplies no information on the number of runs, random seeds, or exact baseline implementations; this should be clarified in the experimental section.
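
The run-level reporting requested here (variance across runs and a significance statistic) could look like the sketch below. The per-seed scores are toy values invented purely to exercise the code, not numbers from the paper, and Welch's t statistic is one reasonable choice among several.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence, Tuple


def mean_std(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation over independent runs."""
    return mean(scores), stdev(scores)


def welch_t(a: Sequence[float], b: Sequence[float]) -> float:
    """Welch's t statistic for two groups with possibly unequal variance."""
    mean_a, std_a = mean_std(a)
    mean_b, std_b = mean_std(b)
    standard_error = sqrt(std_a ** 2 / len(a) + std_b ** 2 / len(b))
    return (mean_a - mean_b) / standard_error


# Toy per-seed scores, invented for illustration only.
sft_only = [1.0, 2.0, 3.0]
sft_plus_rl = [2.0, 3.0, 4.0]

m, s = mean_std(sft_plus_rl)
t = welch_t(sft_plus_rl, sft_only)
print(f"SFT+RL: {m:.2f} ± {s:.2f}, Welch t vs SFT-only = {t:.3f}")
```

Reporting the same two functions for the SFT-only and RL-optimized models would also supply the ablation the second major comment asks for.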

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on the CLEAR framework. We address each major comment point-by-point below, providing clarifications on our design choices and indicating revisions where the manuscript will be updated to strengthen the claims.

  1. Referee: §4.3 (RL stage of CAM training): The reward is defined solely as the execution agent's task-success metric with no auxiliary term that penalizes divergence from the reflection-agent distribution or measures context fidelity. This leaves open the possibility that policy-gradient updates reinforce fabricated or incomplete context that happens to increase short-term execution reward, directly undermining the claim that CAM produces 'better tailored' context.

    Authors: We acknowledge the validity of this concern: the RL reward is purely the downstream task-success metric, which could theoretically encourage reward hacking if the SFT initialization is insufficient. Our design choice was to let the task metric directly optimize the ultimate goal (agent performance) while relying on the contrastive SFT data to anchor generation to trajectory-derived knowledge. However, we agree this leaves a gap in fidelity guarantees. In the revision we will add an explicit discussion of this risk, include an auxiliary KL-divergence term to the reflection distribution in the reported RL experiments, and report measured divergence statistics. revision: partial

  2. Referee: Experimental results (AppWorld and WebShop sections): The reported improvements (72.62% → 81.15% completion; 0.68 → 0.74 reward) are presented without ablations separating SFT-only CAM from RL-optimized CAM, without context-quality metrics (e.g., factual accuracy or completeness audits of generated summaries), and without statistical significance tests or variance across runs. These omissions make it impossible to determine whether the gains stem from genuine task-tailoring or from reward hacking.

    Authors: We agree these omissions weaken interpretability. The original experiments focused on end-to-end comparison of full CLEAR against baselines, but did not isolate SFT versus RL contributions or provide quality audits. In the revised manuscript we will add: (i) an ablation table comparing SFT-only CAM against the RL stage, (ii) automated context-quality metrics (entailment and completeness scores via LLM-as-judge against source trajectories), and (iii) mean ± std results over multiple runs with statistical significance tests. These additions will clarify whether gains arise from tailoring or other factors. revision: yes

  3. Referee: §3.2 (Contrastive reflection): The reflection agent's contrastive summaries are treated as high-quality, unbiased training targets for SFT, yet no human or automated validation of summary faithfulness to the original trajectories is reported. If these summaries already contain systematic biases or omissions, both SFT and subsequent RL will propagate them.

    Authors: The reflection agent is prompted to extract task-useful context via explicit contrastive analysis, but we did not report quantitative faithfulness validation in the submission. This is a genuine limitation of the current evidence. We will revise §3.2 to include both a human evaluation on a random sample of summaries (faithfulness and completeness ratings) and an automated consistency check against the original trajectories. Any identified biases will be discussed along with mitigation approaches. revision: partial
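
The auxiliary KL term raised in the first response can be illustrated with a common form of reward shaping. The sketch assumes the reference distribution is a frozen copy of the SFT-initialized CAM, used here as a proxy for the reflection distribution, and that per-token log-probabilities of the sampled context are available from both models. The function name and the coefficient `beta=0.05` are ours; the simulated rebuttal does not specify a formulation.

```python
from typing import Sequence


def shaped_reward(task_reward: float,
                  policy_logprobs: Sequence[float],
                  reference_logprobs: Sequence[float],
                  beta: float = 0.05) -> float:
    """Task reward minus a KL penalty toward a frozen reference model.

    The sum of per-token log-ratios on the sampled context is a
    single-sample estimate of KL(policy || reference) for that sequence.
    """
    if len(policy_logprobs) != len(reference_logprobs):
        raise ValueError("log-probabilities must cover the same tokens")
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, reference_logprobs))
    return task_reward - beta * kl_estimate
```

With `beta` set to zero this reduces to the pure task-success reward that the referee objects to; larger values trade execution reward for staying close to the trajectory-derived summaries.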

Circularity Check

0 steps flagged

No circularity: empirical gains measured on external held-out benchmarks


The paper describes an algorithmic pipeline (reflection agent produces summaries from past trajectories; these serve as SFT targets for CAM; CAM is then RL-tuned with downstream task-execution reward). All reported gains are measured on separate test sets (AppWorld test set, WebShop subset) using task-completion rate and averaged reward supplied by an independent execution agent. No mathematical equations, self-definitional identities, or fitted parameters are presented whose outputs are then relabeled as predictions. The central claim therefore rests on external benchmark signals rather than reducing to quantities defined inside the paper's own training loop.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities beyond standard supervised and reinforcement learning components; no additional ledger items can be identified from available text.

pith-pipeline@v0.9.0 · 5574 in / 1324 out tokens · 47087 ms · 2026-05-10T18:02:02.900923+00:00 · methodology

