CLEAR: Context Augmentation from Contrastive Learning of Experience via Agentic Reflection
Pith reviewed 2026-05-10 18:02 UTC · model grok-4.3
The pith
Agents learn to generate task-specific context from contrastive analysis of their own past execution trajectories instead of retrieving prior summaries.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CLEAR first employs a reflection agent to perform contrastive analysis over past execution trajectories and summarize useful context for each observed task. These summaries are then used as supervised fine-tuning data to train a context augmentation model (CAM). CAM is further optimized using reinforcement learning, where the reward signal is obtained by running the task execution agent. By learning to generate task-specific knowledge rather than retrieve knowledge from the past, CAM produces context that is better tailored to the current task.
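The first stage described above (contrastive reflection feeding supervised fine-tuning) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Trajectory` fields, the reward threshold, the rule that a task needs both a success and a failure, and the `reflect` callable standing in for the reflection agent are all assumptions introduced here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Trajectory:
    task_id: str
    steps: list[str]  # hypothetical: textual log of the agent's actions
    reward: float     # task-execution reward for this run


def split_contrastive(trajs, threshold=1.0):
    """Group trajectories per task into (successes, failures) by reward."""
    groups = defaultdict(lambda: ([], []))
    for t in trajs:
        wins, losses = groups[t.task_id]
        (wins if t.reward >= threshold else losses).append(t)
    return dict(groups)


def build_sft_examples(trajs, task_text, reflect):
    """Turn contrastive groups into (task prompt, guidance) SFT pairs.

    `reflect` stands in for the reflection agent: any callable that takes
    (successes, failures) and returns a guidance string.
    """
    examples = []
    for task_id, (wins, losses) in split_contrastive(trajs).items():
        if wins and losses:  # assumption: a contrast needs both outcomes
            examples.append((task_text[task_id], reflect(wins, losses)))
    return examples
```

In this sketch the resulting pairs would serve as the fine-tuning set for the context model; the reinforcement-learning stage is not shown.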
What carries the argument
The Context Augmentation Model (CAM), trained first on summaries from contrastive reflection of past trajectories and then optimized by reinforcement learning driven by task-execution rewards.
If this is right
- Task completion rate on AppWorld rises from 72.62 percent to 81.15 percent.
- Average reward on a WebShop subset rises from 0.68 to 0.74.
- The execution agent receives context already adapted to the current task and therefore spends less of its own reasoning on adaptation.
- Context provision no longer depends on retrieval from a growing store of past experiences.
- The overall agent pipeline improves consistently across the evaluated benchmarks.
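To put the two headline numbers on a common footing, the short calculation below derives the absolute and relative gains from the reported before and after values. The helper name `gain` is introduced here for illustration only.

```python
def gain(before, after):
    """Return (absolute gain, relative gain in percent), rounded to 2 places."""
    absolute = after - before
    return round(absolute, 2), round(100 * absolute / before, 2)


appworld = gain(72.62, 81.15)  # task completion rate, in percent
webshop = gain(0.68, 0.74)     # average reward

print(appworld)  # (8.53, 11.75)
print(webshop)   # (0.06, 8.82)
```

So the AppWorld improvement is 8.53 percentage points (about 11.7 percent relative), and the WebShop improvement is 0.06 reward (about 8.8 percent relative).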
Where Pith is reading between the lines
- The same reflection-plus-generation loop could reduce the size of memory stores agents must maintain for long-horizon work.
- If reflection quality can be maintained on novel task distributions, the approach may extend to domains outside the current benchmarks.
- Iterative improvement of the reflection component could steadily raise the quality of training data for the context model.
- The separation of reflection, generation, and execution stages may allow independent scaling of each piece in larger agent systems.
Load-bearing premise
The reflection agent's contrastive summaries must supply high-quality, unbiased task knowledge that serves as reliable training data without introducing errors that the reinforcement-learning stage then amplifies.
What would settle it
Run the CLEAR agent and an otherwise identical retrieval baseline on a new set of tasks. If CLEAR completes fewer tasks or earns lower reward than the baseline, the claim that generated context improves performance does not hold.
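One way such a head-to-head comparison could be scored is a paired bootstrap over per-task outcomes. The sketch below is an assumption about how the test might be run, not something the paper reports; the function name and its parameters are hypothetical.

```python
import random


def paired_bootstrap_diff(a, b, n_resamples=2000, seed=0):
    """95% bootstrap interval for mean(a) - mean(b) over paired tasks.

    `a` and `b` hold per-task scores (e.g. 1/0 success) for two systems
    evaluated on the same tasks, in the same order.
    """
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)
    n = len(a)
    diffs = []
    for _ in range(n_resamples):
        # Resample task indices with replacement, keeping pairs together.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(a[i] - b[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int(0.025 * n_resamples)]
    hi = diffs[int(0.975 * n_resamples) - 1]
    return lo, hi
```

If the interval for CLEAR minus baseline lies entirely above zero, the generated context helped on that task set; an interval that straddles or falls below zero would count against the claim.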
Original abstract
Large language model agents rely on effective model context to obtain task-relevant information for decision-making. Many existing context engineering approaches rely primarily on context generated from past experience and on retrieval mechanisms that reuse this context. However, retrieved context from past tasks must be adapted by the execution agent to fit new situations, placing additional reasoning burden on the underlying LLM. To address this limitation, we propose a generative context augmentation framework using Contrastive Learning of Experience via Agentic Reflection (CLEAR). CLEAR first employs a reflection agent to perform contrastive analysis over past execution trajectories and summarize useful context for each observed task. These summaries are then used as supervised fine-tuning data to train a context augmentation model (CAM). We then further optimize CAM using reinforcement learning, where the reward signal is obtained by running the task execution agent. By learning to generate task-specific knowledge rather than retrieve knowledge from the past, CAM produces context that is better tailored to the current task. We conduct comprehensive evaluations on the AppWorld and WebShop benchmarks. Experimental results show that CLEAR consistently outperforms strong baselines. Compared with the baseline agent, it improves task completion rate from 72.62% to 81.15% on the AppWorld test set and average reward from 0.68 to 0.74 on a subset of WebShop. Our code is publicly available at https://github.com/awslabs/CLEAR.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes CLEAR, a generative context augmentation framework for LLM-based agents. A reflection agent performs contrastive analysis over past execution trajectories to produce task-specific summaries; these serve as SFT data for a Context Augmentation Model (CAM). CAM is then further optimized via policy-gradient RL whose reward is the downstream task-execution success metric. The central claim is that generating tailored context (rather than retrieving past context) reduces reasoning burden on the execution agent and yields better performance. Experiments report task-completion gains from 72.62% to 81.15% on AppWorld test and average-reward gains from 0.68 to 0.74 on a WebShop subset versus baselines.
Significance. If the central claim holds after addressing the RL-fidelity concerns, the work would provide a concrete, reproducible method for shifting context engineering from retrieval to generation, with measurable gains on two standard agent benchmarks and publicly released code. This could influence subsequent agent architectures that rely on dynamic context.
major comments (3)
- [§4.3: RL stage of CAM training] The reward is defined solely as the execution agent's task-success metric with no auxiliary term that penalizes divergence from the reflection-agent distribution or measures context fidelity. This leaves open the possibility that policy-gradient updates reinforce fabricated or incomplete context that happens to increase short-term execution reward, directly undermining the claim that CAM produces 'better tailored' context.
- [Experimental results: AppWorld and WebShop sections] The reported improvements (72.62% → 81.15% completion; 0.68 → 0.74 reward) are presented without ablations separating SFT-only CAM from RL-optimized CAM, without context-quality metrics (e.g., factual accuracy or completeness audits of generated summaries), and without statistical significance tests or variance across runs. These omissions make it impossible to determine whether the gains stem from genuine task-tailoring or from reward hacking.
- [§3.2: Contrastive reflection] The reflection agent's contrastive summaries are treated as high-quality, unbiased training targets for SFT, yet no human or automated validation of summary faithfulness to the original trajectories is reported. If these summaries already contain systematic biases or omissions, both SFT and subsequent RL will propagate them.
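The auxiliary term that the first major comment finds missing could look roughly like the sketch below: the task reward minus a scaled estimate of how far the RL-tuned context model has drifted from a frozen reference. This is a generic KL-penalized reward, not the paper's training objective; the function names, the `beta` default, and the choice of reference (for example the SFT-only CAM) are assumptions made here.

```python
def kl_estimate(policy_logprobs, reference_logprobs):
    """Single-sample estimate of KL(policy || reference) for one context.

    Both inputs are per-token log-probabilities of the same generated
    tokens, which were sampled from the policy.
    """
    if len(policy_logprobs) != len(reference_logprobs):
        raise ValueError("log-prob sequences must cover the same tokens")
    return sum(p - r for p, r in zip(policy_logprobs, reference_logprobs))


def shaped_reward(task_reward, policy_logprobs, reference_logprobs, beta=0.1):
    """Task reward minus a penalty for drifting from the reference model."""
    return task_reward - beta * kl_estimate(policy_logprobs, reference_logprobs)
```

A larger `beta` keeps generated context closer to the reference at the cost of slower reward improvement; with `beta=0` the sketch reduces to the pure task-success reward that the comment describes.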
minor comments (2)
- [Method] Notation for the CAM policy and the reflection agent is introduced without a consolidated table of symbols, making it harder to track which components are frozen during RL.
- [Abstract / Experiments] The abstract states 'comprehensive evaluations' but supplies no information on the number of runs, random seeds, or exact baseline implementations; this should be clarified in the experimental section.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the CLEAR framework. We address each major comment point-by-point below, providing clarifications on our design choices and indicating revisions where the manuscript will be updated to strengthen the claims.
- Referee: [§4.3: RL stage of CAM training] The reward is defined solely as the execution agent's task-success metric with no auxiliary term that penalizes divergence from the reflection-agent distribution or measures context fidelity. This leaves open the possibility that policy-gradient updates reinforce fabricated or incomplete context that happens to increase short-term execution reward, directly undermining the claim that CAM produces 'better tailored' context.
  Authors: We acknowledge the validity of this concern: the RL reward is purely the downstream task-success metric, which could in principle encourage reward hacking if the SFT initialization is insufficient. Our design choice was to let the task metric directly optimize the ultimate goal (agent performance) while relying on the contrastive SFT data to anchor generation to trajectory-derived knowledge. However, we agree this leaves a gap in fidelity guarantees. In the revision we will add an explicit discussion of this risk, add an auxiliary KL-divergence term toward the reflection-agent distribution in the reported RL experiments, and report measured divergence statistics.
  Revision: partial
- Referee: [Experimental results: AppWorld and WebShop sections] The reported improvements (72.62% → 81.15% completion; 0.68 → 0.74 reward) are presented without ablations separating SFT-only CAM from RL-optimized CAM, without context-quality metrics (e.g., factual accuracy or completeness audits of generated summaries), and without statistical significance tests or variance across runs. These omissions make it impossible to determine whether the gains stem from genuine task-tailoring or from reward hacking.
  Authors: We agree these omissions weaken interpretability. The original experiments focused on end-to-end comparison of full CLEAR against baselines, but did not isolate SFT versus RL contributions or provide quality audits. In the revised manuscript we will add: (i) an ablation table comparing SFT-only CAM against the RL stage, (ii) automated context-quality metrics (entailment and completeness scores via LLM-as-judge against source trajectories), and (iii) mean ± std results over multiple runs with statistical significance tests. These additions will clarify whether gains arise from tailoring or other factors.
  Revision: yes
- Referee: [§3.2: Contrastive reflection] The reflection agent's contrastive summaries are treated as high-quality, unbiased training targets for SFT, yet no human or automated validation of summary faithfulness to the original trajectories is reported. If these summaries already contain systematic biases or omissions, both SFT and subsequent RL will propagate them.
  Authors: The reflection agent is prompted to extract task-useful context via explicit contrastive analysis, but we did not report quantitative faithfulness validation in the submission. This is a genuine limitation of the current evidence. We will revise §3.2 to include both a human evaluation on a random sample of summaries (faithfulness and completeness ratings) and an automated consistency check against the original trajectories. Any identified biases will be discussed along with mitigation approaches.
  Revision: partial
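The "mean ± std over multiple runs" reporting promised in the responses above could be produced with something as small as the following sketch. The helper names are hypothetical, and no run-level scores are given in the source, so none are shown here.

```python
import statistics


def summarize_runs(scores):
    """Return (mean, sample standard deviation) over repeated runs."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return statistics.mean(scores), statistics.stdev(scores)


def format_row(name, scores):
    """Format one table row as 'name: mean ± std (n=runs)'."""
    mean, std = summarize_runs(scores)
    return f"{name}: {mean:.2f} ± {std:.2f} (n={len(scores)})"
```

Calling `format_row` once per configuration (for example SFT-only CAM and RL-optimized CAM, each with its per-seed scores) would give the rows of an ablation table in which the spread across seeds is visible next to each mean.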
Circularity Check
No circularity: empirical gains measured on external held-out benchmarks
The paper describes an algorithmic pipeline (reflection agent produces summaries from past trajectories; these serve as SFT targets for CAM; CAM is then RL-tuned with downstream task-execution reward). All reported gains are measured on separate test sets (AppWorld test set, WebShop subset) using task-completion rate and averaged reward supplied by an independent execution agent. No mathematical equations, self-definitional identities, or fitted parameters are presented whose outputs are then relabeled as predictions. The central claim therefore rests on external benchmark signals rather than reducing to quantities defined inside the paper's own training loop.