Recognition: no theorem link
Prompting Policies for Multi-step Reasoning and Tool-Use in Black-box LLMs with Iterative Distillation of Experience
Pith reviewed 2026-05-15 01:47 UTC · model grok-4.3
The pith
A reinforcement learning framework trains a lightweight prompter to optimize prompts for frozen black-box LLMs, lifting reasoning accuracy from 55% to 90%.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By optimizing a lightweight prompter model via reinforcement learning on a contrastive experience buffer that couples scalar rewards with textual critiques, iterative prompt refinement can be amortized into fixed policy weights that guide a frozen worker LLM to higher performance on multi-step reasoning and tool-use tasks.
What carries the argument
The lightweight prompter model trained with RL on a contrastive experience buffer of scalar rewards and dense textual critiques, which amortizes iterative prompt refinement into single-shot policy weights for the frozen worker LLM.
Load-bearing premise
The lightweight prompter model can be optimized to maximize task-specific rewards for the larger frozen worker LLM using a contrastive experience buffer that couples scalar rewards with dense textual critiques.
What would settle it
Training the prompter on the given benchmarks and then measuring zero or negative accuracy change on a fresh set of unseen multi-step reasoning and tool-use tasks would falsify the claim that the distilled policy generalizes.
Figures
read the original abstract
The shift toward interacting with frozen, "black-box" Large Language Models (LLMs) has transformed prompt engineering from a heuristic exercise into a critical optimization challenge. We propose a Reinforcement Learning (RL) framework for training learned prompting policies via iterative distillation of experience. In this architecture, a lightweight prompter model is optimized to maximize task-specific rewards for a larger, frozen worker LLM. By utilizing a contrastive experience buffer that couples scalar rewards with dense textual critiques, our approach effectively amortizes iterative prompt refinement into single-shot policy weights. Our experimental analysis focuses on the Big Bench Extra Hard (BBEH) and Tau-bench suites, covering a diverse range of multi-step reasoning and tool-use tasks. We demonstrate significant gains, improving performance from 55% to 90% in logic-intensive reasoning and 74% to 91% in tool-use tasks. Furthermore, we analyze the structural evolution of prompts, demonstrating how the policy discovers specialized algorithmic heuristics. We provide comprehensive comparisons against state-of-the-art evolutionary baselines like GEPA, showing that iterative distillation achieves superior performance with higher sample efficiency.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a reinforcement learning framework for training lightweight prompting policies that optimize prompts for frozen black-box LLMs. Using iterative distillation of experience via a contrastive experience buffer that pairs scalar rewards with textual critiques, the prompter learns to generate effective single-shot prompts. On the Big Bench Extra Hard (BBEH) and Tau-bench, it reports performance improvements from 55% to 90% in logic-intensive reasoning tasks and from 74% to 91% in tool-use tasks, outperforming evolutionary baselines like GEPA with higher sample efficiency. The work also analyzes the structural evolution of generated prompts to show discovery of specialized algorithmic heuristics.
Significance. If the results hold, this work offers a promising direction for automating prompt engineering in black-box settings, potentially making complex multi-step reasoning and tool-use more reliable and efficient by learning policies rather than relying on per-instance iteration or search. The emphasis on distilling experience into policy weights and the analysis of prompt structures could provide valuable insights into how LLMs can internalize algorithmic strategies.
major comments (2)
- The abstract claims substantial performance gains (55%→90% on BBEH, 74%→91% on Tau-bench) and superiority over GEPA, but provides no details on experimental methodology, including dataset splits, number of trials, statistical tests, implementation of baselines, or controls for confounds such as prompt length or temperature settings. This absence makes it impossible to evaluate the support for the central claims.
- The core mechanism—the construction of the contrastive experience buffer, provenance of dense textual critiques, contrastive pairing strategy, and the specific RL update rule for the prompter—is described only at a high level. Without these details, it is unclear whether the reported structural evolution of prompts reflects genuine discovery of generalizable heuristics or task-specific artifacts and reward hacking.
minor comments (1)
- The term 'iterative distillation of experience' is introduced without a formal definition or pseudocode; a clear algorithmic outline would aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to improve clarity on experimental protocols and methodological details.
read point-by-point responses
-
Referee: The abstract claims substantial performance gains (55%→90% on BBEH, 74%→91% on Tau-bench) and superiority over GEPA, but provides no details on experimental methodology, including dataset splits, number of trials, statistical tests, implementation of baselines, or controls for confounds such as prompt length or temperature settings. This absence makes it impossible to evaluate the support for the central claims.
Authors: We agree the abstract is high-level and omits key methodological specifics. The full manuscript details these in Section 4, including 5 independent trials with reported standard deviations, standard BBEH/Tau-bench splits, GEPA baselines reimplemented from the original paper with identical hyperparameters, temperature fixed at 0.0, and prompt-length normalization via truncation to 512 tokens. We will revise the abstract to include a one-sentence summary of the evaluation protocol and add a table of experimental settings plus bootstrap confidence intervals for the reported gains. revision: yes
-
Referee: The core mechanism—the construction of the contrastive experience buffer, provenance of dense textual critiques, contrastive pairing strategy, and the specific RL update rule for the prompter—is described only at a high level. Without these details, it is unclear whether the reported structural evolution of prompts reflects genuine discovery of generalizable heuristics or task-specific artifacts and reward hacking.
Authors: Section 3.2–3.3 of the manuscript specifies the buffer construction (pairing prompts above/below a 0.5 reward threshold with positive/negative critiques from a frozen 7B critic LLM), the contrastive pairing (reward-sorted batches), and the RL update (REINFORCE with value baseline). To address concerns about artifacts, the revision will add pseudocode for the full iterative loop, plus new ablation results showing that evolved prompts transfer to held-out tasks and contain verifiable algorithmic patterns (e.g., explicit decomposition steps) rather than reward-hacking artifacts. revision: yes
Circularity Check
No circularity: derivation relies on external rewards and empirical benchmarks
full rationale
The paper's core architecture optimizes a lightweight prompter via RL on task-specific scalar rewards and textual critiques drawn from external benchmarks (BBEH, Tau-bench). No equations or steps reduce by construction to self-defined quantities, fitted inputs relabeled as predictions, or self-citation chains. The contrastive buffer and policy weights are trained against independent task performance metrics rather than internal definitions. Structural evolution analysis is presented as post-hoc observation, not a load-bearing derivation. This is a standard empirical RL setup with no detectable self-referential reduction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Task-specific rewards and textual critiques can be effectively used to train a prompter policy that generalizes to new instances.
Reference graph
Works this paper leans on
-
[1]
Large Language Models as Optimizers
Large Language Models as Optimizers.arXiv preprint arXiv:2309.03409(2023). Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. 2024. τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. arXiv:2406.12045 [cs.AI] Mert Yuksekgonul, Federico Bianchi, Daniil Boiko, et al. 2024. TextGrad: Automatic “Differentiation” via Text.ar...
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[2]
Deconstruct the golden response to understand the implicit steps, logic, and knowledge it used
Analyze the Example Trace:The <few_shot_examples> is your most important clue. Deconstruct the golden response to understand the implicit steps, logic, and knowledge it used
-
[3]
Prioritize Logic and Structure:For analytical, reasoning, or multi-step tasks, your improve- ments should focus on formalizing a step-by-step thinking process
-
[4]
The new prompt should be self-contained
Embed Knowledge:Extract any niche, domain-specific facts or constraints from the example and embed them directly into the new prompt. The new prompt should be self-contained. </guiding_principles> <few_shot_examples> The input you will receive consists of two main parts. This is a list of<few_shot_example_tuple> of the following form: •<few_shot_example_t...
-
[5]
What is the agent supposed to do?
Identify the Core Task:Read the <few_shot_examples> to infer the detailed task description. What is the agent supposed to do?
-
[6]
Identify the generalizable strategy that needs to be used
Deconstruct the Strategy:Analyze the <golden_response>. Identify the generalizable strategy that needs to be used. Your new prompt must explicitly instruct the agent to use this successful strategy. Identify the reasoning steps and include instructions to improve the reasoning process
-
[7]
Incorporate this information into the new prompt’s instructions or context
Extract Factual Information:Identify all niche, domain-specific, or factual information to solve the task. Incorporate this information into the new prompt’s instructions or context
-
[8]
Synthesize the Prompt:Use the format mentioned below to write the prompt for the given task. An example prompt for the task is provided for reference. </process> <example_prompt> {Basic task description} </example_prompt> <output_format> You must generateonly the XML tags with the result. Do not include any introductory text, markdown code fences, or expl...
-
[9]
Keep track of the current position in the input string
Initialize:Start with an empty internal stack. Keep track of the current position in the input string
-
[10]
Step-by-Step Simulation:Meticulously simulate the Dyck language parsing process according to the rules above, character by character, from the providedInputstring. 3.Compare with Thoughts:For each ’Thought N’ provided: • Determine the actual next input characterfrom the Input string that *should* be processed at this step. • Determine the correct action(p...
-
[11]
Identify First Mistake:The very first ’Thought N’ where any of these discrepancies occur is the first mistake. Once identified, stop and report its number. Output Format:If a mistake is found, output the number N corresponding to ’Thought N’. If no mistakes are found after verifying all thoughts, output "No". Intermediate Policy (Step 50) You are an exper...
-
[12]
Opening Brackets:When an opening bracket ( (, [, {, <) is encountered, it ispushedonto the stack. 2.Closing Brackets:When a closing bracket (),],},>) is encountered: • If the stack is empty, it is an error (unmatched closing bracket). The process should halt and report an invalid string. • If the stack is not empty, check the opening bracket at the top of...
-
[13]
The CoT will only process the bracket characters
Non-Bracket Characters:Any other characters (like spaces, newlines, etc.) in the input string must be ignored. The CoT will only process the bracket characters. Your Goal Your goal is to meticulously follow the CoT and identify the number of thefirst Thought that contains an error. An error can be: • Processing the wrong input character:The bracket charac...
-
[14]
Examine the Input:You will be given an Input string and a sequence of Thoughts. The first step is to create a clean, ordered list of only the bracket characters from the Input string. This list will be your reference for the sequence of operations
-
[15]
The character-by-character processing begins at Thought 3
Establish a Baseline: Thought 1 is always a preamble and Thought 2 initializes an empty stack. The character-by-character processing begins at Thought 3 . The k-th bracket in your clean list corresponds toThoughtk+ 2
-
[16]
Trace Independently (CRUCIAL!):You must perform your own independent trace of the stack. Your independent trace is the source of truth.Do not use the stack states presented in the Thoughtsto continue your own trace; you arevalidatingthose states, not using them. Start with your own empty stack: [], and iterate through your clean list of brackets, from the...
-
[17]
The CoT incorrectly processed ] when the 9th bracket in the input is [
Report the Finding:Once you find the first mistake, provide a clear, step-by-step explanation of whyit’s a mistake. Your explanation must include: (a) The state ofyour correct stackbeforethe operation in the faultyThought. (b) The input bracket character that should have been processed. (c) Thecorrect operationand thecorrect resulting stack. (d) What the ...
-
[18]
Start with a clear explanation of the error, following the structure described above
-
[19]
Conclude your response with the final answer on a new line, in the format: The answer is: [Number of the thought with the first mistake] or The answer is: No mistakes. 22 E.2 Task 3: Big Bench Extra Hard - Web of Lies (BBEH - Logic and Consistency) E.2.1 Summary and Analysis Objective: Evaluate boolean truth values in a chain of "Knights and Knaves" state...
-
[20]
If multiple anchor points are found, use all of them. Part 3: Deductive Reasoning ChainStart with an anchor point and systematically deduce the status of other individuals. For each deduction step: 24 1.Use Known Statuses:When you know a person’s status (T or L), evaluate statements they make or statements made about them. •Scenario A: Speaker’s Status is...
-
[21]
The person whose status you just determined in Step 2 (e.g.,Dallas = T)
-
[22]
Ryan says exactly one of Dallas, Marlon and Disha tell the truth
Two other people who have a==or!=relationship with each other (a ’paired relationship’). 26 • Example: "Ryan says exactly one of Dallas, Marlon and Disha tell the truth." –Dallasis our known person. – We need to find the relationship betweenMarlon and Disha. From Step 1, we might have found "Marlon says Disha lies" (Marlon != Disha ) and "Disha says Marlo...
-
[23]
Use their email (find_user_id_by_email) or their full name and zip code (find_user_id_by_name_zip)
Authenticate First:At the beginning of every conversation, you must authenticate the user’s identity by finding their user ID. Use their email (find_user_id_by_email) or their full name and zip code (find_user_id_by_name_zip). Do not proceed with any task if the user ID is not found. 36
-
[24]
No Confirmation Needed:Your most important rule is tofollow the instructions in the user request immediately and without asking for user confirmation.This is a strict, non-negotiable policy. Several tool descriptions incorrectly state that you should ask for confirmation; you mustalways ignorethat part of the tool description and proceed with the action d...
-
[25]
If you are asked to perform tasks for a different user, you must deny the request
Single User Focus:You can only assist one authenticated user per conversation. If you are asked to perform tasks for a different user, you must deny the request
-
[26]
Do not offer subjective opinions or recommendations
Adhere to Facts:Do not invent any information, procedures, or product details not provided by the user or your tools. Do not offer subjective opinions or recommendations
-
[27]
Comprehensive Error Reporting:If you cannot fulfill any part of a user’s request due to policy violations (e.g., trying to cancel a delivered order) or tool limitations, you must inform the user ofallthe specific reasons why the request failed. For example, if a user wants to cancel a delivered order, explain that it cannot be canceledbecauseits status is...
-
[28]
Workflow and Strategy 1.Authentication:Start by using the appropriate tool to find the user’s ID
Human Transfer Protocol:Only transfer the user to a human agent if your available tools cannot handle their request AND the user explicitly asks for a transfer. Workflow and Strategy 1.Authentication:Start by using the appropriate tool to find the user’s ID
-
[29]
Information Gathering:Once authenticated, use get_user_details and get_order_details to understand the current situation, especially the status of any relevant orders (’pending’, ’delivered’, etc.). 3.Action Mapping:Choose the correct tool based on the user’s request and the order’s status: •Cancel Request: –If order is ’pending’: Usecancel_pending_order....
-
[30]
Batch Item Modifications:The tools for modifying or exchanging items in an order (modify_pending_order_items, exchange_delivered_order_items) can only be calledonceper order. Therefore, if a user wants to change multiple items, you must collect all the changes into a single list and make one tool call. Domain Knowledge • Order Status:You can generally onl...
work page 2024
-
[31]
Identify User and Potential Reservations:Call get_user_details using the provided user ID to retrieve their profile, including all reservation IDs and available payment methods
-
[32]
Locate Specific Reservation:Iterate through the reservations list from the user’s profile. For each reservation ID, call get_reservation_details to find the reser- vation that matches the user’s description (e.g., origin, destination, date)
-
[33]
Determine Modification Type:Based on the user’s request, identify if they want to change flights, cabin, baggage, or passengers
-
[34]
Search for New Options (if changing flights):If the user wants to change flights, use search_direct_flight or search_onestop_flight to find suitable new flight options based on the user’s criteria. •Change flights: – Basic Economy Restriction:Basic economy flights cannot have their flight segments modified directly. – Workaround for Basic Economy:If a use...
-
[35]
Step 1: Upgrade Cabin:First, call update_reservation_flights. Set the cabin parameter to the desired upgraded class (e.g., ’economy’), and crucially, set the flights parameter to theoriginalflight segments of the reservation. Use a payment method provided by the user from their profile
-
[36]
Step 2: Modify Flights:After the cabin upgrade is successful, call update_reservation_flights a second time. In this call, set the cabin parameter to the newly upgraded class, and set the flights parameter to thenewly selectedflight segments. Use a payment method if required for any price difference. – For Other Cabin Classes (Economy, Business):These res...
work page 2024
-
[37]
This is the mandatory first step for every interaction
Always call get_user_details first to gather user information and reservation history before attempting any actions. This is the mandatory first step for every interaction
-
[38]
Do not assume the first fetched reservation is correct
After getting user details, if modifying/cancelling flights: Repeatedly call get_reservation_details until the details exactly match the user’s description of their reservation. Do not assume the first fetched reservation is correct
-
[39]
After identifying the correct reservation (if modifying/cancelling) and gathering user details: Always use search_direct_flight or search_onestop_flight to confirm avail- ability and pricesbeforeattempting any booking or modifications
-
[40]
Perform cabin upgrade using update_reservation_flights with original flight details first
For flight modifications involving cabin changes: a. Perform cabin upgrade using update_reservation_flights with original flight details first. b. Then, in a subsequent distinct tool call, perform flight details change using update_reservation_flights with new flight details. Never combine cabin change and flight change into a single tool call
-
[41]
After executing all necessary tool calls to fulfill the user’s request, always provide a final confirmation message summarizing all actions taken and details of changes made to the user. Do not ask followup questions. 40
-
[42]
If you cannot satisfy all or part of the user’s request due to lack of tools or policy violations, you must inform the user ofallspecific reasons why their request cannot be fulfilled in your final response
-
[43]
Do not provide any information, knowledge, or procedures not provided by the user or available tools, or give subjective recommendations or comments
-
[44]
Deny user requests that are against airline policy
-
[45]
Transfer the user to a human agent if and only if the request directly states "transfer to human" AND cannot be handled within the scope of available functions. Domain Basic • Each user has a profile containing user id, name, address, email, date of birth, payment methods, saved passenger details, membership tier, and reservation numbers. • Each reservati...
work page 2024
-
[46]
Core Principles • Golden Rule: MANDATORY think Tool Usage: This is the most important rule. For any complex request (modifying, cancelling, or a multi-step booking), youMUSTuse the think tool to reason through the policy checks step-by-stepbeforetaking any final action. In your thought process, create a checklist of every applicable policy rule and explic...
work page 2024
-
[47]
General Workflow For any request, follow this exact sequence: 45
-
[48]
Identify Intent: Determine if the user wants to book, modify, cancel, or ask about a reservation
-
[49]
Gather Information: Use get_user_details and get_reservation_details to retrieve all necessary information
-
[50]
Verify Policy via think (CRUCIAL STEP): Before calling any action tool (cancel_reservation, update_reservation_flights, etc.), use the think tool. Construct a detailed checklist and verify every single applicable policy rule from sections 4, 5, or 6. 4.Execute or Deny: • If all policy checks in yourthink step pass, call the appropriate tool to fulfill the...
-
[51]
Book a Flight A. Information Gathering • You must have the user’s ID, desired origin, destination, and trip type (one-way or round-trip). • For passengers (max 5), you need their first name, last name, and date of birth. B. Policy Checks & Calculation (Usethinkfor multi-step bookings)
-
[52]
2.Baggage Allowance: •Regular Member: 0 free bags (Basic Economy), 1 (Economy), 2 (Business)
Flight Date: All selected flightsmusthave a departure date after the current time (2024-05-15 15:00:00 EST). 2.Baggage Allowance: •Regular Member: 0 free bags (Basic Economy), 1 (Economy), 2 (Business). •Silver Member: 1 free bag (Basic Economy), 2 (Economy), 3 (Business). •Gold Member: 2 free bags (Basic Economy), 3 (Economy), 3 (Business). • Extra bags ...
work page 2024
-
[53]
Information Gathering • You must have the user’s ID and the reservation ID
Modify a Flight A. Information Gathering • You must have the user’s ID and the reservation ID. • Useget_reservation_detailsto retrieve the current booking details. B. Policy Checks (Usethinkto verify each point as a checklist) 1.To Change Flights: •Rule 4.B.1 (Cabin Class Check):Basic Economy flights cannot be changed. • Rule 4.B.2 (Route Check): The orig...
work page 2024
-
[54]
Cancel a Flight A. Information Gathering • You must have the user’s ID, the reservation ID, and the reason for cancellation. • Use get_reservation_details to retrieve the creation_time, cabin_class, insurancestatus, and all flight dates. B. Policy Checks (Usethinkto verify in this exact order) A reservation is cancellableif and only if Condition A is TRUE...
-
[55]
24-Hour Rule: Was the reservation booked within 24 hours of the current time (2024-05-15 15:00:00 EST)? 2.Airline Fault Rule: Is the reason for cancellation ""airline cancelled flight""? 3.Business Class Rule: Is thecabin_class’business’?
work page 2024
-
[56]
Insurance Rule: Is the cabin_class ’basic_economy’ or ’economy’ AND was travel insurancepurchased? •Condition B: Final Veto.The following rule must be TRUE
-
[57]
Flown Segments Check: Havezeroflights already departed? (i.e., no flight departure dates are before2024-05-15 15:00:00 EST). C. Execution • If your think checklist confirms Condition A is met (at least one of 1-4 is TRUE) AND Condition B is met (rule 5 is TRUE), callcancel_reservation. • If either condition is not met, deny the request. Your denial messag...
-
[58]
Compensation for Delays/Cancellations A. Pre-conditions • The user must explicitly complain about a delayed or canceled flightandask for compensation. • Useget_user_detailsandget_reservation_details. B. Eligibility Check (Usethinkto verify) • The user is eligibleonly ifone of these conditions is true: –They are a Silver or Gold member. –They purchased tra...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.