Reward Is Enough: LLMs Are In-Context Reinforcement Learners
Pith reviewed 2026-05-22 13:30 UTC · model grok-4.3
The pith
Large language models can optimize scalar rewards during inference by incorporating past responses and scores into successive prompts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LLMs perform reinforcement learning during inference through ICRL prompting, in which each new response is generated from a context that concatenates all earlier outputs together with their scalar rewards, producing measurable quality gains on tasks including Game of 24 and Olympiad math without parameter changes.
What carries the argument
ICRL prompting, a multi-round scheme that appends each response and its scalar reward to the growing context so the model can optimize subsequent generations toward higher rewards.
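The multi-round scheme described above can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the prompt wording, the names `build_icrl_prompt` and `icrl_loop`, and the `generate` and `reward_fn` callables (stand-ins for an LLM call and a task scorer) are all hypothetical.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, float]]


def build_icrl_prompt(task: str, history: History) -> str:
    """Concatenate the task with every earlier (response, reward) pair.

    The wording of each line is an assumption; the source only says that
    prior responses and their scalar rewards are concatenated.
    """
    lines = [f"Task: {task}"]
    for i, (response, reward) in enumerate(history, start=1):
        lines.append(f"Attempt {i}: {response}")
        lines.append(f"Reward {i}: {reward}")
    lines.append("Give a new attempt that earns a higher reward.")
    return "\n".join(lines)


def icrl_loop(
    task: str,
    generate: Callable[[str], str],      # hypothetical LLM call
    reward_fn: Callable[[str], float],   # hypothetical scalar scorer
    rounds: int,
) -> History:
    """Run `rounds` rounds, growing the context after each response."""
    history: History = []
    for _ in range(rounds):
        prompt = build_icrl_prompt(task, history)
        response = generate(prompt)
        history.append((response, reward_fn(response)))
    return history
```

No model parameters change anywhere in this loop; the only state carried between rounds is the growing `history` list.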
Load-bearing premise
The measured gains arise specifically because the LLM optimizes over the provided scalar reward signals rather than from the generic benefits of longer context or repeated prompting.
What would settle it
Replacing task-relevant rewards with random or constant values should eliminate the progressive quality gains across rounds if the effect depends on meaningful reinforcement signals.
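One way to set up this test is to swap the recorded rewards for uninformative ones before they enter the context. The helper below is a hedged sketch: `ablate_rewards`, its mode names, and the default constant value are assumptions made for illustration, not part of the paper.

```python
import random
from typing import List

MODES = ("original", "constant", "random", "shuffled")


def ablate_rewards(
    rewards: List[float],
    mode: str,
    seed: int = 0,
    constant: float = 0.0,
) -> List[float]:
    """Return a reward list of the same length with its signal degraded.

    original: unchanged copy (the baseline condition)
    constant: every reward replaced by one fixed value
    random:   uniform draws between the observed minimum and maximum
    shuffled: the same values, reassigned to responses at random
    """
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    if not rewards:
        return []
    rng = random.Random(seed)  # seeded so each condition is reproducible
    if mode == "original":
        return list(rewards)
    if mode == "constant":
        return [constant] * len(rewards)
    if mode == "random":
        low, high = min(rewards), max(rewards)
        return [rng.uniform(low, high) for _ in rewards]
    shuffled = list(rewards)
    rng.shuffle(shuffled)
    return shuffled
```

Every mode keeps the number of (response, reward) pairs fixed, so any difference in round-over-round gains between conditions would point to the reward values themselves rather than to the amount of context.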
Original abstract
Reinforcement learning (RL) is a framework for solving sequential decision-making problems. In this work, we demonstrate that, surprisingly, RL emerges during the inference time of large language models (LLMs), a phenomenon we term in-context RL (ICRL). To reveal this capability, we introduce a simple multi-round prompting framework, we call ICRL prompting, for inference-time self-improvement. The goal of ICRL prompting is to guide LLMs to perform reinforcement learning during inference for self-improvement on a given task. After each response, the model receives numerical scalar feedback, denoted as a reward. In the next round, we prompt the LLM again together with a context that concatenates all prior responses and their associated rewards. We consistently observe that response quality improves as the context grows. In other words, the LLM can optimize scalar reward signals during inference, exhibiting behavior analogous to reinforcement learning. We evaluate ICRL prompting on Game of 24, creative writing, ScienceWorld, and Olympiad-level math competitions (AIME and HMMT), demonstrating significant improvements over baselines such as Self-Refine and Reflexion. Notably, even when the reward signals are generated by the same LLM, ICRL prompting still improves performance, highlighting a promising new paradigm for test-time scaling.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that reinforcement learning emerges during LLM inference through a multi-round prompting framework called ICRL prompting. After generating a response, the model receives a scalar reward and the next prompt concatenates all prior (response, reward) pairs; response quality is reported to improve as context grows. Evaluations on Game of 24, creative writing, ScienceWorld, and Olympiad math (AIME/HMMT) show gains over Self-Refine and Reflexion, including when rewards are self-generated by the LLM.
Significance. If the central claim is substantiated, the work identifies a new inference-time scaling mechanism in which LLMs perform reward optimization in context without parameter updates. The multi-task empirical results and the finding that self-generated rewards suffice are notable strengths that could inform broader test-time compute strategies.
major comments (2)
- [Experimental design (Section 4 / Evaluation)] No ablation is described that holds context length fixed while randomizing or replacing the scalar reward values (e.g., using random numbers, constants, or shuffled rewards). Without this control, observed gains cannot be confidently attributed to RL-style optimization over the specific reward signals rather than generic accumulation of in-context examples or iterative prompting. This directly bears on the weakest assumption and the distinction from prior refinement methods.
- [Results reporting (Abstract and Section 4)] Claims of 'consistent outperformance' and 'significant improvements' are presented without reported details on the number of independent runs, standard deviations, statistical tests, or exact reward-generation procedures. These omissions leave the reliability of the central empirical claim only moderately supported.
minor comments (2)
- [ICRL Prompting Framework] The prompt template and exact concatenation format for ICRL prompting would benefit from an explicit example or pseudocode in the main text to improve reproducibility.
- [Related Work] Ensure the related-work discussion explicitly contrasts ICRL with Reflexion and Self-Refine on the dimension of scalar reward usage versus textual feedback.
Simulated Author's Rebuttal
We thank the referee for their constructive comments. We address each major comment below and describe the revisions planned for the manuscript.
- Referee: Experimental design (Section 4 / Evaluation): No ablation is described that holds context length fixed while randomizing or replacing the scalar reward values (e.g., using random numbers, constants, or shuffled rewards). Without this control, observed gains cannot be confidently attributed to RL-style optimization over the specific reward signals rather than generic accumulation of in-context examples or iterative prompting. This directly bears on the weakest assumption and the distinction from prior refinement methods.
  Authors: We agree that an ablation holding context length fixed while randomizing or replacing rewards would provide clearer evidence that gains arise from optimizing the specific scalar signals rather than from iterative prompting or context accumulation alone. In the revised manuscript we will add this control experiment on Game of 24 and the math tasks, comparing original rewards against random numbers, constant values, and shuffled rewards while keeping total context length identical. This directly addresses the distinction from prior refinement methods.
  Revision: yes
- Referee: Results reporting (Abstract and Section 4): Claims of 'consistent outperformance' and 'significant improvements' are presented without reported details on the number of independent runs, standard deviations, statistical tests, or exact reward-generation procedures. These omissions leave the reliability of the central empirical claim only moderately supported.
  Authors: We appreciate the call for greater transparency. The revised manuscript will report the number of independent runs per task, include standard deviations or error bars, describe any statistical tests performed, and provide precise details on reward generation (including the exact prompting used for self-generated rewards). These additions will be placed in Section 4 and referenced in the abstract.
  Revision: yes
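The kind of per-task reporting requested here can be sketched in a few lines. The function name `summarize_runs` and the choice of the sample standard deviation are assumptions for illustration; the source does not say which statistics the authors will use.

```python
import statistics
from typing import Dict, Sequence


def summarize_runs(scores: Sequence[float]) -> Dict[str, float]:
    """Summarize one task's scores across independent runs.

    Returns the run count, mean, sample standard deviation (n - 1
    denominator), and standard error of the mean.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two independent runs")
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)
    return {"n": n, "mean": mean, "std": std, "sem": std / n ** 0.5}
```

Reporting `n` alongside the spread lets a reader judge whether a gap between ICRL prompting and a baseline exceeds run-to-run variation.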
Circularity Check
No circularity: purely empirical demonstration with no self-referential derivation
The paper advances an empirical claim that LLMs exhibit in-context RL behavior when prompted with concatenated prior responses and scalar rewards, supported by performance gains on Game of 24, creative writing, ScienceWorld, and math benchmarks relative to Self-Refine and Reflexion. No equations, fitted parameters, or derivation chain are presented that reduce the observed improvements to inputs defined within the paper itself; the central result is a direct experimental observation of quality scaling with context length and reward feedback. The work contains no load-bearing self-citations, uniqueness theorems, or ansatzes that would render the outcome tautological by construction, making the demonstration self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Concatenating prior responses and their scalar rewards in the prompt enables the LLM to improve subsequent responses in a manner analogous to reinforcement learning.
invented entities (1)
- ICRL prompting framework: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation washburn_uniqueness_aczel (echoes)
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Passage: "the LLM is able to maximize the scalar reward signal during the inference time, just like an RL algorithm... the scalar reward signal is the only feedback we provide"
- IndisputableMonolith/Foundation/AbsoluteFloorClosure absolute_floor_iff_bare_distinguishability (echoes)
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Passage: "performance drop when the reward is absent... performance drop with short context"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 4 Pith papers
- Continual Harness: Online Adaptation for Self-Improving Foundation Agents
  Continual Harness automates online self-improvement for foundation-model embodied agents by refining prompts, sub-agents, skills, and memory within one run, cutting button-press costs on Pokemon Red and Emerald and cl...
- Self-Distilled Reinforcement Learning for Co-Evolving Agentic Recommender Systems
  CoARS enables co-evolving recommender and user agents by using interaction-derived rewards and self-distilled credit assignment to internalize multi-turn feedback into model parameters, outperforming prior agentic baselines.
- When Do We Need LLMs? A Diagnostic for Language-Driven Bandits
  Lightweight numerical bandits on text embeddings match or exceed LLM accuracy in contextual bandits at a fraction of the cost, with an embedding-based diagnostic to choose between them.
- A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence
  The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.