Reward Is Enough: LLMs Are In-Context Reinforcement Learners

Amir Moeini; Kefan Song; Lei Gong; Peng Wang; Rohan Chandra; Shangtong Zhang; Yanjun Qi

arxiv: 2506.06303 · v6 · submitted 2025-05-21 · 💻 cs.LG · cs.AI · cs.CL

Pith reviewed 2026-05-22 13:30 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL

keywords in-context reinforcement learning · LLM self-improvement · test-time scaling · multi-round prompting · scalar reward optimization · Game of 24 · math competitions

The pith

Large language models can optimize scalar rewards during inference by incorporating past responses and scores into successive prompts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that reinforcement learning behavior emerges at inference time in LLMs when they receive numerical scalar feedback on their outputs. A multi-round prompting method builds a context of all prior answers paired with their rewards, and the model generates improved responses in later rounds without any weight updates. This process is tested across puzzle solving, creative writing, scientific simulation, and high-level math contests, where it yields better results than prior self-correction techniques. The same LLM can supply the rewards and still show gains, suggesting a route to test-time improvement driven by reward signals in context.

Core claim

LLMs perform reinforcement learning during inference through ICRL prompting, in which each new response is generated from a context that concatenates all earlier outputs together with their scalar rewards, producing measurable quality gains on tasks including Game of 24 and Olympiad math without parameter changes.

What carries the argument

ICRL prompting, a multi-round scheme that appends each response and its scalar reward to the growing context so the model can optimize subsequent generations toward higher rewards.
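The multi-round scheme described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the prompt wording, the function names `build_prompt` and `icrl_prompting`, and the `generate` and `reward_fn` callables are all assumptions introduced here, standing in for whatever LLM call and reward source an experiment actually uses.

```python
from typing import Callable, List, Tuple

# One entry per round: (response, scalar reward).
History = List[Tuple[str, float]]


def build_prompt(task: str, history: History) -> str:
    """Concatenate the task with every earlier response and its reward.

    The template text is hypothetical; the paper's exact format is not
    reproduced in this review.
    """
    lines = [f"Task: {task}"]
    for i, (response, reward) in enumerate(history, start=1):
        lines.append(f"Attempt {i}: {response}")
        lines.append(f"Reward {i}: {reward}")
    lines.append("Write a new attempt that earns a higher reward.")
    return "\n".join(lines)


def icrl_prompting(
    task: str,
    generate: Callable[[str], str],      # hypothetical stand-in for an LLM call
    reward_fn: Callable[[str], float],   # hypothetical scalar reward source
    rounds: int,
) -> History:
    """Run `rounds` rounds of prompting; no model weights are updated."""
    history: History = []
    for _ in range(rounds):
        prompt = build_prompt(task, history)   # context grows every round
        response = generate(prompt)
        history.append((response, reward_fn(response)))
    return history


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs without a model or an API key.
    def toy_generate(prompt: str) -> str:
        return "x" * (prompt.count("Attempt ") + 1)

    def toy_reward(response: str) -> float:
        return float(len(response))

    for response, reward in icrl_prompting("demo", toy_generate, toy_reward, 3):
        print(response, reward)
```

The only state carried between rounds is the text of the context, which is what distinguishes this setup from weight-updating RL.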

Load-bearing premise

The measured gains arise specifically because the LLM optimizes over the provided scalar reward signals rather than from the generic benefits of longer context or repeated prompting.

What would settle it

Replacing task-relevant rewards with random or constant values should eliminate the progressive quality gains across rounds if the effect depends on meaningful reinforcement signals.
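One way to set up such a control is to leave every response in the context untouched and alter only the reward values. The sketch below is an assumption about how the control could be built, not a procedure taken from the paper; the function names and the reward range are hypothetical.

```python
import random
from typing import List, Tuple

# One entry per round: (response, scalar reward).
History = List[Tuple[str, float]]


def constant_rewards(history: History, value: float = 0.0) -> History:
    """Replace every reward with the same uninformative constant."""
    return [(response, value) for response, _ in history]


def shuffled_rewards(history: History, seed: int = 0) -> History:
    """Keep the same set of reward values but break their pairing with responses."""
    rewards = [reward for _, reward in history]
    random.Random(seed).shuffle(rewards)
    return [(response, reward) for (response, _), reward in zip(history, rewards)]


def random_rewards(history: History, low: float, high: float, seed: int = 0) -> History:
    """Draw each reward uniformly from [low, high], ignoring response quality."""
    rng = random.Random(seed)
    return [(response, round(rng.uniform(low, high), 1)) for response, _ in history]


if __name__ == "__main__":
    real = [("attempt A", 2.0), ("attempt B", 7.0), ("attempt C", 9.0)]
    print(constant_rewards(real))
    print(shuffled_rewards(real))
    print(random_rewards(real, low=0.0, high=10.0))
```

Each variant returns a history with the same responses in the same order, so the context differs from the real-reward condition only in the numbers shown. Matching the context length exactly would also require formatting all rewards with the same number of digits.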

Original abstract

Reinforcement learning (RL) is a framework for solving sequential decision-making problems. In this work, we demonstrate that, surprisingly, RL emerges during the inference time of large language models (LLMs), a phenomenon we term in-context RL (ICRL). To reveal this capability, we introduce a simple multi-round prompting framework, we call ICRL prompting, for inference-time self-improvement. The goal of ICRL prompting is to guide LLMs to perform reinforcement learning during inference for self-improvement on a given task. After each response, the model receives numerical scalar feedback, denoted as a reward. In the next round, we prompt the LLM again together with a context that concatenates all prior responses and their associated rewards. We consistently observe that response quality improves as the context grows. In other words, the LLM can optimize scalar reward signals during inference, exhibiting behavior analogous to reinforcement learning. We evaluate ICRL prompting on Game of 24, creative writing, ScienceWorld, and Olympiad-level math competitions (AIME and HMMT), demonstrating significant improvements over baselines such as Self-Refine and Reflexion. Notably, even when the reward signals are generated by the same LLM, ICRL prompting still improves performance, highlighting a promising new paradigm for test-time scaling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that reinforcement learning emerges during LLM inference through a multi-round prompting framework called ICRL prompting. After generating a response, the model receives a scalar reward and the next prompt concatenates all prior (response, reward) pairs; response quality is reported to improve as context grows. Evaluations on Game of 24, creative writing, ScienceWorld, and Olympiad math (AIME/HMMT) show gains over Self-Refine and Reflexion, including when rewards are self-generated by the LLM.

Significance. If the central claim is substantiated, the work identifies a new inference-time scaling mechanism in which LLMs perform reward optimization in context without parameter updates. The multi-task empirical results and the finding that self-generated rewards suffice are notable strengths that could inform broader test-time compute strategies.

major comments (2)

[Experimental design (Section 4 / Evaluation)] No ablation is described that holds context length fixed while randomizing or replacing the scalar reward values (e.g., using random numbers, constants, or shuffled rewards). Without this control, the observed gains cannot be confidently attributed to RL-style optimization over the specific reward signals rather than to generic accumulation of in-context examples or iterative prompting. This bears directly on the paper's weakest assumption and on its distinction from prior refinement methods.
[Results reporting (Abstract and Section 4)] Claims of 'consistent outperformance' and 'significant improvements' are presented without details on the number of independent runs, standard deviations, statistical tests, or exact reward-generation procedures. These omissions leave the central empirical claim only moderately supported.
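The reporting requested in the second comment amounts to a small summary per task and method. A minimal sketch follows, assuming each independent run yields one score (for example, a success rate); the function name `summarize_runs` and the example numbers are illustrative and are not results from the paper.

```python
import math
import statistics
from typing import Dict, Sequence


def summarize_runs(scores: Sequence[float]) -> Dict[str, float]:
    """Summarize one method's scores across independent runs."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two independent runs")
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)  # sample standard deviation (divides by n - 1)
    return {"n": n, "mean": mean, "std": std, "sem": std / math.sqrt(n)}


if __name__ == "__main__":
    # Made-up scores from three hypothetical runs, for illustration only.
    print(summarize_runs([0.5, 0.7, 0.9]))
```

Reporting `n` next to the mean and standard deviation lets a reader judge whether a gap between two methods exceeds run-to-run noise.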

minor comments (2)

[ICRL Prompting Framework] The prompt template and exact concatenation format for ICRL prompting would benefit from an explicit example or pseudocode in the main text to improve reproducibility.
[Related Work] Ensure the related-work discussion explicitly contrasts ICRL with Reflexion and Self-Refine on the dimension of scalar reward usage versus textual feedback.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address each major comment below and describe the revisions planned for the manuscript.

Referee: Experimental design (Section 4 / Evaluation): No ablation is described that holds context length fixed while randomizing or replacing the scalar reward values (e.g., using random numbers, constants, or shuffled rewards). Without this control, observed gains cannot be confidently attributed to RL-style optimization over the specific reward signals rather than generic accumulation of in-context examples or iterative prompting. This directly bears on the weakest assumption and the distinction from prior refinement methods.

Authors: We agree that an ablation holding context length fixed while randomizing or replacing rewards would provide clearer evidence that gains arise from optimizing the specific scalar signals rather than from iterative prompting or context accumulation alone. In the revised manuscript we will add this control experiment on Game of 24 and the math tasks, comparing original rewards against random numbers, constant values, and shuffled rewards while keeping total context length identical. This directly addresses the distinction from prior refinement methods.
revision: yes

Referee: Results reporting (Abstract and Section 4): Claims of 'consistent outperformance' and 'significant improvements' are presented without reported details on the number of independent runs, standard deviations, statistical tests, or exact reward-generation procedures. These omissions leave the reliability of the central empirical claim only moderately supported.

Authors: We appreciate the call for greater transparency. The revised manuscript will report the number of independent runs per task, include standard deviations or error bars, describe any statistical tests performed, and provide precise details on reward generation (including the exact prompting used for self-generated rewards). These additions will be placed in Section 4 and referenced in the abstract.
revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical demonstration with no self-referential derivation

full rationale

The paper advances an empirical claim that LLMs exhibit in-context RL behavior when prompted with concatenated prior responses and scalar rewards, supported by performance gains on Game of 24, creative writing, ScienceWorld, and math benchmarks relative to Self-Refine and Reflexion. No equations, fitted parameters, or derivation chain are presented that reduce the observed improvements to inputs defined within the paper itself; the central result is a direct experimental observation of quality scaling with context length and reward feedback. The work contains no load-bearing self-citations, uniqueness theorems, or ansatzes that would render the outcome tautological by construction, making the demonstration self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central observation rests on the domain assumption that LLMs can interpret and act upon explicit scalar reward signals embedded in growing context; no free parameters or new physical entities are introduced.

axioms (1)

domain assumption: Concatenating prior responses and their scalar rewards in the prompt enables the LLM to improve subsequent responses in a manner analogous to reinforcement learning.
This premise is invoked to interpret the observed performance gains as in-context RL.

invented entities (1)

ICRL prompting framework · no independent evidence
purpose: To structure multi-round interaction so that LLMs exhibit reward optimization during inference
New prompting procedure introduced by the authors; no independent falsifiable evidence outside the reported experiments is provided.

pith-pipeline@v0.9.0 · 5776 in / 1272 out tokens · 47065 ms · 2026-05-22T13:30:49.292648+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation · washburn_uniqueness_aczel · echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

"the LLM is able to maximize the scalar reward signal during the inference time, just like an RL algorithm... the scalar reward signal is the only feedback we provide"

IndisputableMonolith/Foundation/AbsoluteFloorClosure · absolute_floor_iff_bare_distinguishability · echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

"performance drop when the reward is absent... performance drop with short context"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Continual Harness: Online Adaptation for Self-Improving Foundation Agents
cs.LG · 2026-05 · conditional · novelty 8.0

Continual Harness automates online self-improvement for foundation-model embodied agents by refining prompts, sub-agents, skills, and memory within one run, cutting button-press costs on Pokemon Red and Emerald and cl...

Self-Distilled Reinforcement Learning for Co-Evolving Agentic Recommender Systems
cs.IR · 2026-04 · unverdicted · novelty 6.0

CoARS enables co-evolving recommender and user agents by using interaction-derived rewards and self-distilled credit assignment to internalize multi-turn feedback into model parameters, outperforming prior agentic baselines.

When Do We Need LLMs? A Diagnostic for Language-Driven Bandits
cs.AI · 2026-04 · unverdicted · novelty 6.0

Lightweight numerical bandits on text embeddings match or exceed LLM accuracy in contextual bandits at a fraction of the cost, with an embedding-based diagnostic to choose between them.

A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence
cs.AI · 2025-07 · accept · novelty 4.0

The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.