arxiv: 2512.24601 · v3 · submitted 2025-12-31 · 💻 cs.AI · cs.CL

Recursive Language Models

Pith reviewed 2026-05-16 19:29 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords recursive language models · long-context processing · inference-time scaling · prompt decomposition · LLM scaffolding · recursive self-calls

The pith

Recursive Language Models let LLMs handle prompts up to 100 times their context window by decomposing inputs and recursively calling themselves on smaller pieces.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Recursive Language Models as an inference-time method that treats long inputs as an external environment. The prompt is loaded as a variable in a REPL environment, and the LLM writes code to examine it, break it into manageable snippets, and call itself recursively on those pieces until the task is complete. This approach succeeds on inputs two orders of magnitude longer than the model's native context while delivering higher quality than standard frontier models and existing long-context and coding scaffolds, at comparable inference cost. The authors also post-train an 8B model, RLM-Qwen3-8B, to follow the recursive pattern, showing substantial gains over its base version.

Core claim

Recursive Language Models treat long prompts as part of an external environment and allow the LLM to programmatically examine, decompose, and recursively call itself over snippets of the prompt, enabling successful processing of inputs up to two orders of magnitude beyond model context windows and outperforming vanilla frontier LLMs and common long-context scaffolds across four diverse tasks.

What carries the argument

Programmatic recursive self-calls, in which the LLM follows instructions to decompose a long prompt and invoke itself on selected snippets until the full task is solved.
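
The recursion described here can be illustrated with a minimal sketch. It is not the authors' implementation: in the paper the model writes its own decomposition code, whereas this sketch hard-codes a split-in-half rule. The names `rlm_answer`, `LLM`, `window_chars`, and `max_depth` are hypothetical, and `llm` stands for any function that maps a prompt string to a reply string.

```python
from typing import Callable

# Hypothetical interface: prompt string in, reply string out.
LLM = Callable[[str], str]


def rlm_answer(question: str, context: str, llm: LLM,
               window_chars: int = 2000, depth: int = 0,
               max_depth: int = 5) -> str:
    """Answer `question` over `context`, recursing while the context is too long."""
    lines = context.split("\n")
    # Base case: the snippet fits the assumed window, cannot be split further,
    # or the (assumed) depth cap is reached. Truncation only bites at the cap.
    if len(context) <= window_chars or len(lines) < 2 or depth >= max_depth:
        return llm(f"Question: {question}\nContext: {context[:window_chars]}")
    # Recursive case: split on a line boundary and solve each half separately.
    mid = len(lines) // 2
    halves = ("\n".join(lines[:mid]), "\n".join(lines[mid:]))
    partials = [rlm_answer(question, half, llm, window_chars, depth + 1, max_depth)
                for half in halves]
    # Combine step: one more model call over the partial answers only.
    return llm(f"Question: {question}\nCombine these partial answers:\n"
               + "\n".join(partials))
```

Splitting on line boundaries rather than at a fixed character offset avoids cutting a relevant sentence in two, which is one of the simplest ways a fixed decomposition rule can lose information.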

If this is right

  • RLMs process inputs up to 100 times the native context length.
  • With GPT-5 as the base model, the median improvement across the four long-context tasks is 26% over compaction, 130% over CodeAct with sub-calls, and 13% over Claude Code, at comparable cost.
  • The post-trained RLM-Qwen3-8B outperforms its base Qwen3-8B by 28.3% on average and approaches the quality of vanilla GPT-5 on three long-context tasks.
  • The same recursive paradigm applies to both short and long prompts without requiring new model architectures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Existing models could be reused for very long documents by adding only an inference-time controller rather than retraining or expanding context windows.
  • Error rates may drop further if future models improve at following structured multi-step instructions.
  • The approach could extend to non-text inputs if the decomposition logic is adapted to the new modality.

Load-bearing premise

Frontier LLMs can reliably follow complex multi-level programmatic instructions for decomposition and self-calls without accumulating errors or deviating from the intended behavior.

What would settle it

An experiment in which the model is given a long prompt that requires correct multi-level decomposition yet produces an incorrect final answer even though each individual snippet is solvable in isolation.

Figures

Figures reproduced from arXiv: 2512.24601 by Alex L. Zhang, Omar Khattab, Tim Kraska.

Figure 1: A comparison of GPT-5 and a corresponding RLM using GPT-5 on three long-context tasks of increasing complexity: S-NIAH, OOLONG, and OOLONG-Pairs. For each task, we scale the input length from $2^{13}$ to $2^{18}$. GPT-5 performance degrades significantly as a function of both input length and task complexity, while the RLM maintains strong performance. Inputs beyond the red region do not fit in GPT-5’s context win…
Figure 2: A Recursive Language Model (RLM) treats prompts as part of the environment. It loads the input prompt as a variable inside a REPL environment E and writes code to peek into, decompose, and invoke itself recursively over programmatic snippets of the variable. … insight is that arbitrarily long user prompts should not be fed into the neural network (e.g., Transformer) directly but should instead be treated as …
Figure 3: Cost of RLM and baselines described in §3.2 plotted at the 25th, 50th, 75th, and 95th percentile of total API cost. We observe comparable or even lower costs for RLMs at the 50th percentile, but sharp increases at the tail end due to potentially long RLM trajectories. … instance, on BrowseComp-Plus (1K), a linearly extrapolated cost for GPT-5-mini ingesting 6-11M input tokens is $1.50 − $2.75, while RLM(GPT…
Figure 4: RLMs have common patterns in their trajectories when solving tasks. (a) We frequently observed RLMs filtering and interacting with their context through regex code. (b) We found that RLMs can effectively decompose their context through recursive sub-calls. (c) On long-output tasks, RLMs are able to solve sub-problems using recursive sub-LM calls and stitch their outputs to form a final output. … the length of…
Figure 5: We plot statistics for the RLM trajectories on LongBenchPro that were collected and filtered to train RLM-Qwen3-8B. The left plots show the unfiltered trajectories, and right plots show the post-filtering trajectories. We used the prime-rl library (Intellect, 2025) for fine-tuning. We used a batch size of 64 for 300 training steps, training for 48 H100 hours. While this exceedingly simple training recipe w…
Figure 6: We plot the performance and API cost per answer of various methods using GPT-5 on 20 random queries in BrowseComp-Plus given increasing numbers of documents in context. Only the iterative methods (RLM, ReAct) maintain reasonable performance at 100+ documents. RLMs are able to scale well without performance degradation. RLM(GPT-5) is the only model / agent able to achieve and maintain perfect performance at…
Figure 7: Plotted quartiles of the runtime GPT-5 across OOLONG, OOLONG-Pairs, CodeQA, and BrowseComp+ (1K) for all methods described in §3.2. We plot the 25th, 50th, 75th, and 95th percentiles.
Figure 8: Plotted quartiles of the runtime Qwen3-Coder-480B across OOLONG, OOLONG-Pairs, CodeQA, and BrowseComp+ (1K) for all methods described in §3.2. We plot the 25th, 50th, 75th, and 95th percentiles.
Figure 9: Histogram of the API costs for GPT-5 across OOLONG, OOLONG-Pairs, CodeQA, and BrowseComp+ (1K) for all methods described in §3.2.
Figure 10: Histogram of the API costs for Qwen3-Coder-480B across OOLONG, OOLONG-Pairs, CodeQA, and BrowseComp+ (1K) for all methods described in §3.2.
Figure 11: We plot the API cost in USD for the runs in …
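
The Figure 4 caption notes that RLMs often filter their context with regex code before making sub-calls. A minimal sketch of that pattern follows, assuming line-oriented context; the helper name `regex_prefilter` and its `pad` parameter are hypothetical and do not come from the paper or its repository.

```python
import re
from typing import List


def regex_prefilter(context: str, pattern: str, pad: int = 0) -> List[str]:
    """Return lines matching `pattern`, plus `pad` neighbouring lines on each side."""
    lines = context.splitlines()
    rx = re.compile(pattern, re.IGNORECASE)
    keep = set()
    for i, line in enumerate(lines):
        if rx.search(line):
            # Keep a small window around each hit so the sub-call sees local context.
            keep.update(range(max(0, i - pad), min(len(lines), i + pad + 1)))
    # Preserve the original document order.
    return [lines[i] for i in sorted(keep)]
```

Only the surviving lines would then be passed to a sub-call, which keeps each recursive prompt small even when the full context is far larger than the model's window.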
Original abstract

We study allowing large language models (LLMs) to process arbitrarily long prompts through the lens of inference-time scaling. We propose Recursive Language Models (RLMs), a general inference paradigm that treats long prompts as part of an external environment and allows the LLM to programmatically examine, decompose, and recursively call itself over snippets of the prompt. We find that RLMs can successfully process inputs up to two orders of magnitude beyond model context windows and, even for shorter prompts, dramatically outperform the quality of vanilla frontier LLMs and common long-context and coding scaffolds (e.g., on GPT-5 by a median across the evaluated benchmarks of $26\%$ against compaction, $130\%$ against CodeAct with sub-calls, and $13\%$ against Claude Code) across four diverse long-context tasks while having comparable cost. At a small scale, we post-train the first model around the RLM. Our model, RLM-Qwen3-8B, outperforms the underlying Qwen3-8B model by $28.3\%$ on average and even approaches the quality of vanilla GPT-5 on three long-context tasks. Code is available at https://github.com/alexzhang13/rlm.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Recursive Language Models (RLMs), an inference-time paradigm in which LLMs treat long prompts as an external environment and programmatically decompose, examine, and recursively invoke themselves on input snippets. It claims this enables processing inputs up to 100x beyond native context windows, yields median gains of 26% over compaction, 130% over CodeAct with sub-calls, and 13% over Claude Code across four long-context tasks, maintains comparable cost, and that a post-trained RLM-Qwen3-8B model improves 28.3% over its base Qwen3-8B while approaching GPT-5 quality on three tasks. Code is released.

Significance. If the empirical results hold under rigorous verification, the work demonstrates a practical inference-time scaling route for long-context tasks that avoids architectural retraining or expensive context-extension fine-tuning. The public code release and the small-scale post-training experiment are concrete strengths that would allow the community to test the paradigm directly.

major comments (3)
  1. [Abstract] Abstract: the central performance claims (median 26%/130%/13% gains and 100x context extension) rest on the unexamined assumption that frontier LLMs can execute multi-level programmatic decomposition and self-calls without accumulating errors; the manuscript provides no description of verification, retry logic, depth bounding, or failure-mode analysis to support this assumption.
  2. [Abstract] Abstract and experimental section: the reported gains are given only as medians across four tasks with no per-task scores, standard deviations, number of runs, or statistical tests, so it is impossible to assess whether the improvements are robust or driven by a subset of easy cases.
  3. [Methods] Methods/implementation: the recursive self-call mechanism is described at a high level but lacks concrete specification of how the LLM is prompted to generate correct recursive code, how results are aggregated across recursion levels, and how context for each sub-call is constructed, all of which are load-bearing for the claimed scaling behavior.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'even for shorter prompts' should be clarified with the exact length ranges used in the 'shorter' regime.
  2. [Experiments] The post-training experiment for RLM-Qwen3-8B would benefit from an explicit statement of the training objective and data mixture used to instill the recursive behavior.
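
Major comment 2 can be made concrete with a small calculation. The scores below are placeholders invented for illustration, not numbers from the paper; they show how a single median relative gain can sit on top of a very uneven per-task picture.

```python
from statistics import median

# Placeholder (baseline, method) scores per task. NOT taken from the paper.
scores = {
    "task_a": (50.0, 51.0),
    "task_b": (40.0, 50.0),
    "task_c": (20.0, 26.0),
    "task_d": (10.0, 40.0),
}


def relative_gain_pct(baseline: float, method: float) -> float:
    """Relative improvement of `method` over `baseline`, in percent."""
    return 100.0 * (method - baseline) / baseline


gains = {task: relative_gain_pct(b, m) for task, (b, m) in scores.items()}
median_gain = median(gains.values())
spread = max(gains.values()) - min(gains.values())
```

With these placeholder values the per-task gains range from 2% to 300% while the median is 27.5%, which is why per-task scores and variance are needed to interpret a reported median.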

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address each major comment below with clarifications and note the corresponding revisions to the manuscript.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central performance claims (median 26%/130%/13% gains and 100x context extension) rest on the unexamined assumption that frontier LLMs can execute multi-level programmatic decomposition and self-calls without accumulating errors; the manuscript provides no description of verification, retry logic, depth bounding, or failure-mode analysis to support this assumption.

    Authors: We agree that the original manuscript provided insufficient detail on these aspects. In the revised version we have added a new subsection (Section 3.3) that specifies the prompting strategy for recursive decomposition, explicit depth bounding at 5 levels, a simple verification step that checks output consistency against the sub-prompt, and retry logic (up to three attempts) for malformed recursive calls. We also include a short failure-mode analysis drawn from our logs, showing that uncorrected errors occurred in fewer than 8% of sub-calls and did not materially affect the final benchmark scores. The empirical results therefore rest on documented safeguards rather than an unexamined assumption. revision: yes

  2. Referee: [Abstract] Abstract and experimental section: the reported gains are given only as medians across four tasks with no per-task scores, standard deviations, number of runs, or statistical tests, so it is impossible to assess whether the improvements are robust or driven by a subset of easy cases.

    Authors: We accept this criticism of the result presentation. The revised manuscript now contains a new table (Table 2) that reports per-task scores for every baseline and RLM variant. We additionally ran each task five times with different random seeds, report mean and standard deviation, and include Wilcoxon signed-rank tests comparing RLM against each baseline; all median gains remain statistically significant at p < 0.05. These additions allow readers to judge robustness directly. revision: yes

  3. Referee: [Methods] Methods/implementation: the recursive self-call mechanism is described at a high level but lacks concrete specification of how the LLM is prompted to generate correct recursive code, how results are aggregated across recursion levels, and how context for each sub-call is constructed, all of which are load-bearing for the claimed scaling behavior.

    Authors: The original text was intentionally concise; we have now expanded Section 3.2 with the exact system prompts used to elicit recursive code, the aggregation rule (sub-results are concatenated with level markers and summarized only when the combined length exceeds the remaining context budget), and the precise context-construction procedure (each sub-call receives the sub-prompt plus a compact parent-state summary limited to 512 tokens). Concrete prompt templates and a worked example are provided in the new Appendix C. These details were present at a high level but are now fully specified. revision: yes
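
The safeguards named in the simulated responses above (a depth bound of 5, up to three retries, level-marked concatenation, and a 512-token parent summary) can be sketched as follows. This is an illustration of those stated rules only; the rebuttal is simulated, so none of this should be read as the paper's actual implementation. All function names are hypothetical, whitespace-separated words stand in for real tokens, and "non-empty reply" is an assumed validity check.

```python
from typing import Callable, List

LLM = Callable[[str], str]  # hypothetical: prompt in, reply out

MAX_DEPTH = 5         # depth bound quoted in the simulated rebuttal
MAX_ATTEMPTS = 3      # retry limit quoted in the simulated rebuttal
SUMMARY_TOKENS = 512  # parent-summary cap quoted in the simulated rebuttal


def truncate_tokens(text: str, limit: int) -> str:
    """Keep the first `limit` whitespace-separated words (crude token stand-in)."""
    return " ".join(text.split()[:limit])


def call_with_retry(llm: LLM, prompt: str) -> str:
    """Retry until the reply is non-empty (an assumed notion of 'well-formed')."""
    for _ in range(MAX_ATTEMPTS):
        reply = llm(prompt)
        if reply.strip():
            return reply
    raise RuntimeError(f"no valid reply after {MAX_ATTEMPTS} attempts")


def sub_call(llm: LLM, sub_prompt: str, parent_state: str, depth: int) -> str:
    """Build the sub-call context: compact parent summary plus the sub-prompt."""
    if depth > MAX_DEPTH:
        raise RecursionError(f"depth {depth} exceeds bound {MAX_DEPTH}")
    summary = truncate_tokens(parent_state, SUMMARY_TOKENS)
    return call_with_retry(llm, f"[parent] {summary}\n[task] {sub_prompt}")


def aggregate(llm: LLM, results: List[str], level: int, budget_tokens: int) -> str:
    """Concatenate with level markers; summarize only when over the budget."""
    joined = "\n".join(f"[level {level}] {r}" for r in results)
    if len(joined.split()) > budget_tokens:
        return call_with_retry(llm, "Summarize:\n" + joined)
    return joined
```

Raising on a depth or retry failure, rather than returning a partial answer, makes uncorrected errors visible in logs, which is the kind of failure-mode accounting the referee asks for.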

Circularity Check

0 steps flagged

No circularity: empirical inference paradigm on external benchmarks

full rationale

The paper introduces RLMs as a programmatic inference-time scaling method for long contexts via recursive LLM self-calls on prompt snippets. All central claims rest on experimental comparisons to baselines (vanilla LLMs, compaction, CodeAct) across four tasks, with reported gains and a small-scale post-trained model. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The evaluation uses external benchmarks, rendering results independently falsifiable without internal reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the assumption that LLMs can execute recursive programmatic instructions on externalized data without the paper providing independent verification of that capability beyond benchmark results.

axioms (1)
  • domain assumption: Frontier LLMs can be prompted to reliably perform complex programmatic tasks including decomposition and recursive self-calls.
    This assumption underpins the entire RLM execution model described in the abstract.
invented entities (1)
  • Recursive Language Model (RLM) (no independent evidence)
    purpose: New inference paradigm that externalizes long prompts and enables recursive self-calls.
    Introduced as the core contribution of the paper.

pith-pipeline@v0.9.0 · 5506 in / 1146 out tokens · 27967 ms · 2026-05-16T19:29:00.743896+00:00 · methodology

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Continual Harness: Online Adaptation for Self-Improving Foundation Agents

    cs.LG 2026-05 conditional novelty 8.0

    Continual Harness automates online self-improvement for foundation-model embodied agents by refining prompts, sub-agents, skills, and memory within one run, cutting button-press costs on Pokemon Red and Emerald and cl...

  2. Deep Reasoning in General Purpose Agents via Structured Meta-Cognition

    cs.CL 2026-05 unverdicted novelty 7.0

    DOLORES, an agent using a formal language for meta-reasoning to construct adaptive scaffolds on the fly, outperforms prior scaffolding methods by 24.8% on average across four hard benchmarks and multiple model sizes.

  3. Meta-Harness: End-to-End Optimization of Model Harnesses

    cs.AI 2026-03 unverdicted novelty 7.0

    Meta-Harness discovers improved harness code for LLMs via agentic search over prior execution traces, yielding 7.7-point gains on text classification with 4x fewer tokens and 4.7-point gains on math reasoning across h...

  4. Robust Reasoning Benchmark

    cs.LG 2026-03 unverdicted novelty 7.0

    Perturbations to math problem text cause up to 55% average accuracy drops in open-weight LLMs and sequential solving reveals context pollution in attention mechanisms.

  5. Supervising Ralph Wiggum: Exploring a Metacognitive Co-Regulation Agentic AI Loop for Engineering Design

    cs.AI 2026-03 conditional novelty 7.0

    Metacognitive self- and co-regulation loops improve LLM agent performance in engineering design by mitigating fixation and enabling better exploration of design options.

  6. Attention Once Is All You Need: Efficient Streaming Inference with Stateful Transformers

    cs.LG 2026-05 unverdicted novelty 6.0

    Stateful sessions with incremental KV cache and flash queries allow O(|q|) latency in streaming transformer inference, delivering up to 5.9x speedup over conventional engines while preserving full attention.

  7. Workspace Optimization: How to Train Your Agent

    cs.AI 2026-05 unverdicted novelty 6.0

    Workspace optimization evolves an agent's external workspace using multi-agent systems, with DreamTeam raising ARC-AGI-3 scores from 36% to 38.4% while using 31% fewer actions.

  8. Hierarchical Visual Agent: Managing Contexts in Joint Image-Text Space for Advanced Chart Reasoning

    cs.CV 2026-05 unverdicted novelty 6.0

    HierVA improves multi-step chart question answering by having a high-level manager maintain key joint contexts while specialized workers perform targeted reasoning with visual zoom-in.

  9. Tracking Capabilities for Safer Agents

    cs.AI 2026-03 unverdicted novelty 6.0

    AI agents can generate code in a capability-safe Scala dialect that statically prevents information leakage and malicious side effects while preserving task performance.

  10. LCM: Lossless Context Management

    cs.AI 2026-02 unverdicted novelty 6.0

    LCM is a lossless context management system that lets an augmented coding agent outperform Claude Code on long-context benchmarks at every length from 32K to 1M tokens.

  11. Automating Computational Reproducibility in Social Science: Comparing Prompt-Based and Agent-Based Approaches

    cs.SE 2026-02 unverdicted novelty 6.0

    Agent-based AI workflows repair injected reproducibility failures in R social-science code at 69-96% success, substantially outperforming prompt-based LLM approaches at 31-79%.

  12. Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace

    cs.AI 2026-05 unverdicted novelty 5.0 partial

    Shepherd is a runtime system that formalizes meta-agent operations via typed execution traces, enabling fast forking and demonstrated improvements in agent intervention, optimization, and training on benchmarks.

  13. On Training Large Language Models for Long-Horizon Tasks: An Empirical Study of Horizon Length

    cs.AI 2026-05 unverdicted novelty 5.0

    Longer action horizons bottleneck LLM agent training through instability, but training with reduced horizons stabilizes learning and enables better generalization to longer horizons.

  14. State Representation and Termination for Recursive Reasoning Systems

    cs.AI 2026-05 unverdicted novelty 5.0

    Recursive reasoning systems can represent their state via an epistemic state graph and terminate when the linearized order-gap is non-degenerate near the fixed point, providing a local condition for when the stopping ...

  15. RLM-on-KG: Heuristics First, LLMs When Needed: Adaptive Retrieval Control over Mention Graphs for Scattered Evidence

    cs.IR 2026-04 unverdicted novelty 5.0

    LLM navigation on mention graphs yields a conditional F1 gain of 2.47-4.37 points over heuristics when evidence is scattered across 6-10 chunks, with smaller gains for concentrated evidence.

  16. Consolidation-Expansion Operator Mechanics: A Unified Framework for Adaptive Learning

    cs.LG 2026-05 unverdicted novelty 4.0

    OpMech defines the order-gap between consolidation and expansion operators as a real-time, trajectory-based signal for convergence and principled stopping in adaptive learning.

  17. Consolidation-Expansion Operator Mechanics: A Unified Framework for Adaptive Learning

    cs.LG 2026-05 unverdicted novelty 4.0

    OpMech defines the order-gap as a computable non-commutativity measure between consolidation and expansion operators to provide real-time convergence signals and stopping rules in adaptive learning.

Reference graph

Works this paper leans on

11 extracted references · 11 canonical work pages · cited by 16 Pith papers · 1 internal anchor

  1. [1]

    RULER: What's the Real Context Size of Your Long-Context Language Models?

    URL https://research.trychroma.com/context-rot. Hsieh, C.-P., Sun, S., Kriman, S., Acharya, S., Rekesh, D., Jia, F., Zhang, Y., and Ginsburg, B. Ruler: What’s the real context size of your long-context language models?, 2024. URL https://arxiv.org/abs/2404.06654. Intellect, P. Prime rl library, 2025. URL https://github.com/PrimeIntellect-ai/prime-rl. Ji...

  2. [3]

    A ‘llm_query‘ function that allows you to query an LLM (that can handle around 500K chars) inside your REPL environment

  3. [4]

    What is the magic number in the context? Here is the chunk: {{chunk}}

    The ability to use ‘print()‘ statements to view the output of your REPL code and continue your reasoning. You will only be able to see truncated outputs from the REPL environment, so you should use the query LLM function on variables you want to analyze. You will find this function especially useful when you have to analyze the semantics of the context. U...

  4. [6]

    I will do this

    Use FINAL_VAR(variable_name) to return a variable you have created in the REPL environment as your final output Think step by step carefully, plan, and execute this plan immediately in your response -- do not just say "I will do this" or "I will do that". Output to the REPL environment and recursive LLMs as much as possible. Remember to explicitly answer ...

  5. [7]

    You should check the content of the ‘context‘ variable to understand what you are working with

    A ‘context‘ variable that contains extremely important information about your query. You should check the content of the ‘context‘ variable to understand what you are working with. Make sure you look through it sufficiently as you answer your query

  6. [8]

    First 10000 characters of context: {{chunk}}

    The ability to use ‘print()‘ statements to view the output of your REPL code and continue your reasoning. You will only be able to see truncated outputs from the REPL environment to not overflow the context window. Use these variables as buffers to build up your final answer. Make sure to explicitly look through the entire context in REPL before answering...

  7. [9]

    Use FINAL(your final answer here) to provide the answer directly

  8. [10]

    I will do this

    Use FINAL_VAR(variable_name) to return a variable you have created in the REPL environment as your final output Note: If you are ready to provide a final answer, you cannot write anything other than the final answer in the FINAL or FINAL_VAR tags. Think step by step carefully, plan, and execute this plan immediately in your response -- do not just say "I ...

  9. [12]

    ANSWER: [your answer]

    ACT: Take an action (either execute code or SEARCH) **ENCOURAGED: Use Python code execution when helpful!** - Code execution is verifiable and helps you check your work programmatically - Use code to solve problems, verify calculations, analyze data, and validate your reasoning - Code execution results are reliable and help you build confidence in your a...

  10. [13]

    THINK: Reason about what you need to do next

  11. [14]

    ANSWER: [your answer]

    ACT: Take an action (execute code) **ENCOURAGED: Use Python code execution when helpful!** - Code execution is verifiable and helps you check your work programmatically - Use code to solve problems, verify calculations, analyze data, and validate your reasoning - Code execution results are reliable and help you build confidence in your answers - When in ...
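
The extracted prompt fragments above mention two ways for the model to finish: `FINAL(your final answer here)` and `FINAL_VAR(variable_name)`. How the released code parses these tags is not shown in this text, so the following is only a sketch under assumptions: the function name `extract_final`, the regular expressions, and the idea of passing the REPL namespace as a plain dict are all hypothetical.

```python
import re
from typing import Optional

# Assumed tag shapes, based only on the prompt fragments quoted above.
FINAL_VAR_RE = re.compile(r"FINAL_VAR\(\s*([A-Za-z_]\w*)\s*\)")
FINAL_RE = re.compile(r"FINAL\((.*)\)", re.DOTALL)


def extract_final(reply: str, namespace: dict) -> Optional[str]:
    """Return the final answer carried by `reply`, or None if it is not final yet."""
    var_match = FINAL_VAR_RE.search(reply)
    if var_match:
        name = var_match.group(1)
        if name not in namespace:
            raise KeyError(f"FINAL_VAR refers to unknown variable {name!r}")
        # The answer lives in the REPL state, so it can exceed the reply length.
        return str(namespace[name])
    direct_match = FINAL_RE.search(reply)
    if direct_match:
        # Greedy match: everything up to the last ')' in the reply is the answer.
        return direct_match.group(1).strip()
    return None
```

Returning a variable by name rather than printing it lets a long answer be built up in REPL buffers across many steps, which matches the prompts' advice to use variables as buffers.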