arXiv: 2602.17547 · v3 · submitted 2026-02-19 · 💻 cs.AI · cs.CL

KLong: Training LLM Agent for Extremely Long-horizon Tasks

Yue Liu, Yingwei Ma, Yibo Miao, Yanhao Li, Yuchong Xie, Xinlong Yang, Zhiyuan Hu, Flood Sung, Jiaheng Zhang, Bryan Hooi

Pith reviewed 2026-05-15 20:45 UTC · model grok-4.3

keywords LLM agent · long-horizon tasks · trajectory splitting · supervised fine-tuning · progressive reinforcement learning · PaperBench

The pith

KLong shows that trajectory-splitting SFT followed by progressive RL lets a 106B agent outperform a 1T model on extremely long-horizon tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to establish that a base LLM can acquire strong performance on tasks requiring sustained action over very long sequences through a two-part training process. The first part uses supervised fine-tuning on split versions of extended trajectories to activate basic agent skills while preserving early context and adding overlaps between segments. The second part applies reinforcement learning in successive stages that gradually lengthen the allowed task duration. A sympathetic reader would care because the approach suggests a concrete path for building capable open-source agents that do not depend on the largest available model sizes.

Core claim

The central claim is that KLong (106B), trained first with trajectory-splitting supervised fine-tuning on thousands of long-horizon trajectories and then with progressive reinforcement learning across multiple stages of increasing timeouts, surpasses the 1T-parameter Kimi K2 Thinking by 11.28% on PaperBench, and that the gains generalize to other coding benchmarks.

What carries the argument

Trajectory-splitting SFT, which preserves early context, progressively truncates later parts, and maintains sub-trajectory overlap, paired with progressive RL that schedules training into stages with extended timeouts.
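The abstract gives no splitting algorithm, so the following is only a minimal sketch of one possible reading of "preserve early context, truncate later context, keep overlap". The function name `split_trajectory` and the parameters `head`, `window`, and `overlap` are hypothetical and do not come from the paper.

```python
from typing import List, Sequence


def split_trajectory(
    turns: Sequence[str],
    head: int = 2,     # assumption: leading turns copied into every segment
    window: int = 4,   # assumption: later turns carried by each segment
    overlap: int = 1,  # assumption: turns shared by consecutive segments
) -> List[List[str]]:
    """Split one long trajectory into overlapping sub-trajectories.

    Each segment is the first `head` turns (early context) followed by a
    window over the remaining turns; everything outside the window is
    dropped. Consecutive windows share `overlap` turns.
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    prefix = list(turns[:head])
    body = list(turns[head:])
    if not body:
        return [prefix]

    step = window - overlap
    segments: List[List[str]] = []
    start = 0
    while True:
        segments.append(prefix + body[start:start + window])
        if start + window >= len(body):
            break  # the last window already reaches the end of the trajectory
        start += step
    return segments


if __name__ == "__main__":
    # Toy trajectory of ten turns; real trajectories would hold messages or tool calls.
    for segment in split_trajectory([f"t{i}" for i in range(10)]):
        print(segment)
```

In this sketch every segment fits a fixed budget of `head + window` turns, which is the property a splitting scheme needs when full trajectories exceed the training context length. Whether the paper splits by turns or by tokens is not stated in the abstract.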

If this is right

The performance advantage transfers to other coding benchmarks such as SWE-bench Verified and MLE-bench.
Moderate-sized models can sustain agentic behavior over task lengths that previously favored much larger models.
Staged training with lengthening timeouts allows capability to scale with horizon without immediate degradation.
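To make "stages with progressively extended timeouts" concrete, here is a small sketch of a staged timeout schedule. The geometric growth rule, the `Stage` class, and all numeric values are assumptions chosen for illustration; the abstract does not state the number of stages or the timeout values.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Stage:
    index: int
    timeout_minutes: float  # per-task time limit enforced during this stage


def timeout_schedule(
    initial_minutes: float, growth: float, num_stages: int
) -> List[Stage]:
    """Build RL stages whose task timeout grows geometrically (assumed rule)."""
    if initial_minutes <= 0 or growth <= 1.0 or num_stages < 1:
        raise ValueError("need initial_minutes > 0, growth > 1, num_stages >= 1")
    return [
        Stage(index=i, timeout_minutes=initial_minutes * growth ** i)
        for i in range(num_stages)
    ]


if __name__ == "__main__":
    # Hypothetical values: three stages, doubling the timeout each time.
    for stage in timeout_schedule(initial_minutes=30.0, growth=2.0, num_stages=3):
        print(f"stage {stage.index}: timeout {stage.timeout_minutes} min")
```

A training loop would run RL to convergence (or for a fixed budget) under each stage's timeout before moving to the next, so the policy first learns on rollouts that finish quickly and only later faces the longest horizons.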

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the recipe works beyond the reported benchmarks, developers could achieve long-horizon agents with fewer total parameters by emphasizing trajectory structure over raw model scale.
Applying the same splitting and staging pattern to non-coding domains such as multi-step scientific planning would test whether the gains depend on the nature of the tasks.
Removing the distillation from a stronger model and training only on self-generated trajectories would reveal whether the initial data quality is essential or if the method can bootstrap from weaker starting points.

Load-bearing premise

The distilled trajectories supply enough signal that the splitting step and staged RL training can extend context handling without hidden collapse on long sequences.

What would settle it

Evaluating the trained model on tasks whose required horizon length greatly exceeds the longest training trajectories and finding that its success rate falls below the larger baseline or drops sharply after a fixed length.
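A check of this kind could be sketched as follows, assuming each evaluation run is recorded as a (required horizon, success) pair. The bucket size, the drop threshold, and the data in the usage example are made up for illustration and are not results from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple


def success_by_horizon(
    results: Iterable[Tuple[int, bool]], bucket_size: int
) -> Dict[int, float]:
    """Map each horizon bucket's lower bound to the success rate inside it."""
    totals: Dict[int, int] = defaultdict(int)
    wins: Dict[int, int] = defaultdict(int)
    for horizon, succeeded in results:
        bucket = (horizon // bucket_size) * bucket_size
        totals[bucket] += 1
        wins[bucket] += int(succeeded)
    return {b: wins[b] / totals[b] for b in sorted(totals)}


def first_sharp_drop(rates: Dict[int, float], min_drop: float) -> Optional[int]:
    """Return the first bucket whose rate falls by at least `min_drop`
    relative to the previous bucket, or None if no such drop occurs."""
    buckets = sorted(rates)
    for prev, cur in zip(buckets, buckets[1:]):
        if rates[prev] - rates[cur] >= min_drop:
            return cur
    return None


if __name__ == "__main__":
    # Fabricated toy runs: (horizon in steps, task solved?).
    runs = [(50, True), (80, True), (120, True), (150, False), (250, False)]
    rates = success_by_horizon(runs, bucket_size=100)
    print(rates, first_sharp_drop(rates, min_drop=0.4))
```

Running the same two functions on the larger baseline's results would show whether the trained model's curve falls below the baseline's at horizons beyond the longest training trajectories.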

Original abstract

This paper introduces KLong, an open-source LLM agent trained to solve extremely long-horizon tasks. The principle is to first cold-start the model via trajectory-splitting SFT, then scale it via progressive RL training. Specifically, we first activate basic agentic abilities of a base model with a comprehensive SFT recipe. Then, we introduce Research-Factory, an automated pipeline that generates high-quality training data by collecting research papers and constructing evaluation rubrics. Using this pipeline, we build thousands of long-horizon trajectories distilled from Claude 4.5 Sonnet (Thinking). To train with these extremely long trajectories, we propose a new trajectory-splitting SFT, which preserves early context, progressively truncates later context, and maintains overlap between sub-trajectories. In addition, to further improve long-horizon task-solving capability, we propose a novel progressive RL, which schedules training into multiple stages with progressively extended timeouts. Experiments demonstrate the superiority and generalization of KLong, as shown in Figure 1. Notably, our proposed KLong (106B) surpasses Kimi K2 Thinking (1T) by 11.28% on PaperBench, and the performance improvement generalizes to other coding benchmarks like SWE-bench Verified and MLE-bench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The abstract outlines a splitting SFT plus progressive RL pipeline for long-horizon agents with reported gains over a larger model, but the lack of implementation and evaluation details leaves the claims hard to assess.

The abstract presents KLong as a 106B model trained first with trajectory-splitting supervised fine-tuning, then scaled with progressive reinforcement learning that extends timeouts across stages. It reports that this beats Kimi K2 Thinking (1T) by 11.28% on PaperBench and generalizes to SWE-bench Verified and MLE-bench. The core idea is to handle extremely long trajectories by splitting them while preserving early context and overlap, plus using an automated Research-Factory pipeline to distill data from Claude 4.5 Sonnet on research papers with rubrics.

Referee Report

3 major / 1 minor

Summary. The paper introduces KLong, an open-source LLM agent for extremely long-horizon tasks. It first cold-starts a base model via trajectory-splitting SFT on data generated by the Research-Factory pipeline (distilling thousands of trajectories from Claude 4.5 Sonnet), then scales via progressive RL with multiple stages of increasing timeouts. The central empirical claim is that KLong (106B) outperforms Kimi K2 Thinking (1T) by 11.28% on PaperBench, with the gains generalizing to SWE-bench Verified and MLE-bench.

Significance. If the performance gains and generalization hold under rigorous validation, the work would represent a meaningful advance in training LLM agents for long-horizon tasks by showing how automated distillation pipelines and staged RL can be combined effectively. The open-source framing and cross-benchmark claims are potentially valuable contributions to the field.

major comments (3)

Abstract: The 11.28% improvement of KLong (106B) over Kimi K2 Thinking (1T) on PaperBench is presented without any description of the experimental setup, baselines, number of trials, statistical significance, or controls for confounds arising from the Claude 4.5 Sonnet distillation process.
Abstract: The trajectory-splitting SFT is described only at a high level (preserving early context, progressive truncation, overlap between sub-trajectories); no specifics on the splitting algorithm, overlap size, truncation points, or how long-horizon dependencies are maintained are provided, which is load-bearing for the core methodological claim.
Abstract: The progressive RL schedule (multiple stages with extended timeouts) lacks any details on stage durations, timeout progression, hyperparameters, or ablations isolating the contribution of each component, preventing assessment of the generalization claims to SWE-bench Verified and MLE-bench.
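As an illustration of the kind of evidence the first comment asks for, a percentile bootstrap over per-run scores is one common way to attach an interval to a difference between two systems. The sketch below is a generic technique, not the paper's protocol; the function name and the scores in the usage example are hypothetical.

```python
import random
from statistics import fmean
from typing import Sequence, Tuple


def bootstrap_diff_ci(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for mean(scores_a) - mean(scores_b).

    Each system's runs are resampled independently with replacement.
    """
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    diffs = []
    for _ in range(n_resamples):
        resample_a = [rng.choice(scores_a) for _ in scores_a]
        resample_b = [rng.choice(scores_b) for _ in scores_b]
        diffs.append(fmean(resample_a) - fmean(resample_b))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


if __name__ == "__main__":
    # Made-up per-run benchmark scores for two systems, not values from the paper.
    system_a = [0.61, 0.58, 0.66, 0.63, 0.60]
    system_b = [0.52, 0.49, 0.55, 0.50, 0.53]
    print(bootstrap_diff_ci(system_a, system_b))
```

An interval that excludes zero would support a claimed improvement; with only a handful of runs per system the interval is wide, which is why the number of trials matters.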

minor comments (1)

Abstract: The abstract cites Figure 1 as demonstrating superiority, but the figure itself is neither described nor included in the provided manuscript.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for your constructive review and for highlighting areas where the abstract could be strengthened. We agree that additional details on the experimental setup, trajectory-splitting SFT, and progressive RL schedule will improve the manuscript and help readers assess the claims. We will revise the abstract accordingly in the next version. Our point-by-point responses are below.

Referee: Abstract: The 11.28% improvement of KLong (106B) over Kimi K2 Thinking (1T) on PaperBench is presented without any description of the experimental setup, baselines, number of trials, statistical significance, or controls for confounds arising from the Claude 4.5 Sonnet distillation process.

Authors: We agree that the abstract should convey more about the evaluation protocol. In the revised version we will expand the abstract to briefly describe the PaperBench task distribution, the primary baselines (including Kimi K2 Thinking), the number of evaluation runs, the statistical tests performed, and the controls used to isolate effects of the Research-Factory distillation pipeline from Claude 4.5 Sonnet. These additions will be drawn from the experimental section while keeping the abstract concise.

Revision: yes

Referee: Abstract: The trajectory-splitting SFT is described only at a high level (preserving early context, progressive truncation, overlap between sub-trajectories); no specifics on the splitting algorithm, overlap size, truncation points, or how long-horizon dependencies are maintained are provided, which is load-bearing for the core methodological claim.

Authors: We acknowledge the abstract currently gives only a high-level characterization. We will revise it to include concrete parameters of the splitting procedure (such as the overlap size, the progressive truncation schedule, and the mechanism that retains early context to preserve long-horizon dependencies) while still directing readers to the full algorithmic description and pseudocode in the methods section.

Revision: yes

Referee: Abstract: The progressive RL schedule (multiple stages with extended timeouts) lacks any details on stage durations, timeout progression, hyperparameters, or ablations isolating the contribution of each component, preventing assessment of the generalization claims to SWE-bench Verified and MLE-bench.

Authors: We agree that the abstract omits necessary specifics on the staged RL curriculum. In the revision we will add a concise description of the stage durations, the timeout progression schedule, key hyperparameters, and reference to the ablations that isolate each stage’s contribution. This will better support the reported generalization to SWE-bench Verified and MLE-bench.

Revision: yes

Circularity Check

0 steps flagged

No circularity; empirical training and benchmark claims

The abstract describes a procedural training pipeline (trajectory-splitting SFT followed by progressive RL on distilled trajectories) and reports empirical benchmark results such as the 11.28% gain on PaperBench. No equations, derivations, fitted parameters, or self-referential definitions appear. Claims rest on direct model comparisons rather than any reduction of outputs to inputs by construction. No load-bearing steps of the enumerated kinds exist, matching the default expectation of no significant circularity for an empirical methods paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; training details are high-level descriptions of standard LLM fine-tuning and RL techniques.

pith-pipeline@v0.9.0 · 5523 in / 1287 out tokens · 29879 ms · 2026-05-15T20:45:41.123186+00:00 · methodology