KLong: Training LLM Agent for Extremely Long-horizon Tasks
Pith reviewed 2026-05-15 20:45 UTC · model grok-4.3
The pith
KLong shows that trajectory-splitting SFT followed by progressive RL lets a 106B agent outperform a 1T model on extremely long-horizon tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that KLong, trained first with trajectory-splitting supervised fine-tuning on thousands of long-horizon trajectories and then with progressive reinforcement learning across multiple stages of increasing timeouts, surpasses the 1T-parameter Kimi K2 Thinking by 11.28 percent on PaperBench while generalizing the gains to other coding benchmarks.
What carries the argument
Trajectory-splitting SFT, which preserves early context, progressively truncates later parts, and maintains sub-trajectory overlap, paired with progressive RL that schedules training into stages with extended timeouts.
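The abstract gives no splitting algorithm, so the following is only a minimal sketch of one possible reading of "preserve early context, truncate later context, keep overlap". It assumes a trajectory is a list of turns, that the first `prefix_len` turns are the early context to keep verbatim, and that the rest is covered by a sliding window. The names `split_trajectory`, `prefix_len`, `window`, and `overlap` are hypothetical and do not come from the paper.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_trajectory(
    turns: Sequence[T], prefix_len: int, window: int, overlap: int
) -> List[List[T]]:
    """Split one long trajectory into overlapping sub-trajectories.

    Assumption (not from the paper): every piece is the first `prefix_len`
    turns, kept verbatim, followed by a sliding window over the later turns.
    Consecutive windows share `overlap` turns.
    """
    if prefix_len < 0 or window <= 0 or not 0 <= overlap < window:
        raise ValueError("need prefix_len >= 0 and 0 <= overlap < window")

    prefix = list(turns[:prefix_len])
    body = list(turns[prefix_len:])
    if not body:
        # Nothing beyond the early context: return it as a single piece.
        return [prefix] if prefix else []

    stride = window - overlap  # how far the window advances each step
    pieces: List[List[T]] = []
    start = 0
    while True:
        # Turns before `start` are truncated; the prefix is always retained.
        pieces.append(prefix + body[start:start + window])
        if start + window >= len(body):
            break
        start += stride
    return pieces


# Toy usage with integer "turns" standing in for real messages.
pieces = split_trajectory(list(range(10)), prefix_len=2, window=4, overlap=1)
print(pieces)  # [[0, 1, 2, 3, 4, 5], [0, 1, 5, 6, 7, 8], [0, 1, 8, 9]]
```

A real implementation would likely budget by tokens rather than turns and might shrink the retained window as the trajectory grows; both details are left open here because the source does not specify them.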
If this is right
- The performance advantage transfers to other coding benchmarks such as SWE-bench Verified and MLE-bench.
- Moderate-sized models can sustain agentic behavior over task lengths that previously favored much larger models.
- Staged training with lengthening timeouts allows capability to scale with horizon without immediate degradation.
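The staged schedule mentioned in the last bullet can be sketched as follows. This is an illustration under stated assumptions, not the paper's schedule: the abstract gives neither the number of stages nor the timeout values, so the geometric growth rule, the `Stage` type, and the caller-supplied `run_stage` callable are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, TypeVar

R = TypeVar("R")


@dataclass(frozen=True)
class Stage:
    index: int
    timeout_s: float  # per-rollout time limit for this stage, in seconds


def timeout_schedule(
    initial_timeout_s: float, growth: float, num_stages: int
) -> List[Stage]:
    """Build stages whose timeouts grow geometrically (an assumed rule)."""
    if initial_timeout_s <= 0 or growth <= 1 or num_stages < 1:
        raise ValueError("need initial_timeout_s > 0, growth > 1, num_stages >= 1")
    return [Stage(i, initial_timeout_s * growth ** i) for i in range(num_stages)]


def train_progressively(
    stages: List[Stage], run_stage: Callable[[Stage], R]
) -> List[R]:
    """Run the stages in order, shortest timeout first.

    `run_stage` is a placeholder for whatever RL loop is used; it receives
    the stage so it can enforce `stage.timeout_s` on each rollout.
    """
    return [run_stage(stage) for stage in stages]


# Placeholder values only: a 1-hour first stage that doubles twice.
stages = timeout_schedule(initial_timeout_s=3600.0, growth=2.0, num_stages=3)
print([s.timeout_s for s in stages])  # [3600.0, 7200.0, 14400.0]
```

Keeping the schedule separate from the training loop makes it easy to compare against a single fixed timeout, which is the kind of ablation the referee report asks for.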
Where Pith is reading between the lines
- If the recipe works beyond the reported benchmarks, developers could achieve long-horizon agents with fewer total parameters by emphasizing trajectory structure over raw model scale.
- Applying the same splitting and staging pattern to non-coding domains such as multi-step scientific planning would test whether the gains depend on the nature of the tasks.
- Removing the distillation from a stronger model and training only on self-generated trajectories would reveal whether the initial data quality is essential or whether the method can bootstrap from weaker starting points.
Load-bearing premise
The distilled trajectories supply enough signal that the splitting step and staged RL training can extend context handling without hidden collapse on long sequences.
What would settle it
Evaluating the trained model on tasks whose required horizon length greatly exceeds the longest training trajectories and finding that its success rate falls below the larger baseline or drops sharply after a fixed length.
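One way to run this check is to bucket evaluation episodes by horizon length and look for the first bucket where the success rate falls sharply. The sketch below assumes each episode is recorded as a `(horizon, succeeded)` pair, with horizon measured in agent steps. The function names, the bucket size, and the drop threshold are hypothetical choices, and the sample data is a made-up toy input, not a result from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple


def success_by_horizon(
    results: Iterable[Tuple[int, bool]], bucket_size: int
) -> Dict[int, float]:
    """Map each horizon bucket (keyed by its lower bound) to a success rate."""
    if bucket_size <= 0:
        raise ValueError("bucket_size must be positive")
    totals: Dict[int, int] = defaultdict(int)
    wins: Dict[int, int] = defaultdict(int)
    for horizon, succeeded in results:
        bucket = (horizon // bucket_size) * bucket_size
        totals[bucket] += 1
        wins[bucket] += int(succeeded)
    return {b: wins[b] / totals[b] for b in sorted(totals)}


def first_sharp_drop(rates: Dict[int, float], max_drop: float) -> Optional[int]:
    """Return the first bucket whose rate is more than `max_drop` below
    the previous bucket's rate, or None if no such bucket exists."""
    buckets = sorted(rates)
    for prev, cur in zip(buckets, buckets[1:]):
        if rates[prev] - rates[cur] > max_drop:
            return cur
    return None


# Toy input for illustration only.
toy = [(10, True), (20, True), (110, True), (120, False), (210, False), (220, False)]
rates = success_by_horizon(toy, bucket_size=100)
print(rates)                                   # {0: 1.0, 100: 0.5, 200: 0.0}
print(first_sharp_drop(rates, max_drop=0.4))   # 100
```

Running the same bucketing on the larger baseline and comparing the two curves bucket by bucket would show whether the smaller model falls behind once horizons exceed its training range.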
Original abstract
This paper introduces KLong, an open-source LLM agent trained to solve extremely long-horizon tasks. The principle is to first cold-start the model via trajectory-splitting SFT, then scale it via progressive RL training. Specifically, we first activate basic agentic abilities of a base model with a comprehensive SFT recipe. Then, we introduce Research-Factory, an automated pipeline that generates high-quality training data by collecting research papers and constructing evaluation rubrics. Using this pipeline, we build thousands of long-horizon trajectories distilled from Claude 4.5 Sonnet (Thinking). To train with these extremely long trajectories, we propose a new trajectory-splitting SFT, which preserves early context, progressively truncates later context, and maintains overlap between sub-trajectories. In addition, to further improve long-horizon task-solving capability, we propose a novel progressive RL, which schedules training into multiple stages with progressively extended timeouts. Experiments demonstrate the superiority and generalization of KLong, as shown in Figure 1. Notably, our proposed KLong (106B) surpasses Kimi K2 Thinking (1T) by 11.28% on PaperBench, and the performance improvement generalizes to other coding benchmarks like SWE-bench Verified and MLE-bench.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces KLong, an open-source LLM agent for extremely long-horizon tasks. It first cold-starts a base model via trajectory-splitting SFT on thousands of trajectories that the Research-Factory pipeline distills from Claude 4.5 Sonnet (Thinking), then scales via progressive RL with multiple stages of increasing timeouts. The central empirical claim is that KLong (106B) outperforms Kimi K2 Thinking (1T) by 11.28% on PaperBench, with the gains generalizing to SWE-bench Verified and MLE-bench.
Significance. If the performance gains and generalization hold under rigorous validation, the work would represent a meaningful advance in training LLM agents for long-horizon tasks by showing how automated distillation pipelines and staged RL can be combined effectively. The open-source framing and cross-benchmark claims are potentially valuable contributions to the field.
major comments (3)
- Abstract: The 11.28% improvement of KLong (106B) over Kimi K2 Thinking (1T) on PaperBench is presented without any description of the experimental setup, baselines, number of trials, statistical significance, or controls for confounds arising from the Claude 4.5 Sonnet distillation process.
- Abstract: The trajectory-splitting SFT is described only at a high level (preserving early context, progressive truncation, overlap between sub-trajectories); no specifics on the splitting algorithm, overlap size, truncation points, or how long-horizon dependencies are maintained are provided, which is load-bearing for the core methodological claim.
- Abstract: The progressive RL schedule (multiple stages with extended timeouts) lacks any details on stage durations, timeout progression, hyperparameters, or ablations isolating the contribution of each component, preventing assessment of the generalization claims to SWE-bench Verified and MLE-bench.
minor comments (1)
- Abstract: Figure 1 is cited as demonstrating superiority, but the figure itself is neither described nor included in the provided manuscript.
Simulated Author's Rebuttal
Thank you for your constructive review and for highlighting areas where the abstract could be strengthened. We agree that additional details on the experimental setup, trajectory-splitting SFT, and progressive RL schedule will improve the manuscript and help readers assess the claims. We will revise the abstract accordingly in the next version. Our point-by-point responses are below.
- Referee: Abstract: The 11.28% improvement of KLong (106B) over Kimi K2 Thinking (1T) on PaperBench is presented without any description of the experimental setup, baselines, number of trials, statistical significance, or controls for confounds arising from the Claude 4.5 Sonnet distillation process.
  Authors: We agree that the abstract should convey more about the evaluation protocol. In the revised version we will expand the abstract to briefly describe the PaperBench task distribution, the primary baselines (including Kimi K2 Thinking), the number of evaluation runs, the statistical tests performed, and the controls used to isolate effects of the Research-Factory distillation pipeline from Claude 4.5 Sonnet. These additions will be drawn from the experimental section while keeping the abstract concise.
  Revision: yes
- Referee: Abstract: The trajectory-splitting SFT is described only at a high level (preserving early context, progressive truncation, overlap between sub-trajectories); no specifics on the splitting algorithm, overlap size, truncation points, or how long-horizon dependencies are maintained are provided, which is load-bearing for the core methodological claim.
  Authors: We acknowledge the abstract currently gives only a high-level characterization. We will revise it to include concrete parameters of the splitting procedure (such as the overlap size, the progressive truncation schedule, and the mechanism that retains early context to preserve long-horizon dependencies) while still directing readers to the full algorithmic description and pseudocode in the methods section.
  Revision: yes
- Referee: Abstract: The progressive RL schedule (multiple stages with extended timeouts) lacks any details on stage durations, timeout progression, hyperparameters, or ablations isolating the contribution of each component, preventing assessment of the generalization claims to SWE-bench Verified and MLE-bench.
  Authors: We agree that the abstract omits necessary specifics on the staged RL curriculum. In the revision we will add a concise description of the stage durations, the timeout progression schedule, key hyperparameters, and reference to the ablations that isolate each stage’s contribution. This will better support the reported generalization to SWE-bench Verified and MLE-bench.
  Revision: yes
Circularity Check
No circularity; empirical training and benchmark claims
The abstract describes a procedural training pipeline (trajectory-splitting SFT followed by progressive RL on distilled trajectories) and reports empirical benchmark results such as the 11.28% gain on PaperBench. No equations, derivations, fitted parameters, or self-referential definitions appear. Claims rest on direct model comparisons rather than any reduction of outputs to inputs by construction. No load-bearing steps of the enumerated kinds exist, matching the default expectation of no significant circularity for an empirical methods paper.