
arxiv: 2606.30616 · v1 · pith:GHUP7TQ7new · submitted 2026-06-29 · 💻 cs.CL

Scaling the Horizon, Not the Parameters: Reaching Trillion-Parameter Performance with a 35B Agent

Pith reviewed 2026-06-30 05:49 UTC · model grok-4.3

classification 💻 cs.CL
keywords agent horizon scaling · 35B MoE model · long-horizon trajectories · multi-teacher distillation · agent benchmarks · knowledge-action infrastructure · domain-routed training

The pith

A 35B agent reaches trillion-parameter performance on long-horizon tasks by scaling trajectories instead of model size.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that a 35-billion-parameter Mixture-of-Experts model can match or exceed 1T-parameter models on agent benchmarks by extending the length and diversity of its reasoning paths. It constructs a knowledge-action infrastructure that generates trajectories averaging 45K tokens, linking external knowledge, actions, observations, and verifier outcomes. A three-stage training process first aligns the base model through full-domain supervised fine-tuning, then trains specialized domain-level teachers, and finally applies multi-teacher domain-routed on-policy distillation with salient vocabulary alignment to unify six domains in one student model. If this holds, it offers a route to strong agent performance without the compute cost of trillion-scale parameter counts.

Core claim

Agents-A1, a 35B Mixture-of-Experts model trained via a three-stage recipe on long-horizon trajectories averaging 45K tokens, achieves leading scores on SEAL-0 (56.4), IFBench (80.6), HiPhO (46.4), FrontierScience-Olympiad (79.0), and MolBench-Bind (56.8) while remaining competitive on SciCode (44.3), HLE (47.6), and BrowseComp (75.5) against 1T models such as Kimi-K2.6 and DeepSeek-V4-pro.

What carries the argument

The long-horizon knowledge-action infrastructure that connects external knowledge, actions, observations, and verifier outcomes to produce representative agentic trajectories.
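To make the four linked components concrete, here is a minimal sketch of what one trajectory record might look like. The abstract does not specify a schema, so every name here (`Step`, `Trajectory`, `verifier_ok`, `token_count`) is a hypothetical illustration, and whitespace splitting stands in for a real tokenizer.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    """One turn of a hypothetical agentic trajectory (schema is an assumption)."""
    knowledge: str                      # external knowledge retrieved for this step
    action: str                         # tool call or command issued by the agent
    observation: str                    # what the environment returned
    verifier_ok: Optional[bool] = None  # verifier outcome, None if no verifier ran


@dataclass
class Trajectory:
    domain: str
    steps: List[Step] = field(default_factory=list)

    def token_count(self) -> int:
        # Whitespace split is a crude stand-in for a real tokenizer.
        return sum(
            len(text.split())
            for s in self.steps
            for text in (s.knowledge, s.action, s.observation)
        )

    def verified(self) -> bool:
        # Accept only if at least one verifier ran and none of them failed.
        outcomes = [s.verifier_ok for s in self.steps if s.verifier_ok is not None]
        return bool(outcomes) and all(outcomes)
```

Under this reading, a filter such as `verified()` is where verifier outcomes would decide which trajectories enter training; whether the paper filters this way is not stated.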

Load-bearing premise

The generated long-horizon trajectories are assumed to represent real deployment conditions and to transfer across the six domains without overfitting or benchmark leakage.

What would settle it

A new long-horizon agent benchmark constructed after the training data cutoff, with no trajectory overlap, would show whether Agents-A1 maintains its reported advantage over the 1T models.


We introduce Agents-A1, a 35B Mixture-of-Experts Agentic Model that reaches trillion-parameter-level performance by scaling the agent horizon. We investigate agent-horizon scaling from two perspectives: scaling long-horizon trajectories and scaling heterogeneous agent abilities. To support this goal, we build a long-horizon knowledge-action infrastructure that connects external knowledge, actions, observations, and verifier outcomes, producing agentic trajectories with an average length of 45K tokens. Based on this, we train Agents-A1 with a three-stage recipe. First, we perform full-domain supervised fine-tuning to align the base model with broad agentic behaviors. Second, we train domain-level teacher models to capture specialized expertise in each domain. Third, we propose a multi-teacher domain-routed on-policy distillation with salient vocabulary alignment to improve knowledge transfer efficiency across different domains, unifying six heterogeneous domains into one deployable student model. Agents-A1 achieves strong and broad performance on long-horizon agent benchmarks. Compared with 1T-parameter models such as Kimi-K2.6 and DeepSeek-V4-pro, Agents-A1 achieves leading results on SEAL-0 (56.4), IFBench (80.6), HiPhO (46.4), FrontierScience-Olympiad (79.0), and MolBench-Bind (56.8), and remains highly competitive on SciCode (44.3), HLE (47.6), and BrowseComp (75.5). We hope this work provides the community with a practical path for scaling the horizon using a 35B agent that can reach or match the performance of 1T models on long-horizon tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Agents-A1, a 35B Mixture-of-Experts agentic model that reaches performance levels comparable to 1T-parameter models on long-horizon tasks by scaling agent horizons rather than parameters. It describes a knowledge-action infrastructure generating trajectories averaging 45K tokens, followed by a three-stage training process (full-domain SFT, domain-specific teachers, and multi-teacher on-policy distillation with vocabulary alignment) that unifies six heterogeneous domains into a single deployable model. The central empirical claim is that Agents-A1 leads or matches 1T models on benchmarks including SEAL-0 (56.4), IFBench (80.6), HiPhO (46.4), FrontierScience-Olympiad (79.0), MolBench-Bind (56.8), while remaining competitive on SciCode, HLE, and BrowseComp.

Significance. If the performance claims hold after proper verification, the work would provide evidence that horizon scaling via long trajectories and multi-domain distillation can be more efficient than parameter scaling for agentic capabilities, offering a practical route to high-performance agents on smaller models. The explicit infrastructure for 45K-token trajectories and the three-stage recipe would constitute reusable contributions if accompanied by sufficient controls and ablations.

major comments (3)
  1. [Abstract] The reported benchmark scores (e.g., SEAL-0 56.4, IFBench 80.6) are presented without any description of evaluation protocols, controls for data leakage from the 45K-token trajectories, error bars, or statistical significance testing. This information is load-bearing for the central claim that Agents-A1 matches or exceeds 1T models such as Kimi-K2.6 and DeepSeek-V4-pro.
  2. [Training recipe and infrastructure] No ablation is reported that isolates the contribution of the long-horizon (45K-token) trajectories from the three-stage recipe or that verifies the trajectories were generated without including or paraphrasing items from the six evaluation benchmarks. Without such controls, the cross-domain generalization claim rests on an unverified assumption.
  3. [Results and comparison] The headline results against 1T models are stated as leading on five benchmarks, yet the manuscript supplies no details on whether the evaluation sets were held out from the knowledge-action infrastructure data or on any contamination audit. This directly affects the validity of the horizon-scaling thesis.
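On the error-bar point in major comment 1, a single evaluation run can still yield uncertainty estimates if per-item outcomes are kept. The following is a minimal sketch of a percentile bootstrap over per-item 0/1 correctness. The function name `bootstrap_ci` and its defaults are illustrative assumptions, not anything taken from the manuscript, and any data passed in below is synthetic.

```python
import random
from typing import List, Tuple


def bootstrap_ci(correct: List[int], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float, float]:
    """Return (accuracy, lower, upper) from per-item 0/1 outcomes.

    Uses a percentile bootstrap: resample items with replacement,
    recompute accuracy each time, and read off the alpha/2 quantiles.
    """
    rng = random.Random(seed)
    n = len(correct)
    means = []
    for _ in range(n_resamples):
        sample = [correct[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(correct) / n, lower, upper


# Synthetic example: 100 items, 56 answered correctly (not the paper's data).
acc, lower, upper = bootstrap_ci([1] * 56 + [0] * 44)
```

Because this resamples stored outcomes rather than re-running the model, it captures item-sampling variance only, not run-to-run variance from decoding or tool nondeterminism.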
minor comments (2)
  1. [Abstract] The abstract and introduction use the term 'leading results' without defining the precise ranking criteria or listing all competing models evaluated.
  2. [Method] Notation for the multi-teacher distillation step (vocabulary alignment) is introduced at a high level; a concrete equation or pseudocode would improve reproducibility.
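As an illustration of the kind of pseudocode minor comment 2 asks for, the sketch below shows one possible reading of the distillation step. It is not the authors' method: the abstract does not define "salient vocabulary alignment", so the code ASSUMES it means restricting the loss to the teacher's top-k tokens at each position. Routing by an explicit domain label and the reverse-KL direction (as used in the on-policy distillation literature) are likewise assumptions, and all function names are hypothetical.

```python
import math
from typing import Callable, Dict, List, Sequence

Dist = Sequence[float]  # probability vector over a shared vocabulary


def salient_ids(teacher: Dist, k: int) -> List[int]:
    # ASSUMPTION: "salient" tokens are the teacher's top-k at this position.
    return sorted(range(len(teacher)), key=lambda i: -teacher[i])[:k]


def restricted(dist: Dist, ids: List[int]) -> List[float]:
    # Renormalize a distribution over the selected token ids only.
    mass = max(sum(dist[i] for i in ids), 1e-12)
    return [dist[i] / mass for i in ids]


def reverse_kl(student: Dist, teacher: Dist, eps: float = 1e-12) -> float:
    # KL(student || teacher), evaluated on the student's own distribution.
    return sum(s * math.log((s + eps) / (t + eps))
               for s, t in zip(student, teacher))


def routed_loss(domain: str,
                student_dists: List[Dist],
                teachers: Dict[str, Callable[[int], Dist]],
                k: int = 2) -> float:
    """Mean per-position loss against the teacher selected by domain label.

    student_dists holds the student's next-token distributions at positions
    of a rollout sampled from the student itself (the on-policy part).
    """
    teacher_fn = teachers[domain]  # domain routing: one teacher per domain
    total = 0.0
    for pos, s in enumerate(student_dists):
        t = teacher_fn(pos)
        ids = salient_ids(t, k)
        total += reverse_kl(restricted(s, ids), restricted(t, ids))
    return total / len(student_dists)
```

In this toy form each teacher is a function from position to a probability vector; in practice it would be a forward pass of the domain teacher over the student's sampled sequence.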

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for greater transparency on evaluation protocols, ablations, and contamination controls. These points are important for strengthening the central claims. We respond to each major comment below and will revise the manuscript accordingly.

  1. Referee: [Abstract] The reported benchmark scores (e.g., SEAL-0 56.4, IFBench 80.6) are presented without any description of evaluation protocols, controls for data leakage from the 45K-token trajectories, error bars, or statistical significance testing. This information is load-bearing for the central claim that Agents-A1 matches or exceeds 1T models such as Kimi-K2.6 and DeepSeek-V4-pro.

    Authors: We agree that the abstract lacks these details. In the revision we will expand the evaluation section to describe the protocols used for each benchmark, steps taken to mitigate data leakage from the trajectory data, and any available statistical information. Error bars were not computed owing to the prohibitive cost of repeated full evaluations; we will explicitly note this limitation and discuss observed variance across domains where feasible. revision: yes

  2. Referee: [Training recipe and infrastructure] No ablation is reported that isolates the contribution of the long-horizon (45K-token) trajectories from the three-stage recipe or that verifies the trajectories were generated without including or paraphrasing items from the six evaluation benchmarks. Without such controls, the cross-domain generalization claim rests on an unverified assumption.

    Authors: We acknowledge the absence of a dedicated ablation separating trajectory length from the three-stage recipe. We will add such an ablation comparing shorter- versus full-length trajectories. We will also document the data-generation pipeline and any decontamination procedures applied to ensure the 45K-token trajectories do not contain or paraphrase benchmark items. revision: yes

  3. Referee: [Results and comparison] The headline results against 1T models are stated as leading on five benchmarks, yet the manuscript supplies no details on whether the evaluation sets were held out from the knowledge-action infrastructure data or on any contamination audit. This directly affects the validity of the horizon-scaling thesis.

    Authors: We will add a dedicated subsection confirming that all evaluation sets were held out from the knowledge-action infrastructure and describing the contamination audit performed. These details will be placed in the results section to support the reported comparisons. revision: yes
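The contamination audit promised in responses 2 and 3 is not described in the available text. As a point of reference, a common baseline is an n-gram overlap check between benchmark items and training trajectories, sketched below. The function names, the 8-gram default, and the 0.5 threshold are illustrative assumptions. Note that this check catches verbatim reuse only; the paraphrase concern raised by the referee would need a separate method, such as embedding similarity.

```python
from typing import List, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    # Lowercased whitespace tokens; a real audit would use the model tokenizer.
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item: str, trajectory: str, n: int = 8) -> float:
    """Fraction of the benchmark item's n-grams that also occur in the trajectory."""
    item = ngrams(benchmark_item, n)
    if not item:
        return 0.0
    return len(item & ngrams(trajectory, n)) / len(item)


def flag_contaminated(items: List[str], trajectories: List[str],
                      n: int = 8, threshold: float = 0.5) -> List[int]:
    """Return indices of benchmark items whose overlap with any trajectory
    meets or exceeds the threshold."""
    flagged = []
    for idx, item in enumerate(items):
        if any(overlap_ratio(item, t, n) >= threshold for t in trajectories):
            flagged.append(idx)
    return flagged
```

For example, with `n=3`, an item that appears word for word inside a trajectory gets an overlap ratio of 1.0 and is flagged, while an unrelated item scores 0.0.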

Circularity Check

0 steps flagged

No circularity: the empirical benchmark claims rest on the described training stages, without self-referential reductions or load-bearing self-citations.


The provided manuscript text (abstract plus context) describes a three-stage training recipe (full-domain SFT, domain teachers, multi-teacher distillation) and reports benchmark scores as outcomes of the long-horizon infrastructure. No equations, fitted parameters renamed as predictions, or self-citation chains appear that would make any result equivalent to its inputs by construction. The central claim is an empirical comparison to 1T models; absent any derivation that collapses to the input data or prior self-work, the derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no information on free parameters, axioms or invented entities; the ledger is therefore empty.

pith-pipeline@v0.9.1-grok · 6031 in / 1243 out tokens · 30300 ms · 2026-06-30T05:49:39.484221+00:00 · methodology

