Toward Autonomous Long-Horizon Engineering for ML Research

Cheng Chen; Fanzhe Meng; Guoxin Chen; Jiale Zhao; Jie Chen; Ji-Rong Wen; Kai Jia; Lei Chen; Ruihua Song; Wayne Xin Zhao

arxiv: 2604.13018 · v2 · pith:ZUALWONYnew · submitted 2026-04-14 · 💻 cs.CL

Pith reviewed 2026-05-10 15:46 UTC · model grok-4.3

classification 💻 cs.CL

keywords autonomous research agents · long-horizon tasks · hierarchical orchestration · durable state · File-as-Bus · ML engineering · PaperBench · MLE-Bench

The pith

AiScientist achieves higher performance on long-horizon ML research benchmarks by using hierarchical orchestration and a File-as-Bus workspace.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that autonomous agents can handle the full cycle of ML research engineering over long periods when given both a hierarchical structure for direction and a durable file-based workspace for maintaining state. This matters because typical agent setups lose coherence as tasks stretch across setup, coding, testing, and iteration. AiScientist uses an orchestrator to track stages with summaries and maps, while agents repeatedly consult persistent files holding plans, code, and results instead of depending on chat history. Results on PaperBench and MLE-Bench Lite support the design: removing the File-as-Bus protocol lowers the PaperBench score by 6.41 points and MLE-Bench Lite Any Medal% by 31.82 points. If the approach holds, it points to treating extended research as coordination across shared artifacts rather than isolated reasoning steps.

Core claim

We present AiScientist as a system for long-horizon ML research engineering that integrates hierarchical orchestration with a permission-scoped File-as-Bus workspace. The orchestrator exerts thin control by issuing concise summaries and maintaining a workspace map, while specialized agents re-ground their work on durable artifacts including analyses, plans, code, and experimental evidence. This architecture produces coherent multi-stage progress and delivers measurable gains: an average 10.54-point improvement on PaperBench over the strongest baseline and 81.82 Any Medal% on MLE-Bench Lite. Ablation experiments identify the File-as-Bus protocol as a primary contributor to these outcomes.

What carries the argument

The File-as-Bus workspace under hierarchical orchestration: agents exchange and persist project state through files rather than conversation, with an orchestrator providing high-level direction via summaries and maps.
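To make the idea concrete, here is a minimal sketch of a permission-scoped file workspace. It is an illustration under stated assumptions, not the paper's implementation: the `Workspace` class, the role names, the directory layout, and the prefix-based write policy are all hypothetical, and the sketch does not guard against path traversal.

```python
from pathlib import Path
import tempfile


class Workspace:
    """Hypothetical permission-scoped file workspace (illustrative only)."""

    def __init__(self, root: Path, write_scopes: dict[str, list[str]]):
        self.root = root
        # Assumed policy: role -> directory prefixes that role may write to.
        self.write_scopes = write_scopes

    def write(self, role: str, rel_path: str, text: str) -> None:
        allowed = self.write_scopes.get(role, [])
        if not any(rel_path.startswith(prefix) for prefix in allowed):
            raise PermissionError(f"{role} may not write {rel_path}")
        target = self.root / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(text, encoding="utf-8")

    def read(self, rel_path: str) -> str:
        # Any role may read; state lives in files, not in chat history.
        return (self.root / rel_path).read_text(encoding="utf-8")

    def workspace_map(self) -> list[str]:
        # A flat listing an orchestrator could summarize for other agents.
        return sorted(
            p.relative_to(self.root).as_posix()
            for p in self.root.rglob("*")
            if p.is_file()
        )


def demo() -> tuple[list[str], str, bool]:
    with tempfile.TemporaryDirectory() as tmp:
        scopes = {"planner": ["plans/"], "coder": ["code/", "results/"]}
        ws = Workspace(Path(tmp), scopes)
        ws.write("planner", "plans/plan.md", "1. implement baseline\n")
        # A later invocation re-grounds on the persisted plan.
        plan = ws.read("plans/plan.md")
        ws.write("coder", "code/train.py", "print('baseline')\n")
        try:
            ws.write("coder", "plans/plan.md", "overwrite")
            denied = False
        except PermissionError:
            denied = True
        return ws.workspace_map(), plan, denied
```

Calling `demo()` writes a plan as one role, reads it back as another, and shows that a write outside a role's scope is rejected. The point of the sketch is that continuity comes from the files themselves, so an agent with no conversational memory can still pick up where the previous one stopped.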

Load-bearing premise

The benchmarks used reflect real-world long-horizon ML research demands and the performance differences arise chiefly from the proposed orchestration and File-as-Bus components.

What would settle it

An experiment showing that a baseline agent with only conversational memory achieves similar scores on PaperBench and MLE-Bench Lite, or a new benchmark where the AiScientist design fails to maintain progress over longer periods.

Figures

Figures reproduced from arXiv: 2604.13018 by Cheng Chen, Fanzhe Meng, Guoxin Chen, Jiale Zhao, Jie Chen, Ji-Rong Wen, Kai Jia, Lei Chen, Ruihua Song, Wayne Xin Zhao.

**Figure 2.** Architecture of AiScientist, an artifact-mediated research lab. A Tier-0 Orchestrator keeps …

**Figure 3.** Mechanism analysis of AiScientist under GLM-5. Left: AiScientist outperforms both a …

Original abstract

Agentic systems increasingly automate pieces of AI research. Yet turning underspecified research objectives into runnable, experimentally validated ML systems remains a central bottleneck. We study this operational setting as long-horizon ML research engineering: converting a research specification into a runnable ML system through repeated implementation, experimentation, and refinement. The central challenge is to sustain cumulative project progress across heterogeneous stages under delayed, confounded feedback. We introduce AiScientist, a multi-agent system built around thin control over thick state: a lightweight hierarchical research team coordinates through a File-as-Bus workspace that preserves decision-relevant artifacts across roles and invocations. On PaperBench, AiScientist improves over the strongest matched baselines by 9.92 and 11.15 points with Gemini-3-Flash and GLM-5, respectively. On MLE-Bench Lite, it reaches 81.82 Any Medal% under both backbones, improving over the strongest matched baselines by 4.55 and 16.67 points, and exceeding a Codex/GPT-5.5 xhigh frontier harness reference by 13.64 Any Medal points. Ablations and process analyses show that durable project state is central to later-round refinement: removing File-as-Bus lowers PaperBench score by 6.41 points and MLE-Bench Lite Any Medal% by 31.82 points. These results suggest that long-horizon AI research is not only a problem of stronger local reasoning, but a systems problem of maintaining cumulative, inspectable project progress.
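The 10.54-point average quoted elsewhere on this page is consistent with the two per-backbone PaperBench gains in the abstract. The short check below reproduces that arithmetic and derives the comparison scores implied by the MLE-Bench Lite deltas. The variable names are illustrative, the derived scores are not reported in the abstract, and the ablated value assumes the 31.82-point drop is measured against the full system's 81.82.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def mean_gain(gains: list[Decimal]) -> Decimal:
    """Arithmetic mean, rounded half-up to two decimals."""
    total = sum(gains, Decimal("0"))
    return (total / len(gains)).quantize(CENT, rounding=ROUND_HALF_UP)


# Figures taken from the abstract.
paperbench_gains = [Decimal("9.92"), Decimal("11.15")]
any_medal = Decimal("81.82")
mle_gains = [Decimal("4.55"), Decimal("16.67")]
frontier_gap = Decimal("13.64")
ablation_drop = Decimal("31.82")

# Derived quantities (implied by the abstract, not reported in it).
average_gain = mean_gain(paperbench_gains)
implied_baselines = [any_medal - g for g in mle_gains]
implied_frontier = any_medal - frontier_gap
implied_ablated = any_medal - ablation_drop

print(average_gain, implied_baselines, implied_frontier, implied_ablated)
```

Exact decimal arithmetic is used here because the mean is 10.535, which sits on a rounding boundary where binary floating point can round the wrong way.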

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

AiScientist combines hierarchical orchestration with a durable File-as-Bus workspace to support long-horizon ML research agents and reports concrete benchmark gains, though the attribution to those mechanisms rests on partially described controls.

The paper's main contribution is a systems-level architecture for autonomous ML research engineering. AiScientist uses a top-level orchestrator that maintains stage summaries and a workspace map, while specialized agents operate on a permission-scoped file workspace that serves as persistent state. This is meant to avoid the fragility of pure conversational handoffs over hours or days of work. The approach is tested on two benchmarks: on PaperBench it gains 10.54 points on average over the strongest matched baseline, and on MLE-Bench Lite it reaches an 81.82% Any Medal rate. In the ablations, removing the file protocol lowers the PaperBench score by 6.41 points and the MLE-Bench Lite Any Medal rate by 31.82 points.

Referee Report

1 major / 1 minor

Summary. The paper proposes AiScientist, a system for autonomous long-horizon engineering in ML research. It combines hierarchical orchestration, where a top-level Orchestrator uses concise summaries and a workspace map for stage-level control, with specialized agents that rely on a durable File-as-Bus workspace for state continuity instead of conversational handoffs. Evaluations on PaperBench and MLE-Bench Lite show an average 10.54 point improvement on PaperBench over the best matched baseline and 81.82 Any Medal% on MLE-Bench Lite. Ablations indicate that removing the File-as-Bus protocol reduces scores by 6.41 on PaperBench and 31.82 on MLE-Bench Lite.

Significance. Should the results prove robust under controlled conditions, the work is significant in demonstrating that long-horizon ML research tasks benefit from systems-level designs emphasizing structured coordination and persistent state management. The explicit use of benchmarks with reported ablations strengthens the case for this approach over purely reasoning-focused methods.

major comments (1)

The manuscript states that baselines are 'best matched' and reports ablation results for File-as-Bus removal, but does not detail whether the ablation maintains identical agent sets, model choices, total token budgets, and interaction limits as the full AiScientist system. This information is necessary to attribute the performance differences specifically to the hierarchical orchestration and File-as-Bus design rather than other implementation factors.
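The controls this comment asks for can be stated as a simple check: the ablated run should differ from the full run in exactly one field. The sketch below is illustrative only. `RunConfig`, its field names, the agent names, and the budget values are hypothetical placeholders, not settings reported by the paper.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical run settings; all values below are placeholders."""
    agent_set: tuple[str, ...]
    model: str
    token_budget: int
    interaction_limit: int
    file_as_bus: bool


def differing_fields(a: RunConfig, b: RunConfig) -> list[str]:
    """Names of the fields on which two configurations disagree."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


full = RunConfig(
    agent_set=("orchestrator", "implementer", "experimenter"),  # placeholder roles
    model="GLM-5",
    token_budget=1_000_000,   # placeholder, not from the paper
    interaction_limit=200,    # placeholder, not from the paper
    file_as_bus=True,
)
ablated = replace(full, file_as_bus=False)

# A clean ablation changes only the component under study.
assert differing_fields(full, ablated) == ["file_as_bus"]
```

If the comparison returned more than one field, a score difference could not be attributed to File-as-Bus alone, which is the concern the comment raises.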

minor comments (1)

The abstract could specify the number of experimental runs or include variance measures for the reported average improvements.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the major comment below and will revise the manuscript to provide the requested experimental controls.

Referee: The manuscript states that baselines are 'best matched' and reports ablation results for File-as-Bus removal, but does not detail whether the ablation maintains identical agent sets, model choices, total token budgets, and interaction limits as the full AiScientist system. This information is necessary to attribute the performance differences specifically to the hierarchical orchestration and File-as-Bus design rather than other implementation factors.

Authors: We agree that the manuscript should explicitly document these controls to allow readers to attribute the ablation results to the File-as-Bus protocol. In the revised version we will add a dedicated paragraph in the Experiments section (and update the ablation table caption) stating that the File-as-Bus ablation uses identical agent sets, the same model choices and backends, the same total token budgets, and the same interaction limits as the full AiScientist system. This clarification will be added without altering any reported numbers.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark results with no derivations or self-referential loops

The paper describes an implemented system (AiScientist) and reports measured performance on external benchmarks (PaperBench, MLE-Bench Lite) plus ablation deltas. No equations, first-principles derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. Claims reduce to observed scores rather than any quantity defined in terms of itself or smuggled via prior author work. Attribution concerns (baseline matching, component isolation) are experimental-validity issues, not circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the effectiveness of the described architecture in the given benchmarks. No numerical free parameters are introduced. The main domain assumption is that agents can reliably re-ground on file artifacts for continuity.

axioms (1)

domain assumption: Specialized agents can effectively re-ground on durable file artifacts such as analyses, plans, code, and experimental evidence.
Invoked to justify why File-as-Bus yields thin control over thick state and long-horizon coherence.

invented entities (1)

File-as-Bus workspace · no independent evidence
purpose: Provide permission-scoped durable state continuity across agent interactions
New protocol introduced by the paper to replace primary reliance on conversational handoffs.

pith-pipeline@v0.9.0 · 5560 in / 1284 out tokens · 85650 ms · 2026-05-10T15:46:53.395337+00:00 · methodology

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery
cs.AI 2026-04 accept novelty 8.0
AutoResearchBench is a new benchmark showing top AI agents achieve under 10% success on complex scientific literature discovery tasks that demand deep comprehension and open-ended search.

GEAR: Genetic AutoResearch for Agentic Code Evolution
cs.NE 2026-05 unverdicted novelty 5.0
GEAR applies genetic algorithms to maintain and evolve multiple research states in autonomous code agents, outperforming single-path baselines by continuing to discover improvements over extended runs.

AI for Auto-Research: Roadmap & User Guide
cs.AI 2026-05 unverdicted novelty 4.0
The paper delivers a stage-by-stage roadmap for AI in research, showing reliable assistance in retrieval and tool tasks but fragility in novelty and judgment, advocating human-governed collaboration.