FS-Researcher: Test-Time Scaling for Long-Horizon Research Tasks with File-System-Based Agents

Benfeng Xu; Chiwei Zhu; Mingxuan Du; Shaohan Wang; Xiaorui Wang; Yongdong Zhang; Zhendong Mao

arxiv: 2602.01566 · v2 · submitted 2026-02-02 · 💻 cs.CL

Pith reviewed 2026-05-16 08:55 UTC · model grok-4.3

classification 💻 cs.CL

keywords: LLM agents · test-time scaling · long-horizon tasks · file system · deep research · context window · external memory · multi-agent framework

The pith

A file-system-based dual-agent system lets large language models conduct deep research beyond their context windows by using persistent external memory for knowledge accumulation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the core problem that long research trajectories exceed LLM context limits, forcing trade-offs between evidence gathering and report generation that block effective scaling. FS-Researcher splits responsibilities: a Context Builder agent browses sources, writes structured notes, and archives raw material into a hierarchical knowledge base stored on disk, while a Report Writer agent generates the final report section by section from that base. The file system supplies durable shared storage that survives beyond any single context window and supports iterative refinement across sessions. Experiments on DeepResearch Bench and DeepConsult show state-of-the-art report quality across backbone models, with measurable gains when more test-time compute is allocated to the Context Builder.

Core claim

FS-Researcher is a file-system-based, dual-agent framework in which a Context Builder agent acts as a librarian that browses the internet, produces structured notes, and archives raw sources into a hierarchical knowledge base that grows far beyond context length, while a Report Writer agent composes the final report section by section by treating the knowledge base as its factual source. The file system functions as durable external memory and a shared coordination medium, enabling iterative refinement and test-time scaling that would otherwise be impossible inside a single context window. On two open-ended benchmarks the resulting reports reach state-of-the-art quality that improves in lock

What carries the argument

The file system as durable external memory and shared coordination medium between a Context Builder agent that populates a hierarchical knowledge base and a Report Writer agent that consumes it for report generation.
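
To make the coordination pattern concrete, here is a minimal sketch under stated assumptions. The function names `builder_session` and `writer_section` and the flat note layout are hypothetical and are not taken from the paper's code. The sketch only illustrates the idea that one agent persists notes to disk across sessions while the other loads just the notes a given section needs.

```python
from pathlib import Path


def builder_session(workspace: Path, notes: dict[str, str]) -> None:
    """Hypothetical Context Builder step: persist notes instead of holding them in context."""
    kb = workspace / "knowledge_base"
    kb.mkdir(parents=True, exist_ok=True)
    for name, text in notes.items():
        # Each note becomes its own file, so later sessions can extend the base.
        (kb / f"{name}.md").write_text(text, encoding="utf-8")


def writer_section(workspace: Path, title: str, note_names: list[str]) -> str:
    """Hypothetical Report Writer step: read only the notes one section depends on."""
    kb = workspace / "knowledge_base"
    body = "\n\n".join(
        (kb / f"{name}.md").read_text(encoding="utf-8") for name in note_names
    )
    return f"## {title}\n\n{body}"
```

Because the two functions share nothing but the workspace path, a second `builder_session` call in a later process can add notes that a subsequent `writer_section` call picks up, which is the cross-session behaviour the summary attributes to the file system.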

If this is right

Final report quality improves with greater computation allocated to the Context Builder agent.
The framework delivers state-of-the-art report quality on DeepResearch Bench and DeepConsult across different backbone models.
Long research trajectories can be managed without forcing evidence collection and report writing to compete inside a single context window.
Agents gain the ability to iterate and refine work across multiple sessions through the persistent shared workspace.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same persistent-storage pattern could support other long-horizon agent tasks such as multi-step code development or experimental planning that require retaining large bodies of intermediate results.
Substituting the file system with alternative durable stores such as vector databases or versioned object stores might retain the scaling benefit while changing the error profile.
Scaling will eventually be limited by the accuracy and organization of the accumulated knowledge base rather than by raw context length.

Load-bearing premise

The file system reliably stores and retrieves structured notes and sources without introducing retrieval errors, coordination failures, or data inconsistencies that would degrade agent performance.
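
One way such a premise could be guarded in practice is sketched below. This is an illustration, not something the paper is reported to do: `atomic_write` and `is_intact` are hypothetical helpers that write a note through a temporary file and record a SHA-256 digest, so that a half-written or later-modified note can be detected before the Report Writer relies on it.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def atomic_write(path: Path, text: str) -> str:
    """Write text so readers never see a partial file; return its SHA-256 digest."""
    path.parent.mkdir(parents=True, exist_ok=True)
    data = text.encode("utf-8")
    # The temporary file lives in the target directory so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_name, path)  # swaps the complete file into place
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return hashlib.sha256(data).hexdigest()


def is_intact(path: Path, expected_digest: str) -> bool:
    """Check that the file on disk still matches the digest recorded at write time."""
    return hashlib.sha256(path.read_bytes()).hexdigest() == expected_digest
```

Storing the returned digest next to each note would give a cheap fidelity check of the kind the referee report asks for, although it only detects byte-level changes and says nothing about whether a note summarises its source correctly.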

What would settle it

An experiment in which increasing the compute budget allocated to the Context Builder produces no improvement or a decline in final report quality because of accumulated retrieval errors or inconsistencies in the file-system knowledge base.
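
A rough sketch of how that test could be scored follows. It computes a Spearman rank correlation between compute budgets and report scores in plain Python. The function names are hypothetical, and any inputs passed to it would be placeholders supplied by whoever runs the experiment; no values from the paper are used here.

```python
from statistics import mean


def average_ranks(values: list[float]) -> list[float]:
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tie group order[i..j]
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs: list[float], ys: list[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5
```

A coefficient near zero or below it, measured over enough budget settings, would be the outcome described above; a rank-based measure is used in this sketch because the relationship need not be linear.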

Figures

Figures reproduced from arXiv: 2602.01566 by Benfeng Xu, Chiwei Zhu, Mingxuan Du, Shaohan Wang, Xiaorui Wang, Yongdong Zhang, Zhendong Mao.

**Figure 1.** Different deep research paradigms: (1) Top: Static pipelines and naive single agents that put raw observations in the context; (2) Middle: Agents whose trajectories are extended by compressing the observations, while still bounded by the hard context limit; (3) Bottom: FS-Researcher, an agent framework built on top of an external file system workspace with unlimited context size.

**Figure 2.** The framework of FS-Researcher. Workflow: FS-Researcher adopts a standard ReAct architecture for each agent, which can be formulated as follows:

$$T_i, A_i = M_\theta(T_{j<i}, A_{j<i}, O_{j<i}, P) \tag{1}$$

$$O_i = \mathrm{Execute}(A_i) \tag{2}$$

$T_i$, $A_i$, and $O_i$ are the thought, action, and observation at the $i$-th step, respectively. $M_\theta$ is the model with parameters $\theta$. $P$ is the prompt (system prompt and user query). $\mathrm{Execute}(A_i)$ is the tool implementati…
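
The ReAct formulation quoted in the Figure 2 caption can be read as a simple loop. The sketch below is one possible rendering, assuming a `model` callable that returns a thought and an action from the history and prompt, and an `execute` callable that runs the action. The `Step` type, the `max_steps` cap, and the `"finish"` stop action are assumptions added for illustration and do not come from the paper.

```python
from typing import Callable, NamedTuple


class Step(NamedTuple):
    thought: str
    action: str
    observation: str


def react_loop(
    model: Callable[[list[Step], str], tuple[str, str]],
    execute: Callable[[str], str],
    prompt: str,
    max_steps: int = 10,
    stop_action: str = "finish",  # assumed sentinel, not from the paper
) -> list[Step]:
    """Run thought/action/observation steps until the model stops or the cap is hit."""
    history: list[Step] = []
    for _ in range(max_steps):
        # Eq. (1): the next thought and action depend on all earlier steps and the prompt.
        thought, action = model(history, prompt)
        if action == stop_action:
            break
        # Eq. (2): the observation is whatever executing the action returns.
        observation = execute(action)
        history.append(Step(thought, action, observation))
    return history
```

In this sketch the whole history is passed back to the model on every step, which is exactly the growth that runs into the context limit; a file-system variant would have `execute` write results to disk and return only a short summary.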

**Figure 3.** Knowledge base example. The deliverables of this agent include one file (index.md) and two directories (knowledge_base/ and sources/). The index.md is like the table of contents of the KB, which contains two parts: (1) the deconstruction of the research topic, and (2) the hierarchical structure of the KB. From the index.md, the agent or human collaborators can get an overview of what the KB is built for a…
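
The layout described in the Figure 3 caption (index.md plus knowledge_base/ and sources/) can be scaffolded in a few lines. The sketch below is illustrative only: the helper names and the exact index format are assumptions, and only the three top-level names come from the caption.

```python
from pathlib import Path


def init_workspace(root: Path, subtopics: list[str]) -> None:
    """Create knowledge_base/<subtopic>/ folders and an empty sources/ directory."""
    for sub in subtopics:
        (root / "knowledge_base" / sub).mkdir(parents=True, exist_ok=True)
    (root / "sources").mkdir(parents=True, exist_ok=True)


def render_tree(directory: Path, depth: int = 0) -> list[str]:
    """List a directory recursively as indented bullet lines, sorted by name."""
    lines = []
    for entry in sorted(directory.iterdir(), key=lambda p: p.name):
        suffix = "/" if entry.is_dir() else ""
        lines.append("  " * depth + f"- {entry.name}{suffix}")
        if entry.is_dir():
            lines.extend(render_tree(entry, depth + 1))
    return lines


def write_index(root: Path, topic: str, questions: list[str]) -> Path:
    """Write index.md with the two parts the caption names: deconstruction and hierarchy."""
    body = [f"# {topic}", "", "## Topic deconstruction"]
    body += [f"- {q}" for q in questions]
    body += ["", "## Knowledge base structure", "- knowledge_base/"]
    body += render_tree(root / "knowledge_base", depth=1)
    index = root / "index.md"
    index.write_text("\n".join(body) + "\n", encoding="utf-8")
    return index
```

Regenerating the hierarchy section from the directory itself, as this sketch does, keeps index.md from drifting out of sync with the files on disk.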

**Figure 4.** Left: KB statistics under 3-10 rounds of context building. The number of characters in the report corresponds to the y-axis on the right. Right: DeepResearch Bench scores of FS-Researcher with 3-10 rounds of context building. … invested in building a higher-quality knowledge base translates into better final reports. The original scores are listed in Appendix E. Interestingly, Readability peaks at 5 rounds (51.9…

**Figure 5.** Tool usage heatmap for the Context Building stage (first three iterations).

**Figure 6.** Tool usage heatmap for the Report Writing stage (first three iterations).

Original abstract

Deep research is emerging as a representative long-horizon task for large language model (LLM) agents. However, long trajectories in deep research often exceed model context limits, compressing token budgets for both evidence collection and report writing, and preventing effective test-time scaling. We introduce FS-Researcher, a file-system-based, dual-agent framework that scales deep research beyond the context window via a persistent workspace. Specifically, a Context Builder agent acts as a librarian which browses the internet, writes structured notes, and archives raw sources into a hierarchical knowledge base that can grow far beyond context length. A Report Writer agent then composes the final report section by section, treating the knowledge base as the source of facts. In this framework, the file system serves as a durable external memory and a shared coordination medium across agents and sessions, enabling iterative refinement beyond the context window. Experiments on two open-ended benchmarks (DeepResearch Bench and DeepConsult) show that FS-Researcher achieves state-of-the-art report quality across different backbone models. Further analyses demonstrate a positive correlation between final report quality and the computation allocated to the Context Builder, validating effective test-time scaling under the file-system paradigm. The code and data are open-sourced at https://github.com/Ignoramus0817/FS-Researcher.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

FS-Researcher gives a practical dual-agent file-system split for long research tasks that shows scaling with builder compute and comes with open code, but the memory reliability claim rests on untested assumptions.

The core contribution is a split where one agent browses, writes structured notes, and archives sources into a growing hierarchical file system while the second agent pulls from those files section by section to build the report. This setup is meant to keep total context manageable on long-horizon tasks.

The experiments on DeepResearch Bench and DeepConsult report better final report quality than prior methods across backbones, plus a positive link between extra compute on the Context Builder and output quality. The code and data release is a clear plus for anyone who wants to run or extend the system themselves. That combination of architecture and open artifacts is what makes the work usable right away in the agent engineering space.

The main weakness is that the file system is presented as reliable shared memory without supporting measurements. No numbers appear on retrieval precision, write consistency, or how often concurrent access creates mismatches. If those errors are common, they could quietly degrade the reported gains. The abstract also leaves the exact quality metrics, full baseline details, and token-budget controls unspecified, so it is hard to tell how much of the improvement comes from the file-system design versus other factors.

This is aimed at people already working on LLM agents for research or consulting workflows. A reader in that group can pull the code and test the correlation claim directly. The paper deserves peer review because the idea is concrete, the benchmarks are public, and the open release lets referees check the implementation themselves rather than take the abstract at face value.

Referee Report

2 major / 2 minor

Summary. The paper introduces FS-Researcher, a dual-agent framework that uses a file system as persistent external memory to enable long-horizon deep research beyond LLM context limits. A Context Builder agent browses the web, writes structured notes, and populates a hierarchical knowledge base; a Report Writer agent then generates the final report section-by-section from this base. The file system serves as durable shared memory and coordination medium. Experiments on DeepResearch Bench and DeepConsult report state-of-the-art report quality across backbone models and a positive correlation between quality and compute allocated to the Context Builder.

Significance. If the empirical claims hold after proper controls and validation, the work demonstrates a practical mechanism for test-time scaling in agentic research via external file-system memory, addressing a key bottleneck in long trajectories. The open-sourcing of code and data is a clear strength that supports reproducibility and follow-up work.

major comments (2)

[Experiments] Experiments section: the abstract states SOTA report quality and a positive correlation with Context Builder compute, yet provides no concrete metrics (e.g., exact scoring rubric or human/AI judge protocol), baseline systems, statistical significance tests, or controls for confounding variables such as total token budget, prompt engineering effort, or number of agent turns.
[Framework] Framework and analysis sections: the central claim that the file system functions as reliable, lossless external memory and coordination medium is load-bearing for both SOTA results and the scaling correlation, but no ablation, retrieval-precision metric, inconsistency rate, or fidelity check on note writing/reading is reported to validate this assumption at scale.

minor comments (2)

[Abstract] Abstract: the description of the two benchmarks could include one sentence on their task distribution and evaluation protocol to help readers assess generalizability.
Consider adding a diagram or pseudocode illustrating the exact read/write protocol between the two agents and the file system to clarify coordination mechanics.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and detailed feedback. The comments highlight important areas for strengthening the experimental rigor and framework validation. We address each major comment below and have revised the manuscript to incorporate the requested details and additional analyses.

Referee: [Experiments] Experiments section: the abstract states SOTA report quality and a positive correlation with Context Builder compute, yet provides no concrete metrics (e.g., exact scoring rubric or human/AI judge protocol), baseline systems, statistical significance tests, or controls for confounding variables such as total token budget, prompt engineering effort, or number of agent turns.

Authors: We agree that the original manuscript would benefit from greater explicitness on these points. In the revised version, we have expanded the Experiments section with: the complete human evaluation rubric and inter-annotator agreement statistics; a table listing all baseline systems with their exact configurations and token budgets; results of statistical significance tests (paired t-tests and Wilcoxon signed-rank tests with reported p-values); and controls that normalize performance by total token usage and include an ablation on prompt-engineering variations to isolate the contribution of the file-system mechanism.

Revision: yes

Referee: [Framework] Framework and analysis sections: the central claim that the file system functions as reliable, lossless external memory and coordination medium is load-bearing for both SOTA results and the scaling correlation, but no ablation, retrieval-precision metric, inconsistency rate, or fidelity check on note writing/reading is reported to validate this assumption at scale.

Authors: The referee is correct that direct quantitative validation of note fidelity was missing. While the observed scaling correlation already provides indirect support, we have added a new analysis subsection containing: an ablation that replaces the hierarchical file system with an in-context memory baseline; retrieval-precision metrics (recall@K and precision@K on key facts extracted from the knowledge base); and a manual audit of inconsistency rates across a sampled subset of notes. These additions confirm that the file system maintains high fidelity at the scales used in our experiments.

Revision: yes
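
For readers unfamiliar with the metrics named in this response, the sketch below shows the usual definitions of precision@K and recall@K. It is a generic illustration, not the evaluation code referred to in the rebuttal; the function names are hypothetical, and the inputs are assumed to be a ranked list of retrieved item identifiers without duplicates and a set of identifiers judged relevant.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved items that are relevant (denominator is k)."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for item in retrieved[:k] if item in relevant)
    return hits / k


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant items that appear among the top-k retrieved items."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant:
        raise ValueError("recall is undefined when there are no relevant items")
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)
```

Applied to a knowledge base, `retrieved` might be the notes the Report Writer opens for a question and `relevant` the notes that actually contain the key fact, though how those sets are labelled is the hard part and is left open here.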

Circularity Check

0 steps flagged

No circularity: empirical benchmark results independent of inputs

The paper introduces FS-Researcher as a dual-agent framework using a file system for persistent memory in long-horizon research tasks. Its central claims—SOTA report quality on DeepResearch Bench and DeepConsult, plus positive correlation between report quality and Context Builder compute—are supported by direct experimental comparisons across backbone models rather than any derivation, equation, or self-citation that reduces to fitted parameters or prior author results by construction. No mathematical ansatz, uniqueness theorem, or renaming of known patterns appears; the file-system coordination is presented as an engineering choice validated through open-sourced code and external benchmarks, with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on standard assumptions about LLM agent reliability and the practicality of file systems as shared memory; no free parameters or invented physical entities are introduced.

axioms (1)

Domain assumption: LLM agents can reliably browse the internet, produce structured notes, and retrieve information from a file system without critical errors.
Invoked in the design of the Context Builder and Report Writer agents.

pith-pipeline@v0.9.0 · 5556 in / 1237 out tokens · 36125 ms · 2026-05-16T08:55:41.401961+00:00 · methodology

