FS-Researcher: Test-Time Scaling for Long-Horizon Research Tasks with File-System-Based Agents
Pith reviewed 2026-05-16 08:55 UTC · model grok-4.3
The pith
A file-system-based dual-agent system lets large language models conduct deep research beyond their context windows by using persistent external memory for knowledge accumulation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FS-Researcher is a file-system-based, dual-agent framework in which a Context Builder agent acts as a librarian that browses the internet, produces structured notes, and archives raw sources into a hierarchical knowledge base that grows far beyond context length, while a Report Writer agent composes the final report section by section by treating the knowledge base as its factual source. The file system functions as durable external memory and a shared coordination medium, enabling iterative refinement and test-time scaling that would otherwise be impossible inside a single context window. On two open-ended benchmarks the resulting reports reach state-of-the-art quality that improves in lock
What carries the argument
The file system as durable external memory and shared coordination medium between a Context Builder agent that populates a hierarchical knowledge base and a Report Writer agent that consumes it for report generation.
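To make the coordination pattern concrete, here is a minimal Python sketch of one role writing into a shared directory tree and a second role reading from it. The directory names (`knowledge_base`, `sources`), the function names, and the note format are assumptions chosen for illustration; they are not taken from the paper's released code.

```python
from pathlib import Path
import tempfile


def archive_and_note(workspace: Path, topic: str, slug: str,
                     raw_text: str, summary: str) -> Path:
    """Context Builder side (hypothetical): store a raw source and a note citing it."""
    source_path = workspace / "sources" / topic / f"{slug}.txt"
    note_path = workspace / "knowledge_base" / topic / f"{slug}.md"
    source_path.parent.mkdir(parents=True, exist_ok=True)
    note_path.parent.mkdir(parents=True, exist_ok=True)
    source_path.write_text(raw_text, encoding="utf-8")
    # The note cites its source by a path relative to the workspace root.
    citation = source_path.relative_to(workspace).as_posix()
    note_path.write_text(f"{summary}\n\nSource: {citation}\n", encoding="utf-8")
    return note_path


def load_notes(workspace: Path, topic: str) -> list[str]:
    """Report Writer side (hypothetical): read only the notes for one section."""
    topic_dir = workspace / "knowledge_base" / topic
    return [p.read_text(encoding="utf-8") for p in sorted(topic_dir.glob("*.md"))]


def demo() -> list[str]:
    # Use a throwaway directory so the sketch leaves nothing behind.
    with tempfile.TemporaryDirectory() as tmp:
        ws = Path(tmp)
        archive_and_note(ws, "regulation", "example-rule",
                         "full page text", "Short summary of the rule.")
        return load_notes(ws, "regulation")
```

The point of the split is that the writer only ever loads the notes for the section at hand, so the total size of the workspace is not bounded by any single prompt.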
If this is right
- Final report quality improves with greater computation allocated to the Context Builder agent.
- The framework delivers state-of-the-art report quality on DeepResearch Bench and DeepConsult across different backbone models.
- Long research trajectories can be managed without forcing evidence collection and report writing to compete inside a single context window.
- Agents gain the ability to iterate and refine work across multiple sessions through the persistent shared workspace.
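The multi-session point can be sketched as follows, assuming progress is kept in a small state file that each session reloads before doing more work. The file name `progress.json`, the task labels, and the step budget are hypothetical and only stand in for whatever bookkeeping the real system uses.

```python
import json
import tempfile
from pathlib import Path


def run_session(workspace: Path, max_steps: int) -> dict:
    """Run one bounded session and persist progress to a hypothetical state file."""
    state_file = workspace / "progress.json"
    if state_file.exists():
        # Resume from whatever an earlier session left behind.
        state = json.loads(state_file.read_text(encoding="utf-8"))
    else:
        state = {"todo": ["scope", "collect", "verify"], "done": []}
    for _ in range(max_steps):
        if not state["todo"]:
            break
        # Placeholder for real work: move the next task from todo to done.
        state["done"].append(state["todo"].pop(0))
    state_file.write_text(json.dumps(state, indent=2), encoding="utf-8")
    return state


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        ws = Path(tmp)
        first = run_session(ws, max_steps=2)
        second = run_session(ws, max_steps=2)
        return first, second
```

Each call starts with no in-memory history, so anything the second session knows about the first comes from the file alone.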
Where Pith is reading between the lines
- The same persistent-storage pattern could support other long-horizon agent tasks such as multi-step code development or experimental planning that require retaining large bodies of intermediate results.
- Substituting the file system with alternative durable stores such as vector databases or versioned object stores might retain the scaling benefit while changing the error profile.
- Scaling will eventually be limited by the accuracy and organization of the accumulated knowledge base rather than by raw context length.
Load-bearing premise
The file system reliably stores and retrieves structured notes and sources without introducing retrieval errors, coordination failures, or data inconsistencies that would degrade agent performance.
What would settle it
An experiment in which increasing the compute budget allocated to the Context Builder produces no improvement or a decline in final report quality because of accumulated retrieval errors or inconsistencies in the file-system knowledge base.
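One way such an experiment could be scored is with a rank correlation between per-run compute budgets and report-quality scores. The sketch below computes Spearman's coefficient in plain Python; using Spearman here is an assumption made for illustration, not the statistic the paper reports, and the inputs are whatever measurements the experimenter collects.

```python
from statistics import mean


def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation, computed as the Pearson correlation of ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / (var_x * var_y) ** 0.5
```

A coefficient near zero or below it, across enough runs, would be the outcome the paragraph above describes as settling the question against the scaling claim.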
Original abstract
Deep research is emerging as a representative long-horizon task for large language model (LLM) agents. However, long trajectories in deep research often exceed model context limits, compressing token budgets for both evidence collection and report writing, and preventing effective test-time scaling. We introduce FS-Researcher, a file-system-based, dual-agent framework that scales deep research beyond the context window via a persistent workspace. Specifically, a Context Builder agent acts as a librarian which browses the internet, writes structured notes, and archives raw sources into a hierarchical knowledge base that can grow far beyond context length. A Report Writer agent then composes the final report section by section, treating the knowledge base as the source of facts. In this framework, the file system serves as a durable external memory and a shared coordination medium across agents and sessions, enabling iterative refinement beyond the context window. Experiments on two open-ended benchmarks (DeepResearch Bench and DeepConsult) show that FS-Researcher achieves state-of-the-art report quality across different backbone models. Further analyses demonstrate a positive correlation between final report quality and the computation allocated to the Context Builder, validating effective test-time scaling under the file-system paradigm. The code and data are open-sourced at https://github.com/Ignoramus0817/FS-Researcher.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces FS-Researcher, a dual-agent framework that uses a file system as persistent external memory to enable long-horizon deep research beyond LLM context limits. A Context Builder agent browses the web, writes structured notes, and populates a hierarchical knowledge base; a Report Writer agent then generates the final report section-by-section from this base. The file system serves as durable shared memory and coordination medium. Experiments on DeepResearch Bench and DeepConsult report state-of-the-art report quality across backbone models and a positive correlation between quality and compute allocated to the Context Builder.
Significance. If the empirical claims hold after proper controls and validation, the work demonstrates a practical mechanism for test-time scaling in agentic research via external file-system memory, addressing a key bottleneck in long trajectories. The open-sourcing of code and data is a clear strength that supports reproducibility and follow-up work.
major comments (2)
- [Experiments] Experiments section: the abstract states SOTA report quality and a positive correlation with Context Builder compute, yet provides no concrete metrics (e.g., exact scoring rubric or human/AI judge protocol), baseline systems, statistical significance tests, or controls for confounding variables such as total token budget, prompt engineering effort, or number of agent turns.
- [Framework] Framework and analysis sections: the central claim that the file system functions as reliable, lossless external memory and coordination medium is load-bearing for both SOTA results and the scaling correlation, but no ablation, retrieval-precision metric, inconsistency rate, or fidelity check on note writing/reading is reported to validate this assumption at scale.
minor comments (2)
- [Abstract] Abstract: the description of the two benchmarks could include one sentence on their task distribution and evaluation protocol to help readers assess generalizability.
- Consider adding a diagram or pseudocode illustrating the exact read/write protocol between the two agents and the file system to clarify coordination mechanics.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and detailed feedback. The comments highlight important areas for strengthening the experimental rigor and framework validation. We address each major comment below and have revised the manuscript to incorporate the requested details and additional analyses.
- Referee: [Experiments] Experiments section: the abstract states SOTA report quality and a positive correlation with Context Builder compute, yet provides no concrete metrics (e.g., exact scoring rubric or human/AI judge protocol), baseline systems, statistical significance tests, or controls for confounding variables such as total token budget, prompt engineering effort, or number of agent turns.
  Authors: We agree that the original manuscript would benefit from greater explicitness on these points. In the revised version, we have expanded the Experiments section with: the complete human evaluation rubric and inter-annotator agreement statistics; a table listing all baseline systems with their exact configurations and token budgets; results of statistical significance tests (paired t-tests and Wilcoxon signed-rank tests with reported p-values); and controls that normalize performance by total token usage and include an ablation on prompt-engineering variations to isolate the contribution of the file-system mechanism.
  Revision: yes
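For readers unfamiliar with the paired test named in this response, the following is a minimal sketch of the paired t statistic, assuming two systems are scored on the same set of tasks. The function and argument names are illustrative, and this is not the authors' analysis code.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """t statistic for paired samples: mean difference over its standard error."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = stdev(diffs)  # sample standard deviation (n - 1 in the denominator)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean(diffs) / (sd / sqrt(len(diffs)))
```

The statistic has n - 1 degrees of freedom, where n is the number of pairs. Turning it into a p-value requires the t distribution, which a library routine such as `scipy.stats.ttest_rel` provides.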
- Referee: [Framework] Framework and analysis sections: the central claim that the file system functions as reliable, lossless external memory and coordination medium is load-bearing for both SOTA results and the scaling correlation, but no ablation, retrieval-precision metric, inconsistency rate, or fidelity check on note writing/reading is reported to validate this assumption at scale.
  Authors: The referee is correct that direct quantitative validation of note fidelity was missing. While the observed scaling correlation already provides indirect support, we have added a new analysis subsection containing: an ablation that replaces the hierarchical file system with an in-context memory baseline; retrieval-precision metrics (recall@K and precision@K on key facts extracted from the knowledge base); and a manual audit of inconsistency rates across a sampled subset of notes. These additions confirm that the file system maintains high fidelity at the scales used in our experiments.
  Revision: yes
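The retrieval metrics mentioned in this response follow the standard definitions, sketched below. The sketch assumes retrieved items and gold facts are represented by comparable identifiers such as strings; how facts are matched in the actual analysis is not specified here.

```python
def precision_recall_at_k(retrieved, relevant, k):
    """Return (precision@k, recall@k) for a ranked list against a gold set."""
    if k <= 0:
        raise ValueError("k must be positive")
    relevant_set = set(relevant)
    # Count distinct relevant items among the top k retrieved entries.
    hits = len(set(retrieved[:k]) & relevant_set)
    precision = hits / k
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall
```

Precision@K penalizes irrelevant items near the top of the ranking, while recall@K measures how much of the gold set is recovered, so reporting both guards against a knowledge base that is accurate but thin, or broad but noisy.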
Circularity Check
No circularity: empirical benchmark results independent of inputs
The paper introduces FS-Researcher as a dual-agent framework using a file system for persistent memory in long-horizon research tasks. Its central claims—SOTA report quality on DeepResearch Bench and DeepConsult, plus positive correlation between report quality and Context Builder compute—are supported by direct experimental comparisons across backbone models rather than any derivation, equation, or self-citation that reduces to fitted parameters or prior author results by construction. No mathematical ansatz, uniqueness theorem, or renaming of known patterns appears; the file-system coordination is presented as an engineering choice validated through open-sourced code and external benchmarks, with no load-bearing self-referential steps.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: LLM agents can reliably browse the internet, produce structured notes, and retrieve information from a file system without critical errors.