pith. machine review for the scientific record.

arxiv: 2605.05258 · v1 · submitted 2026-05-06 · 💻 cs.SE

Recognition: 3 theorem links

PARNESS: A Paper Harness for End-to-End Automated Scientific Research with Dynamic Workflows, Full-Text Indexing, and Cross-Run Knowledge Accumulation

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 18:09 UTC · model grok-4.3

classification 💻 cs.SE
keywords automated scientific research · LLM agents · dynamic workflows · declarative pipelines · full-text indexing · knowledge graphs · scientific workflows · open-source framework

The pith

PARNESS decouples workflow scheduling from domain specifics so any scientific research loop can be expressed as user-editable YAML while indexing full papers, code, and cross-run knowledge.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing autonomous research systems embed fixed control-flow shapes, such as linear pipelines or single-agent loops, directly into their frameworks, which prevents adaptation to discipline-specific patterns like lab experiments, surveys, simulations, or theory. PARNESS counters this rigidity with a thin DAG kernel that treats workflows as declarative YAML files built on a simple four-field agent contract, plus subsystems for full-text PDF indexing of bodies, figures, and tables, code-repository linking, and a knowledge graph that surfaces scenario-typed slices into each LLM call. The design keeps cumulative knowledge persistent and focused rather than forcing everything into one context window. A sympathetic reader would expect this to let automated research pipelines handle varied tasks without repeated framework rewrites and without losing experimental details hidden in paper bodies or repositories.

Core claim

PARNESS is presented as an open-source framework whose four design moves address the five roots of rigidity in prior systems:

  • a thin DAG kernel with a four-field Agent contract that decouples scheduling from domain semantics, so any discipline's loop becomes editable YAML;
  • a full-text PDF-parsing subsystem that indexes paper bodies, figures, and tables, with abstract-only fallback;
  • a knowledge-graph index over papers, ideas, experiments, and code repositories, with scenario-typed retrieval;
  • a small extension surface that lets any modern coding agent add or replace modules.

What carries the argument

Thin DAG kernel with four-field Agent contract that decouples scheduling from domain semantics and turns any discipline workflow into user-editable YAML.
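The paper excerpt never spells out the four fields, but Figure 10 shows nodes carrying `id`, `module`, `depends_on`, and `routes`. A minimal sketch, assuming those field names, of how a thin scheduler could follow declarative routing without any domain knowledge (the module implementations here are invented stand-ins):

```python
from graphlib import TopologicalSorter

# Hypothetical four-field node contract, following the field names
# visible in Figure 10: id, module, depends_on, routes.
PIPELINE = [
    {"id": "extract",  "module": "pdf_extractor",  "depends_on": [],           "routes": {}},
    {"id": "generate", "module": "ideator",        "depends_on": ["extract"],  "routes": {}},
    {"id": "gate",     "module": "quality_scorer", "depends_on": ["generate"],
     "routes": {"continue": "evaluate", "stop": "export"}},
]

# Stand-in agent implementations; real modules would wrap LLM calls.
# Each returns (route_label, updated_state).
MODULES = {
    "pdf_extractor":  lambda state: ("done", state | {"seeds": ["s1", "s2"]}),
    "ideator":        lambda state: ("done", state | {"ideas": len(state["seeds"])}),
    "quality_scorer": lambda state: ("continue" if state["ideas"] >= 2 else "stop", state),
}

def run(pipeline, state=None):
    """Thin kernel: schedule nodes in dependency order, then record the
    route target each node's emitted label selects. No domain logic here."""
    state = state or {}
    order = TopologicalSorter(
        {n["id"]: set(n["depends_on"]) for n in pipeline}
    ).static_order()
    nodes = {n["id"]: n for n in pipeline}
    trace = []
    for node_id in order:
        node = nodes[node_id]
        label, state = MODULES[node["module"]](state)
        trace.append((node_id, label, node["routes"].get(label)))
    return trace, state
```

A fuller runner would enqueue the routed target node rather than merely recording it; the point of the sketch is that the scheduler only reads the four routing fields, never the domain payload.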

If this is right

  • Workflows become dynamic and discipline-specific without any change to the core scheduler.
  • LLM agents receive full paper bodies, figures, tables and linked code repositories instead of summary-only views.
  • Cross-run knowledge accumulates in a retrievable graph and is sliced into each new LLM context.
  • Any modern coding agent can extend or replace modules through the provided extension surface.
  • Paper-to-code links become first-class objects rather than neglected afterthoughts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Porting an existing fixed-shape workflow from another agent into PARNESS YAML would directly test whether the claimed flexibility reduces redesign effort.
  • The scenario-typed retrieval could surface contradictory or cross-domain findings that single-context agents routinely miss.
  • Accumulated run data might later support automated meta-analysis of which workflow patterns succeed across domains.
  • The small extension surface would make it straightforward to plug in newer LLMs or specialized parsers without rewriting the kernel.
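Figure 8 describes each module as a single Python class behind a single contract, registered with a one-line decorator. A hedged sketch of what such a registry could look like (the names `register` and `MODULE_REGISTRY` are illustrative, not taken from the PARNESS codebase):

```python
MODULE_REGISTRY = {}

def register(name):
    """One-line-decorator registration in the style Figure 8 describes:
    adding a module means writing one class plus one decorator line."""
    def wrap(cls):
        MODULE_REGISTRY[name] = cls
        return cls
    return wrap

@register("quality_scorer")
class QualityScorer:
    def __init__(self, config=None):
        self.threshold = (config or {}).get("threshold", 0.5)

    def run(self, inputs):
        # Returns (route_label, passthrough_state), matching the routing idea above.
        score = inputs.get("score", 0.0)
        return ("continue" if score >= self.threshold else "stop", inputs)

# A coding agent replacing a module only re-registers the same name:
@register("quality_scorer")
class StricterScorer(QualityScorer):
    def __init__(self, config=None):
        super().__init__(config)
        self.threshold = max(self.threshold, 0.9)
```

Under this design, re-wiring a pipeline never touches the kernel: the YAML names a module, and the registry resolves whichever class currently owns that name.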

Load-bearing premise

The four-field Agent contract can express every discipline-specific workflow without major custom code, and the knowledge-graph retrieval will surface useful slices without injecting noise that harms LLM performance.
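The premise can be made concrete with a toy typed graph. A sketch, with invented node and edge data, of how scenario presets (similar / opposite / cross-domain, per Figure 7) might compose edge types into a retrieval slice:

```python
# Toy typed knowledge graph as (src, edge_type, dst) triples.
# Edge types loosely follow Figure 7's four families; the data are invented.
EDGES = [
    ("paper:A", "semantic_similar",     "paper:B"),
    ("paper:A", "semantic_contradicts", "paper:C"),
    ("paper:A", "walk_cross_domain",    "paper:D"),
    ("paper:B", "structural_cites",     "paper:E"),
]

# Scenario presets compose edge types, as Figure 7 sketches:
# the Contrarian role would query "opposite", the default RAG "similar".
SCENARIOS = {
    "similar":      {"semantic_similar", "structural_cites"},
    "opposite":     {"semantic_contradicts"},
    "cross_domain": {"walk_cross_domain"},
}

def retrieve(seed, scenario, k=5):
    """Return up to k neighbours of `seed` reachable over the edge
    types the scenario preset allows — the 'slice' fed to one LLM call."""
    allowed = SCENARIOS[scenario]
    return [dst for src, etype, dst in EDGES
            if src == seed and etype in allowed][:k]
```

Whether such slices stay noise-free at realistic corpus sizes is exactly the open question the premise names; the sketch only shows that the preset mechanism itself is a small amount of machinery.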

What would settle it

A concrete test in which a hybrid wet-lab plus simulation workflow cannot be expressed cleanly in the YAML without heavy custom extensions, or in which adding the knowledge-graph retrieval measurably lowers the quality of the LLM agent's output compared with the same task run without retrieval.

Figures

Figures reproduced from arXiv: 2605.05258 by Yuchen Wang, Zhongzhi Luan.

Figure 1. PARNESS as a paper harness. A single thin DAG kernel (centre) drives many different research scenarios as parallel lanes: wet-lab biology, social-science surveys, ML systems benchmarks, and theoretical/simulation studies. Each lane is a chain of pluggable agent modules that the user can swap, extend, or re-wire through ordinary YAML and through any GUI/TUI coding agent. A typed knowledge graph below the ke… view at source ↗

Figure 2. Four concrete PARNESS pipelines for four disciplines, all expressed in the same YAML DSL on the same DAG kernel. (a) An ML benchmark loop iterates idea generation against a quality gate, looping back to ideation if the gate score is too low. (b) A wet-lab biology pipeline holds an idea-discussion round before any experiment (peer-gate), and a statistical gate after replication that can trigger reruns. (c) … view at source ↗

Figure 3. Cross-domain ideation under finite LLM context. (a) A single ideator with a fixed context window can only accommodate a small subsample of the corpus, and even that subsample is read with the well-documented attention bias of lost-in-the-middle [14]. (b) PARNESS separates retrieval from reasoning: the KG indexer (§5.3) holds the full corpus; each cognitive-role agent (§6.6) is wired to a scenario-typed ret… view at source ↗

Figure 4. Reading a paper once is not the same as having it indexed. (a) AI-Scientist [1, 2] and DeepResearch [5] do parse full PDFs, but only episodically: the parsed text is consumed by the current call and then discarded. (b) PaperOrchestra [3], by contrast, only ever sees abstracts. In both regimes there is no long-lived corpus that the next step or the next run can search. (c) PARNESS indexes every parsed body … view at source ↗

Figure 5. The PARNESS paper↔code graph. Every parsed paper that ships a repository emits a typed derivation edge to its code-node; repository nodes are linked to each other by similarity edges. A paper without code (paper C) and a freshly-generated idea both reach the closest sibling repository through cross-paper inspiration edges. Several existing systems touch code per-task (AutoSOTA [4], AI-Scientist [1, 2], aut… view at source ↗

Figure 6. Knowledge accumulation across runs. Each PARNESS pipeline appends to long-lived stores (papers, ideas, hypotheses, evidence, KG triples). The next run begins from this corpus, retrieving only the relevant slice for the current step. The challenge is independent of whether the LLM is invoked once or many times: in both cases only a finite-sized slice of accumulated knowledge can fit in the prompt, and the e… view at source ↗

Figure 7. Scenario-typed retrieval over the PARNESS knowledge graph. The KG holds papers, ideas, experiments and code repositories as typed nodes connected by four edge types (structural / internal / semantic / walk). Retrieval adapters compose those edges into four scenario presets — similar (default RAG), opposite (contradictory results, useful for the Contrarian role), cross-domain (long-range walks, useful for t… view at source ↗

Figure 8. Every PARNESS module is a single Python class behind a single contract, registered with a one-line decorator and described by a YAML node. External coding agents — Claude Code, Cursor, Copilot, OpenCode, Kilo Code — can therefore add a new module, edit an existing module, or re-wire a pipeline by editing one Python file plus one YAML file. The pipeline validator checks the edit before the next run starts. … view at source ↗

Figure 9. The four-layer architecture of PARNESS. L0 (DAG kernel) is the foundation: a thin scheduler plus a four-field contract. L1 (Persistence) keeps state durable across nodes and across runs. L2 (Agents) is the population of LLM workers and tools that do the actual research. L3 (Pipeline) is the user surface: YAML pipelines and entry-point scripts. Configuration flows downward at start time; data and persisted … view at source ↗

Figure 10. Three uses of the four-field Agent contract. The runner makes no domain decisions: it merely follows the routing fields the upstream module emits. depends_on: [extract] input_mapping: seeds: extract.seeds - id: gate module: quality_scorer depends_on: [generate] routes: "continue": evaluate "stop": export config: max_rounds: 100 max_parallel: 0 GraphRunner chooses among three scheduling strategies based on… view at source ↗

Figure 11. (a) PaperOrchestra-style fixed five-step recipe: writing the paper assumes the inputs already exist, the topology is hard-coded, no upstream stages. (b) PARNESS pipeline: composition is data, dynamic fan-out (Connector/Analyst/etc. in parallel), score-gated loops (dashed arrows back to earlier stages), and full life-cycle coverage from crawl to review. Both diagrams represent real shipped pipelines. view at source ↗

Figure 12. Six cognitive-role agents used in PARNESS ideation. The KG retrieval step (left) emits a _routes fan-out so the runner schedules each role in parallel; each role’s prompt is engineered for one orthogonal cognitive demand and is wired to a different scenario-typed retrieval slice (right of each role). The aggregator deduplicates and ranks the seeds before passing them to a downstream gate. None of the fra… view at source ↗

Figure 13. The eight-phase Knowledge-Graph indexing pipeline. LLM phases (orange) handle the open-ended steps — extracting insights, discovering intra-batch relations, semantic edge filtering, and weighted random-walk relation discovery. Deterministic phases (blue) handle dedup, embedding, persistence, and structural-edge replication from SQLite. The pipeline is incremental: a new ingestion batch only re-runs phases… view at source ↗
read the original abstract

Recent autonomous research systems -- AI-Scientist, PaperOrchestra, AutoSOTA, DeepResearch, InternAgent, ResearchAgent and others -- show LLM agents can ideate, run experiments and write papers, but each fixes a particular control-flow shape (linear pipeline, state machine, single-agent loop, or fixed-recipe skill pack) at the framework level. We argue this rigidity has five roots: (1) workflows are dynamic and discipline-specific (lab work, surveys, simulations, theory all loop differently); (2) ideation is bounded by LLM context and cross-domain ideation needs knowledge a single context cannot hold; (3) summary-only views miss the paper body, yet full-text access is uneven, so the cumulative corpus must do the work; (4) a paper's open-source repository is often the only complete specification of its experimental scheme, but the paper-to-code link is neglected; (5) no tool persists cross-run knowledge retrievably into a finite LLM context. We present PARNESS, an open-source framework built on four design moves. (i) A thin DAG kernel with a four-field Agent contract decouples scheduling from domain semantics, so any discipline's loop is expressible as user-editable YAML. (ii) A full-text PDF-parsing and literature-library subsystem indexes paper bodies, figures and tables as typed objects, with graceful abstract-only fall-back. (iii) A knowledge-graph index over papers, ideas, experiments and code repositories, with scenario-typed retrieval (similar / contradictory / cross-domain / counter-intuitive), surfaces a focused slice into each LLM call. (iv) A small extension surface lets any modern coding agent (Claude Code, Cursor, Copilot, OpenCode) add or replace any module. To our knowledge PARNESS is the first open-source system combining declarative pipelines, full-PDF and code-repository indexing, and cross-run knowledge. Source: https://github.com/gtrhythm/PARNESS

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents PARNESS, an open-source framework for end-to-end automated scientific research. It identifies five roots of rigidity in existing LLM-agent systems (fixed control-flow shapes, context bounds on ideation, summary-only views, neglected paper-to-code links, and lack of cross-run knowledge persistence) and proposes four design moves to overcome them: (i) a thin DAG kernel with a four-field Agent contract enabling user-editable YAML declarative pipelines for any discipline-specific loop; (ii) full-text PDF parsing and indexing of bodies, figures, and tables with abstract fallback; (iii) a knowledge-graph over papers/ideas/experiments/code with scenario-typed retrieval (similar/contradictory/cross-domain/counter-intuitive); and (iv) an extension surface for integration with coding agents. The central claim is that PARNESS is the first open-source system combining declarative pipelines, full-PDF and code-repository indexing, and cross-run knowledge accumulation.

Significance. If the design assumptions are validated, PARNESS could offer a flexible, extensible platform that enables more adaptive autonomous research across disciplines by decoupling scheduling from domain semantics and incorporating richer knowledge retrieval, addressing a genuine gap in current systems. The open-source release and emphasis on extensibility are strengths that could facilitate community adoption and further development.

major comments (3)
  1. [Abstract] Abstract and design move (i): The claim that the thin DAG kernel with its four-field Agent contract can express any discipline-specific workflow (including lab work, simulations, theory) without significant limitations is load-bearing for the novelty and rigidity-overcoming assertions, yet the manuscript provides neither the semantics of the four fields, an expressiveness argument, nor worked examples of complex dynamic control such as result-dependent branching or multi-agent coordination.
  2. [Design moves] Design move (iii): The assumption that scenario-typed KG retrieval surfaces useful slices without introducing noise that harms LLM performance is central to addressing root (5) and the overall knowledge-accumulation claim, but lacks any precision/recall analysis, ablation studies, or empirical demonstration of retrieval quality.
  3. [Evaluation] Evaluation section (or lack thereof): The manuscript describes the architecture and motivations but contains no experimental results, benchmarks, user studies, or case studies to show that PARNESS actually improves research outcomes, reduces rigidity, or outperforms existing systems; this undermines the central claim that the design moves are effective.
minor comments (2)
  1. [Abstract] The four fields of the Agent contract are referenced but not explicitly defined or illustrated with YAML examples, which would improve clarity for readers attempting to understand or extend the system.
  2. [Related Work] A comparison table contrasting PARNESS control-flow flexibility against the listed systems (AI-Scientist, PaperOrchestra, etc.) would help ground the five-roots argument.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate the revisions we will make to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract and design move (i): The claim that the thin DAG kernel with its four-field Agent contract can express any discipline-specific workflow (including lab work, simulations, theory) without significant limitations is load-bearing for the novelty and rigidity-overcoming assertions, yet the manuscript provides neither the semantics of the four fields, an expressiveness argument, nor worked examples of complex dynamic control such as result-dependent branching or multi-agent coordination.

    Authors: We agree that the current description of design move (i) would benefit from greater precision. The four-field Agent contract (task specification, dependency declaration, execution interface, and state persistence) is intended to provide a minimal scheduling abstraction that decouples control flow from domain logic, enabling arbitrary workflows via user-editable YAML. In the revised manuscript we will add: (a) explicit semantics for each field, (b) a short expressiveness argument showing how result-dependent branching, loops, and multi-agent handoff are encoded as DAG edges and hooks, and (c) two concrete YAML examples—one for a simulation-based workflow and one for an iterative theoretical derivation loop. These additions will be placed in a new subsection under design move (i). revision: yes

  2. Referee: [Design moves] Design move (iii): The assumption that scenario-typed KG retrieval surfaces useful slices without introducing noise that harms LLM performance is central to addressing root (5) and the overall knowledge-accumulation claim, but lacks any precision/recall analysis, ablation studies, or empirical demonstration of retrieval quality.

    Authors: We accept that empirical characterization of retrieval quality would improve the paper. The scenario-typed retrieval mechanism filters the knowledge graph by query intent (similar, contradictory, cross-domain, counter-intuitive) before injection into the LLM context. While the initial submission focused on architecture, the revised version will include a new subsection under design move (iii) that reports precision and recall on a small set of manually curated test queries drawn from the indexed literature, together with qualitative examples of retrieved slices. We will also note that comprehensive ablation studies remain future work. revision: partial

  3. Referee: [Evaluation] Evaluation section (or lack thereof): The manuscript describes the architecture and motivations but contains no experimental results, benchmarks, user studies, or case studies to show that PARNESS actually improves research outcomes, reduces rigidity, or outperforms existing systems; this undermines the central claim that the design moves are effective.

    Authors: The central claim of the manuscript is that PARNESS is the first open-source system to combine declarative pipelines, full-PDF and code-repository indexing, and cross-run knowledge accumulation; it does not assert empirical superiority in research outcomes. To address the referee’s concern we will (a) revise the abstract and introduction to state the novelty claim more precisely, (b) add a qualitative case-study subsection illustrating a complete research loop executed with PARNESS, and (c) include a dedicated Limitations and Future Work section that explicitly calls for subsequent user studies and comparative benchmarks. These changes will clarify the scope of the present contribution while responding to the request for concrete illustration. revision: yes

Circularity Check

0 steps flagged

No circularity detected; system description with independent design choices

full rationale

The paper presents PARNESS as an open-source framework addressing rigidity in autonomous research systems through four explicit design moves: a thin DAG kernel with four-field Agent contract, full-text PDF indexing, scenario-typed knowledge-graph retrieval, and an extension surface for coding agents. These are introduced as architectural decisions in the abstract and full text, not as predictions, first-principles derivations, or fitted results. No equations, self-definitional reductions, fitted-input predictions, or load-bearing self-citations appear; the novelty claim ('first open-source system combining...') rests on comparison to external systems rather than internal loops. The work is a self-contained implementation description grounded in the released codebase, with no derivation chain that reduces outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on assumptions about LLM capabilities and the effectiveness of the proposed architecture components, with no free parameters as this is a software system rather than a mathematical model.

axioms (2)
  • domain assumption LLM agents can effectively use the four-field Agent contract to perform domain-specific tasks.
    The framework relies on this for the DAG kernel to work across disciplines.
  • domain assumption Full-text PDF parsing provides sufficient structured data for knowledge indexing.
    Assumes the parsing subsystem works reliably.
invented entities (1)
  • PARNESS framework no independent evidence
    purpose: To provide the integrated system for automated research.
    The framework is the contribution itself.

pith-pipeline@v0.9.0 · 5676 in / 1417 out tokens · 81799 ms · 2026-05-08T18:09:11.502608+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Domain mismatch: cs.SE orchestration vs. RS forcing chain (reality_from_one_distinction, J-cost uniqueness, φ derivation). No RS theorem is engaged or contradicted. reality_from_one_distinction unclear

    Relation between the paper passage and the cited Recognition theorem.

    PARNESS demonstrates that the components of an autonomous research system ... compose naturally under a thin DAG kernel with a four-field agent contract.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
