pith. machine review for the scientific record.

arxiv: 2604.17309 · v1 · submitted 2026-04-19 · 💻 cs.AI

Recognition: unknown

Knows: Agent-Native Structured Research Representations

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 06:46 UTC · model grok-4.3

classification 💻 cs.AI
keywords agent-native research · structured research representations · YAML sidecar · LLM agents · research comprehension · PDF alternatives · machine-readable papers

The pith

A lightweight YAML sidecar lets small LLM agents extract accurate research details from papers with far less computation than reading PDFs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Knows as a companion specification that attaches structured claims, evidence, provenance, and relations to existing PDFs in a YAML format agents can read directly. This removes the need for agents to parse long, reader-oriented documents that are expensive and unstable to process at scale. Tests across 140 questions on 20 papers show that models with 0.8B to 2B parameters jump from 19–25% to 47–67% accuracy when given the sidecar instead of the PDF, while consuming 29–86% fewer tokens. A re-scoring by an independent LLM judge finds that weak-model sidecar scores approach those of much stronger models reading PDFs. A public hub already holds sidecars for over ten thousand papers, indicating the format supports real adoption.

Core claim

Knows supplies a thin YAML sidecar (KnowsRecord) that coexists with any PDF and binds machine-readable claims, evidence, provenance and verifiable relations to the paper, letting LLM agents consume task-relevant content directly without the cost and instability of full-document inference.

What carries the argument

The KnowsRecord, a YAML sidecar validated by a deterministic schema linter, which structures the paper's content for direct agent consumption while leaving the original PDF unchanged.
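To make the "deterministic schema linter" idea concrete, here is a minimal sketch in Python. The field names and types below are illustrative assumptions, not the published KnowsRecord schema; the record is shown as a plain dict, i.e. as it would appear after parsing the YAML sidecar.

```python
# Sketch of a deterministic schema check for a hypothetical KnowsRecord.
# Field names below are illustrative assumptions, not the published schema.

REQUIRED_FIELDS = {
    "paper_id": str,     # e.g. an arXiv identifier
    "claims": list,      # structured claims extracted from the paper
    "evidence": list,    # pointers from claims to supporting passages
    "provenance": dict,  # who or what produced this sidecar, and when
    "relations": list,   # typed links to other papers or artifacts
}

def lint_record(record: dict) -> list[str]:
    """Return a deterministic, alphabetically ordered list of violations."""
    errors = []
    for field, expected in sorted(REQUIRED_FIELDS.items()):
        if field not in record:
            errors.append(f"missing required field: {field}")
        elif not isinstance(record[field], expected):
            errors.append(f"{field}: expected {expected.__name__}, "
                          f"got {type(record[field]).__name__}")
    return errors

# A minimal record, as it might look after parsing the YAML sidecar.
record = {
    "paper_id": "2604.17309",
    "claims": [{"id": "c1", "text": "Sidecars cut token usage for agents."}],
    "evidence": [{"claim": "c1", "source": "Section 5"}],
    "provenance": {"created_by": "authors", "created": "2026-04-19"},
    "relations": [],
}

print(lint_record(record))             # a valid record lints clean: []
print(lint_record({"claims": "oops"}))
```

Because the checks iterate over a sorted field list with no randomness or model calls, the same input always yields the same verdict, which is the property the paper attributes to its linter.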

If this is right

  • Weak models can handle research comprehension tasks at accuracy levels previously requiring much larger models.
  • Agent workflows that process many papers become cheaper and more stable because token counts fall sharply.
  • Research can be distributed in dual form: human-readable PDF plus machine-readable sidecar without changing publication practices.
  • Hybrid agent pipelines can fall back to the PDF only when the sidecar lacks needed detail.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If sidecar creation tools improve, the format could become the default machine interface for new papers.
  • The same sidecar approach might apply to other long documents such as patents or technical reports.
  • Automated generation of sidecars from existing PDFs could create a large public dataset for training better research agents.

Load-bearing premise

That accurate structured sidecar content can be created at scale for arbitrary papers without errors or omissions that would mislead agents on real tasks.

What would settle it

Generate sidecars for a fresh set of papers, then measure whether agent accuracy on questions requiring details not explicitly listed in the sidecar drops below the PDF baseline.
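The proposed test reduces to a simple comparison: restrict scoring to questions whose answers are not explicitly listed in the sidecar, then compare accuracy under the two conditions. A minimal scoring sketch, with fabricated per-question correctness values purely for illustration:

```python
# Minimal sketch of the settling experiment: on questions whose answers are
# NOT in the sidecar, does sidecar-only accuracy fall below the PDF baseline?
# The boolean lists are fabricated stand-ins for real agent outputs.

def accuracy(results: list[bool]) -> float:
    return sum(results) / len(results)

# Per-question correctness for out-of-sidecar questions (hypothetical).
sidecar_correct = [False, True, False, False]
pdf_correct     = [True,  True, False, True]

gap = accuracy(sidecar_correct) - accuracy(pdf_correct)
print(f"sidecar-minus-PDF accuracy on out-of-sidecar questions: {gap:+.2f}")
```

A clearly negative gap on a fresh paper set would confirm the predicted failure mode; a gap near zero would suggest the sidecars capture nearly all task-relevant detail.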

Figures

Figures reproduced from arXiv: 2604.17309 by Guangsheng Yu, Xu Wang.

Figure 1: Overview of Knows. A thin YAML sidecar (Knows…
Figure 2: Architecture of the KnowsRecord specification and tooling ecosystem. Left: the KnowsRecord object model…
Figure 3: Agent Retrieval Flow. Rounded rectangles are pro…
Figure 4: Three phases of the agent-native publishing trajectory.
original abstract

Research artifacts are distributed primarily as reader-oriented documents like PDFs. This creates a bottleneck for increasingly agent-assisted and agent-native research workflows, in which LLM agents need to infer fine-grained, task-relevant information from lengthy full documents, a process that is expensive, repetitive, and unstable at scale. We introduce Knows, a lightweight companion specification that binds structured claims, evidence, provenance, and verifiable relations to existing research artifacts in a form LLM agents can consume directly. Knows addresses the gap with a thin YAML sidecar (KnowsRecord) that coexists with the original PDF, requiring no changes to the publication itself, and validated by a deterministic schema linter. We evaluate Knows on 140 comprehension questions across 20 papers spanning 14 academic disciplines, comparing PDF-only, sidecar-only, and hybrid conditions across six LLM agents of varying capacity. Weak models (0.8B–2B parameters) improve from 19–25% to 47–67% accuracy (+29 to +42 percentage points) when reading sidecar instead of PDF, while consuming 29–86% fewer input tokens; an LLM-as-judge re-scoring confirms that weak-model sidecar accuracy (75–77%) approaches stronger-model PDF accuracy (78–83%). Beyond this controlled evaluation, a community sidecar hub at https://knows.academy/ has already indexed over ten thousand publications and continues to grow daily, providing independent evidence that the format is adoption-ready at scale.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Knows, a lightweight YAML sidecar (KnowsRecord) specification that attaches structured claims, evidence, provenance, and verifiable relations to existing research PDFs without modifying the original artifact. It evaluates the format on 140 comprehension questions drawn from 20 papers across 14 disciplines, comparing PDF-only, sidecar-only, and hybrid inputs across six LLM agents of varying sizes. The central empirical result is that weak models (0.8B–2B parameters) improve from 19–25% to 47–67% accuracy (+29 to +42 pp) when using sidecars instead of PDFs while consuming 29–86% fewer tokens; an LLM-as-judge re-scoring shows weak-model sidecar performance approaching stronger-model PDF performance. The paper also reports a community hub at knows.academy that has indexed over 10,000 publications.

Significance. If sidecar creation can be shown to be accurate and scalable without systematic omissions or factual drift, the approach would meaningfully lower the cost and instability of agent-native research workflows. The reported token reductions and accuracy lifts for small models are practically relevant, and the observed community adoption supplies independent evidence of format viability. However, the evaluation's dependence on un-audited sidecar fidelity limits the strength of the claims until that assumption is tested.

major comments (2)
  1. [Evaluation] Evaluation section: the paper provides no description of how the KnowsRecords for the 20-paper, 140-question test set were generated (manual, LLM-assisted, or hybrid), nor any inter-annotator agreement, omission-rate, or factual-drift measurements against the source PDFs. Because the headline accuracy gains (+29 to +42 pp) and token savings are measured only under the assumption that these sidecars faithfully encode all task-relevant content, the absence of such validation is load-bearing for the central empirical claim.
  2. [Community Hub / Discussion] Community hub paragraph: the report of >10k indexed publications is presented as evidence of adoption-readiness, yet no sampling, accuracy audit, or downstream-task verification of the community-contributed KnowsRecords is supplied. Without such data it is impossible to assess whether the format maintains fidelity when applied at scale to arbitrary papers.

minor comments (2)
  1. [Abstract] The abstract states that an LLM-as-judge re-scoring was performed but does not specify the judge model, prompt template, or agreement metric with human judgments; adding these details would improve reproducibility.
  2. [Methods] The YAML schema is said to be validated by a deterministic linter, but the paper does not indicate whether the linter was run on the evaluation sidecars or only on the schema definition itself.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive report. The comments highlight important aspects of reproducibility and scalability that strengthen the manuscript. We address each major comment below, indicating where revisions will be made to incorporate additional methodological details and caveats.

point-by-point responses
  1. Referee: Evaluation section: the paper provides no description of how the KnowsRecords for the 20-paper, 140-question test set were generated (manual, LLM-assisted, or hybrid), nor any inter-annotator agreement, omission-rate, or factual-drift measurements against the source PDFs. Because the headline accuracy gains (+29 to +42 pp) and token savings are measured only under the assumption that these sidecars faithfully encode all task-relevant content, the absence of such validation is load-bearing for the central empirical claim.

    Authors: We agree that explicit documentation of sidecar creation is necessary to support the empirical claims. The 20 KnowsRecords were generated manually by the authors through direct extraction of claims, evidence, and relations from the source PDFs using the Knows specification; no LLM assistance was used for the test set. We will add a dedicated subsection to the Evaluation section describing the extraction protocol, including how completeness was targeted and cross-checked within the team. We did not compute formal inter-annotator agreement metrics because creation was performed by a small expert team with internal review rather than independent annotators. We will explicitly note this limitation and its implications for the results. Factual drift was minimized by restricting content to verbatim or near-verbatim extractions without summarization or external inference; we will add a short statement to this effect. revision: yes

  2. Referee: Community hub paragraph: the report of >10k indexed publications is presented as evidence of adoption-readiness, yet no sampling, accuracy audit, or downstream-task verification of the community-contributed KnowsRecords is supplied. Without such data it is impossible to assess whether the format maintains fidelity when applied at scale to arbitrary papers.

    Authors: We accept that the community adoption figure alone does not constitute a fidelity audit. The >10,000 records are contributed by users through the public hub and validated only by the deterministic schema linter; no systematic sampling or downstream-task verification has been performed by the authors. We will revise the relevant paragraph in the Discussion to present the hub statistics strictly as evidence of format interest and uptake rather than proven scalability of content quality. We will also add an explicit statement that large-scale fidelity audits remain future work and invite community contributions toward that goal. revision: partial

Circularity Check

0 steps flagged

No significant circularity; claims rest on independent empirical evaluation

full rationale

The paper presents an empirical study measuring LLM accuracy and token usage across PDF-only, sidecar-only, and hybrid conditions on 140 questions from 20 papers. No mathematical derivations, fitted parameters renamed as predictions, or self-citations appear in the provided text. The deterministic schema linter validates format only and does not enter the performance measurements. Community adoption at knows.academy is cited as external evidence of format uptake, not as justification for the accuracy deltas. The evaluation therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The paper introduces a new specification and an empirical demonstration; it does not rely on fitted numerical parameters or unstated mathematical axioms beyond standard assumptions about document parsing and LLM token usage.

axioms (1)
  • domain assumption: LLM agents can extract task-relevant information more reliably from structured YAML than from raw PDF text
    This premise underpins the entire evaluation design and accuracy comparison.
invented entities (1)
  • KnowsRecord: no independent evidence
    purpose: Lightweight YAML sidecar that binds structured claims, evidence, provenance, and relations to a research PDF
    New format defined by the paper; no independent falsifiable evidence outside the specification itself is provided.

pith-pipeline@v0.9.0 · 5556 in / 1348 out tokens · 57045 ms · 2026-05-10T06:46:32.357968+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

16 extracted references · 9 canonical work pages · 3 internal anchors

  1. [1]

    Agents4Science: The first open conference of AI agents, co-led by Stanford and BroadAI,

    Stanford University and BroadAI, “Agents4Science: The first open conference of AI agents, co-led by Stanford and BroadAI,” https://agents4science.stanford.edu/, 2025, conference where AI agents are the primary authors, reviewers, and presenters of research contributions

  2. [2]

    The anatomy of a nanopublication,

    P. Groth, A. Gibson, and J. Velterop, “The anatomy of a nanopublication,” Information Services and Use, vol. 30, no. 1-2, pp. 51–56, 2010

  3. [3]

    Open research knowledge graph: Next generation infrastructure for semantic scholarly knowledge,

    M. Y. Jaradeh, A. Oelen, K. E. Farfar, M. Prinz, J. D’Souza, G. Kismihók, M. Stocker, and S. Auer, “Open research knowledge graph: Next generation infrastructure for semantic scholarly knowledge,” in Proceedings of the 10th International Conference on Knowledge Capture (K-CAP). ACM, 2019, pp. 243–246

  4. [4]

    Paper2agent: Reimagining research papers as interactive and reliable AI agents

    J. Miao, J. R. Davis, Y. Zhang, J. K. Pritchard, and J. Zou, “Paper2agent: Reimagining research papers as interactive and reliable AI agents,” arXiv preprint arXiv:2509.06917, 2025. [Online]. Available: https://arxiv.org/abs/2509.06917

  5. [5]

    Agentic publications: redesigning scientific publishing in the age of thinking large language models

    R. Pugliese, G. Kourousias, F. Venier, and G. Garlatti Costa, “Agentic publications: An LLM-driven framework for interactive scientific publishing, supplementing traditional papers with AI-powered knowledge systems,” arXiv preprint arXiv:2505.13246, 2025. [Online]. Available: https://arxiv.org/abs/2505.13246

  6. [6]

    Agentrxiv: Towards collaborative autonomous research,

    S. Schmidgall and M. Moor, “Agentrxiv: Towards collaborative autonomous research,” arXiv preprint arXiv:2503.18102, 2025. [Online]. Available: https://arxiv.org/abs/2503.18102

  7. [7]

    The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

    C. Lu, C. Lu, R. T. Lange, J. Foerster, J. Clune, and D. Ha, “The AI scientist: Towards fully automated open-ended scientific discovery,” arXiv preprint arXiv:2408.06292, 2024

  8. [8]

    Internagent: When agent becomes the scientist — building closed-loop system from hypothesis to verification, arXiv preprint arXiv:2505.16938, 2025

    NovelSeek Team, “NovelSeek: When agent becomes the scientist — building closed-loop system from hypothesis to verification,” arXiv preprint arXiv:2505.16938, 2025

  9. [9]

    Dolphin: Closed-loop open-ended auto-research through thinking, practice, and feedback, arXiv preprint arXiv:2501.03916, 2025

    J. Yuan, X. Yan, B. Shi et al., “Dolphin: Closed-loop open-ended auto-research through thinking, practice, and feedback,” arXiv preprint arXiv:2501.03916, 2025

  10. [10]

    CodeScientist: End-to-end semi-automated scientific discovery with code-based experimentation,

    Allen Institute for AI, “CodeScientist: End-to-end semi-automated scientific discovery with code-based experimentation,” https://github.com/allenai/codescientist, 2024, GitHub repository

  11. [11]

    Data-to-paper: AI-driven research and documentation,

    Kishony Lab, “Data-to-paper: AI-driven research and documentation,” https://github.com/Technion-Kishony-lab/data-to-paper, 2024, GitHub repository

  12. [12]

    EvoScientist: Evolutionary scientific discovery platform,

    EvoScientist Team, “EvoScientist: Evolutionary scientific discovery platform,” https://github.com/EvoScientist/EvoScientist, 2025, GitHub repository

  13. [13]

    Aigs: Generating science from AI-powered automated falsification, arXiv preprint arXiv:2411.11910, 2024

    Baby-AIGS Team, “Toward automated scientific discovery: A survey on artificial intelligence generated science,” arXiv preprint arXiv:2411.11910, 2024

  14. [14]

    ResearchClawBench: A benchmark for autonomous research agents,

    InternScience Team, “ResearchClawBench: A benchmark for autonomous research agents,” https://github.com/InternScience/ResearchClawBench, 2025, benchmark for autonomous research agents

  15. [15]

    SoK: Agentic Skills – Beyond Tool Use in LLM Agents

    Y. Jiang, D. Li, H. Deng, B. Ma, X. Wang, Q. Wang, and G. Yu, “SoK: Agentic skills – beyond tool use in LLM agents,” 2026. [Online]. Available: https://arxiv.org/abs/2602.20867

  16. [16]

    Clawed and dangerous: Can we trust open agentic systems?

    S. Chen, Q. Wang, G. Yu, X. Wang, and L. Zhu, “Clawed and dangerous: Can we trust open agentic systems?” 2026. [Online]. Available: https://arxiv.org/abs/2603.26221
    S. Chen, Q. Wang, G. Yu, X. Wang, and L. Zhu, “Clawed and dangerous: Can we trust open agentic systems?” 2026. [Online]. Available: https://arxiv.org/ abs/2603.26221 APPENDIXA SCHEMAREFERENCESUMMARY The complete JSON Schema v0.9 is released along- side the specification at https://knows.academy/; this ap- pendix summarizes the root-level structure and pro...