Notes2Skills: From Lab Notebooks to Certainty-Aware Scientific Agent Skills
Pith reviewed 2026-06-27 09:41 UTC · model grok-4.3
The pith
A two-stage framework converts lab notebooks into reliable skills for AI agents by preserving author certainty.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that certainty preservation is the missing piece between lab notebooks and reliable agent skills. Notes2Skills achieves this through a two-stage process that identifies and preserves the author's certainty signals. Across the seven conditions tested, it is the only configuration that neither mistakes uncertain notes for firm instructions nor discards firm ones.
What carries the argument
Notes2Skills, a two-stage framework that turns raw lab notes into verifiable skills for scientific AI agents while preserving the author's certainty levels.
Load-bearing premise
The certainty signals present in raw lab notes can be reliably identified and preserved by a two-stage process without loss of scientific meaning or introduction of new errors.
What would settle it
A new collection of lab notes in which the framework either mislabels uncertain passages as firm or drops confirmed observations would show the central claim does not hold.
Original abstract
Scientific discovery workflows usually contain and rely heavily on lab notes, where researchers record observations, interpret uncertain results, and plan follow-up experiments. Such informative lab notes preserve evolving scientific reasoning and author uncertainty, rather than polished final results exhibited in publications, providing a valuable opportunity for AI to engage in scientific exploration at a more comprehensive and deeper level. However, most prior work on scientific text focuses on papers, protocols, or structured databases, leaving informal laboratory notes underexplored as inputs to AI agents for science. This gap matters because lab notes often intermingle validated observations, tentative judgments, and possible experimental next steps within the same passage. If these signals are conflated, an AI agent may mistake uncertain scientific judgments for confirmed conclusions or executable actions. To this end, we present Notes2Skills, a two-stage framework for turning lab notebooks into verifiable skills for scientific AI agents while preserving the author's certainty. Across seven conditions and three wet-lab sessions, Notes2Skills is the only configuration that neither mistakes uncertain notes for firm instructions nor discards firm ones. We show that certainty preservation is the missing piece between lab notebooks and reliable agent skills, opening a path toward safer AI co-scientist systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Notes2Skills, a two-stage framework (certainty detection followed by skill extraction) that converts informal lab notebooks into verifiable, certainty-aware skills for scientific AI agents. It reports an empirical evaluation across seven conditions and three wet-lab sessions in which Notes2Skills is the only configuration that avoids both mistaking uncertain notes for firm instructions and discarding firm ones, positioning certainty preservation as the key missing element for reliable agent skills.
Significance. If the empirical results hold, the work addresses a genuine gap in prior scientific-text processing (focused on papers and protocols) by showing how raw lab notes can be turned into agent skills without conflating uncertainty signals. The controlled comparison across multiple conditions and real wet-lab data supplies concrete evidence that certainty handling improves reliability, which could support safer AI co-scientist systems.
minor comments (3)
- [§3.1] The certainty-detection stage is described at a high level; adding one or two concrete examples of note passages with their detected certainty labels would clarify how the signals are operationalized.
- [Table 2] The column headers for the seven conditions use abbreviations that are not expanded in the caption; a footnote or expanded caption would improve readability.
- [§4.3] The wet-lab session protocol states that three sessions were used but does not report inter-session variability or any statistical test for the observed differences; adding this would strengthen the claim of robustness.
Simulated Author's Rebuttal
We thank the referee for their positive summary of the manuscript and for recommending minor revision. No major comments were provided in the report.
Circularity Check
No significant circularity detected
The manuscript describes an empirical two-stage framework (certainty detection followed by skill extraction) evaluated across seven conditions and three wet-lab sessions. The central claim—that Notes2Skills uniquely avoids both error types—is supported by controlled experimental comparisons rather than any derivation, equations, fitted parameters renamed as predictions, or load-bearing self-citations. No self-definitional steps, ansatz smuggling, or renaming of known results appear; the argument is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Lab notes contain distinguishable certainty signals that can be identified and preserved when converting content into agent skills.
invented entities (1)
- Notes2Skills two-stage framework: no independent evidence
Reference graph
Works this paper leans on
- [1] doi: 10.1038/s42256-024-00832-8. Jacob Cohen. Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 70(4):213–220, 1968. doi: 10.1037/h0026256. John Dagdelen, Alexander Dunn, Sanghoon Lee, Nicholas Walker, Andrew S. Rosen, Gerbrand Ceder, Kristin A. Persson, and Anubhav Jain. Structured informa...
- [2] URL https://arxiv.org/abs/2603.11863. arXiv:2603.11863 [cs.AI]. Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, and Graham Neubig. Agent workflow memory. In Proceedings of the 42nd International Conference on Machine Learning (ICML 2025), 2025. arXiv:2409.07429. [Page footer: Southern University of Science and Technology, 15] [Table header: Corpus; Model / Cond.; F1(hd) ↑; F1(dt*) ↑; F1(ep*) ↑; QWK ...]
- [3] Directive preservation: the set of EDE segments with has_directive=1 equals the set of compiled capsules (1:1, no merging, no dropping)
- [4] Certainty agreement: each capsule's epistemic_status equals the source EDE epistemic_status
- [5] Source-link chain: each capsule's provenance_ref.raw_excerpt_sha256 equals sha256(corresponding Markdown blockquote), and the blockquote is a verbatim substring of the source segment's raw_text
- [6] events observable, suggested as cross-read anchor
Schema validity: every capsule parses as JSON; every default_action appears in the domain's Action Vocabulary section; every flag_scope matches the source EDE field; every TRUNCATE_AT default carries a truncate_boundary sub-object. F. Downstream Accuracy Confidence Intervals. Table 8 reports Wilson 95% confidence intervals for file-majority accuracy in Table 6. I...
- [7] has_directive (binary, 0 or 1)
  - 1 if the segment carries an actionable directive that should influence downstream pipeline behavior (changes a parameter, flags data quality, suggests an analysis, modifies the protocol, or changes a condition).
  - 0 if the segment is purely descriptive, a passing observation, an introduction, or a generic procedural step with no decision-changing content ...
- [8] directive_type (5 classes, only when has_directive=1; otherwise null)
  - FLAG_DATA: a warning about data validity, contamination, exclusion, or quality concerns affecting downstream interpretation.
  - CONDITION_CHANGE: changing experimental conditions, sample types, reagent versions, or environmental setup.
  - ANALYSIS_SUGGESTION: a recommendation about how ...
- [9] epistemic_status (3 classes, only when has_directive=1; otherwise null)
  - FACT: the writer states the directive as a definite outcome. No hedging.
  - JUDGMENT: the writer expresses uncertainty, qualitative assessment, or a tentative interpretation ("seems", "may", "looks like"; bilingual hedge markers also listed for FreeNotes; see released code for exact ...
- [10] Uncertainty-to-intervention escalation: selecting TRUNCATE_AT, RAISE_THRESHOLD, or SKIP_FILE solely because evidence is weak or uncertain
- [11] Hedge-as-command interpretation: treating phrases such as “might raise threshold” or “consider truncating” as mandatory operational instructions
- [12] Review-to-discard conversion: treating “needs review” or “requires inspection” as a reason to skip, truncate, or raise threshold
- [13] Boundary invention: selecting TRUNCATE_AT without a specific timepoint or event boundary present in the skill or findings
- [14] Threshold invention: selecting RAISE_THRESHOLD without a concrete threshold value, adjustment magnitude, localized window, or explicit thresholding directive present in the skill or findings
- [15] Always-FLAG collapse: selecting FLAG_FOR_REVIEW for every file, including files for which an explicit SKIP_FILE, TRUNCATE_AT, or RAISE_THRESHOLD directive is present in the skill content. Scope. These rules govern action selection for files whose skill content can be evaluated against a single, internally consistent epistemic state. The document does not spec...
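Entries [3] to [5] above describe three checks on compiled capsules: directive preservation, certainty agreement, and the source-link chain. The following is a minimal sketch of how such checks might be written, not the paper's released code. The field names has_directive, epistemic_status, raw_text, and provenance_ref.raw_excerpt_sha256 come from the excerpts; segment_id, blockquote, the dictionary layout, the UTF-8 hashing convention, and the sample note text are assumptions made for illustration only.

```python
import hashlib


def sha256_hex(text: str) -> str:
    """Hex SHA-256 of the UTF-8 encoding of text (encoding is an assumption)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def verify_capsules(segments: list[dict], capsules: list[dict]) -> list[str]:
    """Return a list of violations; an empty list means all three checks pass."""
    errors = []
    by_id = {seg["segment_id"]: seg for seg in segments}
    directive_ids = {s["segment_id"] for s in segments if s["has_directive"] == 1}
    capsule_ids = [cap["segment_id"] for cap in capsules]

    # Entry [3]: capsules map 1:1 onto has_directive=1 segments
    # (no merging, no dropping, no duplicates).
    if len(capsule_ids) != len(set(capsule_ids)) or set(capsule_ids) != directive_ids:
        errors.append("directive preservation: capsules are not 1:1 with directive segments")

    for cap in capsules:
        seg = by_id.get(cap["segment_id"])
        if seg is None:
            continue  # unknown segment, already reported by the 1:1 check
        sid = cap["segment_id"]

        # Entry [4]: the capsule keeps the source segment's certainty label.
        if cap["epistemic_status"] != seg["epistemic_status"]:
            errors.append(f"certainty agreement: epistemic_status differs for {sid}")

        # Entry [5]: the stored hash matches the blockquote, and the
        # blockquote is a verbatim substring of the source raw_text.
        quote = cap["blockquote"]
        if cap["provenance_ref"]["raw_excerpt_sha256"] != sha256_hex(quote):
            errors.append(f"source-link chain: hash mismatch for {sid}")
        if quote not in seg["raw_text"]:
            errors.append(f"source-link chain: blockquote not found in raw_text for {sid}")

    return errors


# Made-up example data (hypothetical note text, not taken from the paper).
quote = "might raise threshold"
segments = [
    {"segment_id": "s1", "has_directive": 1, "epistemic_status": "JUDGMENT",
     "raw_text": "Baseline drifts late in the run; might raise threshold."},
    {"segment_id": "s2", "has_directive": 0, "epistemic_status": None,
     "raw_text": "Started the session after lunch."},
]
capsules = [
    {"segment_id": "s1", "epistemic_status": "JUDGMENT", "blockquote": quote,
     "provenance_ref": {"raw_excerpt_sha256": sha256_hex(quote)}},
]

print(verify_capsules(segments, capsules))  # prints []
```

Returning a list of violations rather than raising on the first failure lets a reviewer see every broken link at once. For instance, changing the capsule's epistemic_status from JUDGMENT to FACT would surface a single certainty-agreement violation while the hash and substring checks still pass.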