Notes2Skills: From Lab Notebooks to Certainty-Aware Scientific Agent Skills

Chengwei Qin; Jiayao Chen; Jufan Zhang; Linyi Yang; Shi Liu; Yanqing Hu

arxiv: 2606.11897 · v1 · pith:YAOMO27Tnew · submitted 2026-06-10 · 💻 cs.CL

Shi Liu, Jiayao Chen, Chengwei Qin, Yanqing Hu, Jufan Zhang, Linyi Yang

Pith reviewed 2026-06-27 09:41 UTC · model grok-4.3

classification 💻 cs.CL

keywords lab notebooks · scientific AI agents · certainty preservation · wet-lab experiments · two-stage framework · AI for science · uncertainty signals · agent skills

The pith

A two-stage framework converts lab notebooks into reliable skills for AI agents by preserving author certainty.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Lab notebooks record observations along with tentative judgments and planned next steps, each carrying different degrees of author certainty. Most AI work on scientific text uses only final papers or structured data, leaving these informal notes unused. Notes2Skills applies a two-stage process to extract skills while keeping the original certainty signals intact. In tests spanning seven conditions and three wet-lab sessions, it is the only approach that avoids treating uncertain notes as firm instructions and also avoids discarding confirmed ones. The work positions certainty preservation as the key requirement for turning raw lab notes into usable agent capabilities.

Core claim

The paper claims that certainty preservation is the missing piece between lab notebooks and reliable agent skills. Notes2Skills achieves this through a two-stage process that identifies and maintains the author's uncertainty signals, making it the only tested configuration that neither mistakes uncertain notes for firm instructions nor discards firm ones.

What carries the argument

Notes2Skills, a two-stage framework that turns raw lab notes into verifiable skills for scientific AI agents while preserving the author's certainty levels.
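To make the two-stage idea concrete, here is a minimal sketch, not the authors' implementation. It assumes stage one labels each note segment with an epistemic status and stage two compiles a skill capsule that carries that label forward unchanged. The labels `FACT` and `JUDGMENT` and the hedge markers "seems", "may", and "looks like" come from the schema excerpt in the reference graph; the class names, function names, and sample notes are hypothetical, and the schema's third status class is omitted because the excerpt truncates it.

```python
import re
from dataclasses import dataclass

# Hedge markers quoted in the schema excerpt; a real lexicon would be larger.
HEDGE_PATTERN = re.compile(r"\b(seems|may|looks like)\b", re.IGNORECASE)


@dataclass(frozen=True)
class Segment:
    raw_text: str
    epistemic_status: str  # "FACT" or "JUDGMENT" in this sketch


@dataclass(frozen=True)
class Capsule:
    instruction: str
    epistemic_status: str
    source_text: str


def detect_certainty(raw_text: str) -> Segment:
    """Stage 1 (assumed): label a note segment before any extraction."""
    hedged = HEDGE_PATTERN.search(raw_text) is not None
    return Segment(raw_text, "JUDGMENT" if hedged else "FACT")


def compile_capsule(segment: Segment) -> Capsule:
    """Stage 2 (assumed): build a capsule that carries the label forward."""
    return Capsule(
        instruction=segment.raw_text.strip(),
        epistemic_status=segment.epistemic_status,
        source_text=segment.raw_text,
    )


# Hypothetical sample notes, for illustration only.
notes = [
    "The reading dropped sharply after five minutes.",
    "The second read seems unreliable.",
]
capsules = [compile_capsule(detect_certainty(n)) for n in notes]
for c in capsules:
    print(c.epistemic_status, "|", c.instruction)
```

Keyword matching is only a stand-in for whatever detector the paper uses; the point of the sketch is that the label is assigned before extraction and never rewritten afterwards.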

Load-bearing premise

The certainty signals present in raw lab notes can be reliably identified and preserved by a two-stage process without loss of scientific meaning or introduction of new errors.

What would settle it

A new collection of lab notes in which the framework either mislabels uncertain passages as firm or drops confirmed observations would show the central claim does not hold.
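The test described above can be phrased as two counts over a labelled note collection. The sketch below is an illustration under assumed data shapes, not the paper's evaluation code: `gold` maps hypothetical segment ids to the annotated status, `output` maps the same ids to the status the framework emitted, and a missing id means the segment was dropped.

```python
def certainty_errors(
    gold: dict[str, str], output: dict[str, str]
) -> tuple[list[str], list[str]]:
    """Return (uncertain notes promoted to firm, firm notes dropped)."""
    promoted = [
        sid for sid, status in gold.items()
        if status == "JUDGMENT" and output.get(sid) == "FACT"
    ]
    dropped = [
        sid for sid, status in gold.items()
        if status == "FACT" and sid not in output
    ]
    return promoted, dropped


# Hypothetical annotations: n2 is hedged, n1 and n3 are confirmed.
gold = {"n1": "FACT", "n2": "JUDGMENT", "n3": "FACT"}
output = {"n1": "FACT", "n2": "FACT"}  # n2 promoted, n3 missing

promoted, dropped = certainty_errors(gold, output)
print(promoted, dropped)
```

Either list being non-empty on a fresh collection would count against the central claim as it is stated here.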

Figures

Figures reproduced from arXiv: 2606.11897 by Chengwei Qin, Jiayao Chen, Jufan Zhang, Linyi Yang, Shi Liu, Yanqing Hu.

**Figure 1.** Three eras of procedural text extraction.

**Figure 2.** Three notebook genres carry the same epistemic mixture (factual observation, hedged judgment, and forward-looking suggestion) but express it in distinct surface registers. Blue marks the judgmental hedge; red marks the data-flagging fact; the last line in each panel is the forward-looking suggestion.

**Figure 3.** Overview of the Notes2Skills pipeline. Our contributions have been highlighted with the yellow star.

**Figure 4.** Downstream validation setting. Notebook context links past experiments to later data-handling decisions.

**Figure 5.** Closing the loop: a case study on the wet experiment.

Original abstract

Scientific discovery workflows usually contain and rely heavily on lab notes, where researchers record observations, interpret uncertain results, and plan follow-up experiments. Such informative lab notes preserve evolving scientific reasoning and author uncertainty, rather than polished final results exhibited in publications, providing a valuable opportunity for AI to engage in scientific exploration at a more comprehensive and deeper level. However, most prior work on scientific text focuses on papers, protocols, or structured databases, leaving informal laboratory notes underexplored as inputs to AI agents for science. This gap matters because lab notes often intermingle validated observations, tentative judgments, and possible experimental next steps within the same passage. If these signals are conflated, an AI agent may mistake uncertain scientific judgments for confirmed conclusions or executable actions. To this end, we present Notes2Skills, a two-stage framework for turning lab notebooks into verifiable skills for scientific AI agents while preserving the author's certainty. Across seven conditions and three wet-lab sessions, Notes2Skills is the only configuration that neither mistakes uncertain notes for firm instructions nor discards firm ones. We show that certainty preservation is the missing piece between lab notebooks and reliable agent skills, opening a path toward safer AI co-scientist systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Notes2Skills shows a workable two-stage way to turn raw lab notes into agent skills while keeping author uncertainty, with tests in actual wet-lab sessions.

The main point is that Notes2Skills uses a two-stage process to pull skills from lab notebooks without flattening uncertain judgments into firm instructions or dropping solid observations. The tests across seven conditions in three real wet-lab sessions indicate it is the only setup among those compared that avoids both kinds of error.

What is new is the focus on informal notes rather than published papers or structured protocols. The paper explains why mixing tentative reasoning with confirmed results creates problems for AI agents in actual experiments, and it supplies a method that tries to separate those signals before skill extraction.

The work does a solid job describing the stages and grounding the evaluation in live lab sessions instead of purely synthetic cases. That gives the central claim some concrete backing that goes beyond the abstract.

The soft spots are the small scale of the testing—three sessions is limited even with multiple conditions—and the lack of detail on how certainty detection was validated across different note styles or labs. Those are real constraints on how far the results can be generalized right now.

This paper is for people building AI agents meant to work inside experimental workflows. Readers who care about uncertainty handling in scientific agents will get the most from it. The idea is practical enough and the evaluation real enough that it deserves a serious referee rather than a desk reject.

Referee Report

0 major / 3 minor

Summary. The paper introduces Notes2Skills, a two-stage framework (certainty detection followed by skill extraction) that converts informal lab notebooks into verifiable, certainty-aware skills for scientific AI agents. It reports an empirical evaluation across seven conditions and three wet-lab sessions in which Notes2Skills is the only configuration that avoids both mistaking uncertain notes for firm instructions and discarding firm ones, positioning certainty preservation as the key missing element for reliable agent skills.

Significance. If the empirical results hold, the work addresses a genuine gap in prior scientific-text processing (focused on papers and protocols) by showing how raw lab notes can be turned into agent skills without conflating uncertainty signals. The controlled comparison across multiple conditions and real wet-lab data supplies concrete evidence that certainty handling improves reliability, which could support safer AI co-scientist systems.

minor comments (3)

[§3.1] The certainty-detection stage is described at a high level; adding one or two concrete examples of note passages with their detected certainty labels would clarify how the signals are operationalized.
[Table 2] The column headers for the seven conditions use abbreviations that are not expanded in the caption; a footnote or expanded caption would improve readability.
[§4.3] The wet-lab session protocol states that three sessions were used but does not report inter-session variability or any statistical test for the observed differences; adding this would strengthen the claim of robustness.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of the manuscript and for recommending minor revision. No major comments were provided in the report.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The manuscript describes an empirical two-stage framework (certainty detection followed by skill extraction) evaluated across seven conditions and three wet-lab sessions. The central claim—that Notes2Skills uniquely avoids both error types—is supported by controlled experimental comparisons rather than any derivation, equations, fitted parameters renamed as predictions, or load-bearing self-citations. No self-definitional steps, ansatz smuggling, or renaming of known results appear; the argument is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The framework rests on the domain assumption that lab notes contain extractable certainty signals that can be turned into agent skills without conflating uncertainty levels. No free parameters or invented physical entities are described.

axioms (1)

domain assumption: Lab notes contain distinguishable certainty signals that can be identified and preserved when converting content into agent skills.
The entire value proposition of Notes2Skills depends on this being true; without it the two-stage process cannot avoid the stated failure modes.

invented entities (1)

Notes2Skills two-stage framework · no independent evidence
purpose: To convert lab notebooks into verifiable, certainty-aware skills for scientific AI agents.
New framework introduced by the paper as the solution to the identified gap.

Reference graph

Works this paper leans on

15 extracted references · 1 canonical work page

[1]

Jacob Cohen

doi: 10.1038/s42256-024-00832-8. Jacob Cohen. Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 70(4):213–220, 1968. doi: 10.1037/h0026256. John Dagdelen, Alexander Dunn, Sanghoon Lee, Nicholas Walker, Andrew S. Rosen, Gerbrand Ceder, Kristin A. Persson, and Anubhav Jain. Structured informa...

work page doi:10.1038/s42256-024-00832-8 1968
[2]

both-pos𝑛

URL https://arxiv.org/abs/2603.11863. arXiv:2603.11863 [cs.AI]. Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, and Graham Neubig. Agent workflow memory. In Proceedings of the 42nd International Conference on Machine Learning (ICML 2025), 2025. arXiv:2409.07429. Southern University of Science and Technology 15 Corpus Model / Cond.𝐹 hd 1 ↑𝐹 dt∗ 1 ↑𝐹 ep∗ 1 ↑QWK ...

Pith/arXiv arXiv 2025
[3]

Directive preservation: the set of EDE segments with has_directive=1 equals the set of compiled capsules (1:1, no merging, no dropping)
[4]

Certainty agreement: each capsule’s epistemic_status equals the source EDE epistemic_status
[5]

Source-link chain: each capsule’s provenance_ref.raw_excerpt_sha256 equals sha256(corresponding Markdown blockquote), and the blockquote is a verbatim substring of the source segment’s raw_text
[6]

events observable, suggested as cross-read anchor

Schema validity: every capsule parses as JSON; every default_action appears in the domain’s Action Vocabulary section; every flag_scope matches the source EDE field; every TRUNCATE_AT default carries a truncate_boundary sub-object. F Downstream Accuracy Confidence Intervals Table 8 reports Wilson 95% confidence intervals for file-majority accuracy in Table 6. I...

2000
[7]

- 0 if the segment is purely descriptive, a passing observation, an introduction, or a generic procedural step with no decision-changing content

has_directive (binary, 0 or 1) - 1 if the segment carries an actionable directive that should influence downstream pipeline behavior (changes a parameter, flags data quality, suggests an analysis, modifies the protocol, or changes a condition). - 0 if the segment is purely descriptive, a passing observation, an introduction, or a generic procedural step w...
[8]

- CONDITION_CHANGE: changing experimental conditions, sample types, reagent versions, or environmental setup

directive_type (5 classes, only when has_directive=1; otherwise null) - FLAG_DATA: a warning about data validity, contamination, exclusion, or quality concerns affecting downstream interpretation. - CONDITION_CHANGE: changing experimental conditions, sample types, reagent versions, or environmental setup. - ANALYSIS_SUGGESTION: a recommendation about how ...
[9]

seems",

epistemic_status (3 classes, only when has_directive=1; otherwise null) - FACT: the writer states the directive as a definite outcome. No hedging. - JUDGMENT: the writer expresses uncertainty, qualitative assessment, or a tentative interpretation ("seems", "may", "looks like"; bilingual hedge markers also listed for FreeNotes; see released code for exact ...
[10]

Uncertainty-to-intervention escalation: selecting TRUNCATE_AT, RAISE_THRESHOLD, or SKIP_FILE solely because evidence is weak or uncertain
[11]

might raise threshold

Hedge-as-command interpretation: treating phrases such as “might raise threshold” or “consider truncating” as mandatory operational instructions
[12]

needs review

Review-to-discard conversion: treating “needs review” or “requires inspection” as a reason to skip, truncate, or raise threshold
[13]

Boundary invention: selecting TRUNCATE_AT without a specific timepoint or event boundary present in the skill or findings
[14]

Threshold invention: selecting RAISE_THRESHOLD without a concrete threshold value, adjustment magnitude, localized window, or explicit thresholding directive present in the skill or findings
[15]

Scope. These rules govern action selection for files whose skill content can be evaluated against a single, internally consistent epistemic state

Always-FLAG collapse: selecting FLAG_FOR_REVIEW for every file, including files for which an explicit SKIP_FILE, TRUNCATE_AT, or RAISE_THRESHOLD directive is present in the skill content. Scope. These rules govern action selection for files whose skill content can be evaluated against a single, internally consistent epistemic state. The document does not spec...
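
The first three consistency checks quoted in entries [3] to [5] (directive preservation, certainty agreement, and the source-link chain) can be sketched as a small validator. This is an illustration under assumptions rather than the released code: the field names `has_directive`, `epistemic_status`, `raw_text`, and `provenance_ref.raw_excerpt_sha256` come from the excerpts, while `id`, `segment_id`, `blockquote`, the dict layout, and hashing the quoted text as UTF-8 without any Markdown `>` prefix are all hypothetical choices.

```python
import hashlib


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def validate(segments: list[dict], capsules: list[dict]) -> list[str]:
    """Return a list of violated checks; an empty list means all passed."""
    problems = []
    directive_ids = {s["id"] for s in segments if s["has_directive"] == 1}
    capsule_ids = [c["segment_id"] for c in capsules]

    # Directive preservation: 1:1, no merging, no dropping.
    if set(capsule_ids) != directive_ids or len(capsule_ids) != len(set(capsule_ids)):
        problems.append("directive preservation")

    by_id = {s["id"]: s for s in segments}
    for c in capsules:
        src = by_id.get(c["segment_id"])
        if src is None:
            continue  # unknown source, already reported above
        # Certainty agreement: the label must survive compilation unchanged.
        if c["epistemic_status"] != src["epistemic_status"]:
            problems.append(f"certainty agreement: {c['segment_id']}")
        # Source-link chain: hash matches and the quote is verbatim.
        quote = c["blockquote"]
        hash_ok = c["provenance_ref"]["raw_excerpt_sha256"] == sha256_hex(quote)
        if not hash_ok or quote not in src["raw_text"]:
            problems.append(f"source-link chain: {c['segment_id']}")
    return problems


# Hypothetical data for illustration.
segments = [
    {"id": "s1", "has_directive": 1, "epistemic_status": "JUDGMENT",
     "raw_text": "Signal seems noisy after 5 min; might raise threshold."},
    {"id": "s2", "has_directive": 0, "epistemic_status": None,
     "raw_text": "Started run at 10:00."},
]
good = {
    "segment_id": "s1",
    "epistemic_status": "JUDGMENT",
    "blockquote": "might raise threshold",
    "provenance_ref": {"raw_excerpt_sha256": sha256_hex("might raise threshold")},
}
bad = dict(good, epistemic_status="FACT")

print(validate(segments, [good]))
print(validate(segments, [bad]))
```

The schema-validity check in entry [6] is left out because it depends on an action vocabulary that the excerpt does not list.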
