arxiv: 2602.20669 · v2 · submitted 2026-02-24 · ⚛️ physics.app-ph · cond-mat.mtrl-sci

Recognition: 2 theorem links

Integrating Domain-Specialized Language Models with AI Measurement Tools for Deterministic Atomic-Resolution Experimentation

Zhuo Diao, Kouma Matsumoto, Linfeng Hou, Masahiro Ohara, Hayato Yamashita, Masayuki Abe

Pith reviewed 2026-05-15 20:03 UTC · model grok-4.3

classification ⚛️ physics.app-ph cond-mat.mtrl-sci

keywords scanning probe microscopy · language models · atomic resolution · self-driving laboratories · deterministic control · domain adaptation · fine-tuning · AI measurement tools

The pith

Fine-tuned small language models achieve deterministic real-time atomic-resolution scanning probe microscopy experiments at room temperature.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that specializing small language models through fine-tuning and pairing them with AI measurement tools enables instruction-level control and multi-step planning for atomic-resolution experiments. This setup works at room temperature while enforcing deterministic execution on consumer hardware, cutting computational costs compared to general large models. A reader would care because it turns probabilistic AI outputs into reliable physical control under strict precision constraints, opening a route to automated labs that do not need massive computing resources.

Core claim

By fine-tuning small language models for scanning probe microscopy tasks and integrating them with AI-driven measurement tools, the authors achieve real-time atomic-resolution experiments at room temperature with instruction-level control and multi-step experimental planning. The adapted models reduce perplexity from 1.44 to 1.20, reach command accuracies of 99.3% and 95.2%, and outperform OpenAI o4-mini on domain-specific tasks while maintaining lower computational cost and deterministic behavior suitable for consumer-grade hardware.
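To put the perplexity figures in more familiar terms: perplexity is conventionally the exponential of the mean per-token negative log-likelihood, so the reported values can be converted back to an average cross-entropy. The sketch below assumes that standard definition (the summary does not state the evaluation corpus or tokenizer), and the helper name is illustrative rather than anything from the paper.

```python
import math


def cross_entropy_from_perplexity(ppl: float) -> float:
    """Mean per-token negative log-likelihood (nats) implied by a perplexity,
    assuming the standard definition PPL = exp(mean NLL)."""
    if ppl < 1.0:
        raise ValueError("perplexity cannot be below 1")
    return math.log(ppl)


base_nll = cross_entropy_from_perplexity(1.44)   # reported value before fine-tuning
tuned_nll = cross_entropy_from_perplexity(1.20)  # reported value after fine-tuning

print(f"base:  {base_nll:.4f} nats/token")
print(f"tuned: {tuned_nll:.4f} nats/token")
print(f"ratio: {tuned_nll / base_nll:.3f}")
```

Because 1.44 = 1.2², the reported drop corresponds to halving the mean per-token cross-entropy under this definition.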

What carries the argument

A modular architecture that specializes small language models for SPM control by coordinating task-specific models with AI measurement tools to enforce deterministic execution.

Load-bearing premise

Fine-tuned small language models can reliably coordinate with AI measurement tools to enforce deterministic execution under the strict physical constraints of room-temperature atomic-resolution SPM without introducing control errors or requiring extensive post-hoc adjustments.

What would settle it

A multi-step SPM procedure in which the fine-tuned model issues a command sequence that produces non-atomic-resolution outcomes or requires manual correction to complete the experiment.

Figures

Figures reproduced from arXiv: 2602.20669 by Hayato Yamashita, Kouma Matsumoto, Linfeng Hou, Masahiro Ohara, Masayuki Abe, Zhuo Diao.

Original abstract

Self-driving laboratories based on large language models promise to transform scientific discovery through general experimental automation. However, realizing this vision on precision platforms remains challenging, requiring deterministic execution and effective domain adaptation under strict physical constraints. We address these requirements through a framework that specializes in small language models for autonomous control of scanning probe microscopy, coordinating task-specific models with AI-driven measurement tools. We demonstrate real-time, atomic-resolution SPM experiments at room temperature, achieving instruction-level control and multi-step experimental planning. Fine-tuning reduces perplexity from 1.44 to 1.20 and improves reliability, with the adapted model reaching 99.3% and 95.2% command accuracy, outperforming OpenAI o4-mini on domain-specific tasks. This architecture achieves lower computational cost while maintaining deterministic execution and enabling deployment on consumer-grade hardware. This work bridges probabilistic language models with deterministic experimental control through a modular, domain-specialized architecture, providing a generalizable pathway toward scalable and trustworthy self-driving laboratories across diverse scientific platforms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Small fine-tuned models deliver high command accuracy for SPM control on cheap hardware, but the determinism under real physical noise still needs concrete validation steps.

The main point is that this paper shows a working modular setup where fine-tuned small language models handle instruction-level commands and multi-step planning for atomic-resolution scanning probe microscopy at room temperature. They cut perplexity from 1.44 to 1.20 and hit 99.3% and 95.2% command accuracy on domain tasks, beating OpenAI o4-mini while running on consumer hardware. That combination of specialization and lower compute cost is the practical advance over generic self-driving lab approaches.

Referee Report

2 major / 1 minor

Summary. The manuscript presents a framework that integrates domain-specialized small language models with AI measurement tools to enable deterministic control of scanning probe microscopy (SPM). It claims real-time atomic-resolution experiments at room temperature with instruction-level control and multi-step planning. Fine-tuning reduces perplexity from 1.44 to 1.20, yielding 99.3% and 95.2% command accuracy while outperforming OpenAI o4-mini on domain tasks, at lower computational cost and with deterministic execution on consumer hardware.

Significance. If the determinism and reliability under physical constraints are substantiated, the work would be significant for self-driving laboratories in precision instrumentation. The modular use of small models for efficiency and the reported outperformance on domain tasks provide practical strengths. It offers a potential pathway for trustworthy automation across platforms, but the translation of accuracy metrics to error-free hardware trajectories requires explicit validation.
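One way to see why per-command accuracy does not translate directly into error-free multi-step runs is to compound it over a sequence. The sketch below is a back-of-the-envelope model, not the paper's analysis: it assumes command errors are independent and that no validator or retry intervenes, and the step counts are hypothetical. The summary also does not say which tasks the 99.3% and 95.2% figures refer to.

```python
def sequence_success_probability(per_command_accuracy: float, n_steps: int) -> float:
    """Probability that every command in an n_steps sequence is correct,
    assuming independent errors and no validation or retry."""
    if not 0.0 <= per_command_accuracy <= 1.0:
        raise ValueError("accuracy must lie in [0, 1]")
    if n_steps < 0:
        raise ValueError("n_steps must be non-negative")
    return per_command_accuracy ** n_steps


# Reported per-command accuracies; step counts are illustrative only.
for acc in (0.993, 0.952):
    for n in (1, 5, 10, 20):
        p = sequence_success_probability(acc, n)
        print(f"accuracy={acc:.3f}  steps={n:2d}  P(all correct)={p:.3f}")
```

Under these assumptions, a ten-step sequence succeeds end to end with probability of about 0.93 at 99.3% per-command accuracy and about 0.61 at 95.2%, which is why an explicit validation layer matters for the determinism claim.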

major comments (2)

[Abstract] Abstract: The central determinism claim rests on 99.3% and 95.2% command accuracy, yet no experimental protocol, validation dataset size, trial count, error bars, or verification against physical constraints (thermal drift, piezo hysteresis, tip-sample forces) is supplied, leaving the load-bearing guarantee under-supported.
[Abstract] Abstract and architecture description: The coordination of the fine-tuned model with AI measurement tools is asserted to enforce deterministic execution, but the text does not specify an explicit validator, rejection loop, or recovery protocol that would prevent residual probabilistic errors from producing control failures in multi-step room-temperature SPM runs.
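The request for trial counts and error bars can be made concrete with a confidence interval on the reported accuracy. The sketch below uses the Wilson score interval for a binomial proportion; the trial counts are hypothetical, since the summary does not report them, and the function name is illustrative.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p_hat = successes / trials
    denom = 1.0 + z**2 / trials
    centre = (p_hat + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denom
    return centre - half, centre + half


# Hypothetical trial counts: the same 99.3% point estimate is far less
# informative at n=150 than at n=10000.
for n in (150, 1000, 10000):
    k = round(0.993 * n)
    lo, hi = wilson_interval(k, n)
    print(f"n={n:5d}  correct={k:5d}  95% CI=({lo:.4f}, {hi:.4f})")
```

The interval narrows as the number of trials grows, so the same headline accuracy supports very different reliability claims depending on the evaluation set size.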

minor comments (1)

[Abstract] Abstract: The perplexity reduction (1.44 to 1.20) is reported without stating the evaluation corpus or baseline model details, which would aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed report. We agree that the determinism claims require more explicit supporting details on validation protocols and error-handling mechanisms. We will revise the manuscript accordingly to strengthen these aspects.

Referee: [Abstract] Abstract: The central determinism claim rests on 99.3% and 95.2% command accuracy, yet no experimental protocol, validation dataset size, trial count, error bars, or verification against physical constraints (thermal drift, piezo hysteresis, tip-sample forces) is supplied, leaving the load-bearing guarantee under-supported.

Authors: We acknowledge that the abstract does not currently include these validation specifics. In the revised manuscript we will expand the abstract to reference the experimental protocol, including dataset size, trial counts, error bars from repeated runs, and explicit verification steps against physical constraints such as thermal drift and piezo hysteresis. These details will be drawn from the full experimental results already obtained and will be elaborated in the Methods and Results sections to better substantiate the determinism guarantee.

revision: yes

Referee: [Abstract] Abstract and architecture description: The coordination of the fine-tuned model with AI measurement tools is asserted to enforce deterministic execution, but the text does not specify an explicit validator, rejection loop, or recovery protocol that would prevent residual probabilistic errors from producing control failures in multi-step room-temperature SPM runs.

Authors: We agree that the current description of the coordination mechanism is insufficiently explicit on error mitigation. We will revise the architecture section to describe the validator module that enforces physical constraints, the rejection loop for low-confidence outputs, and the recovery protocol that triggers safe-state fallback or replanning. These additions will clarify how the modular integration converts probabilistic model outputs into deterministic hardware trajectories and will be supported by a revised schematic figure.

revision: yes
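The three mechanisms named in this response (a validator, a rejection loop, and a safe-state fallback) can be sketched in a few lines. Everything below is hypothetical and for illustration only: the command names, the numeric limits, the confidence threshold, and the idea that the model exposes a confidence score are assumptions, not details taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Command:
    name: str          # hypothetical command name, e.g. "set_bias"
    value: float       # commanded value in the instrument's units
    confidence: float  # model-reported confidence in [0, 1] (assumed available)


# Hypothetical allowed ranges per command; real limits are instrument-specific.
LIMITS = {"set_bias": (-2.0, 2.0), "move_tip": (-50.0, 50.0)}

# Hypothetical safe state used when no proposal passes validation.
SAFE_STATE = Command("retract_tip", 0.0, 1.0)


def validate(cmd: Command, min_confidence: float = 0.9) -> bool:
    """Accept only known commands within limits and above the confidence threshold."""
    limits = LIMITS.get(cmd.name)
    if limits is None:
        return False
    low, high = limits
    return low <= cmd.value <= high and cmd.confidence >= min_confidence


def next_command(propose: Callable[[], Command], max_attempts: int = 3) -> Command:
    """Rejection loop: ask for proposals until one validates, else fall back."""
    for _ in range(max_attempts):
        cmd = propose()
        if validate(cmd):
            return cmd
    return SAFE_STATE


# Usage: the first proposal is out of range, the second is low-confidence,
# the third passes validation.
proposals = iter([
    Command("set_bias", 9.0, 0.99),
    Command("set_bias", 1.0, 0.50),
    Command("set_bias", 1.0, 0.97),
])
chosen = next_command(lambda: next(proposals))
print(chosen)
```

The design point is that the hardware only ever receives a command that passed a deterministic check, or the predefined safe state. The model's sampling stays probabilistic, but the set of signals that can reach the instrument does not.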

Circularity Check

0 steps flagged

No circularity: claims rest on reported experimental metrics without self-referential derivations

The manuscript reports empirical outcomes from fine-tuning small language models on domain-specific SPM data, including measured perplexity reduction (1.44 to 1.20) and command accuracies (99.3% / 95.2%). These are presented as direct results of adaptation and coordination with AI measurement tools, not as quantities derived from or equivalent to the inputs by construction. No equations, ansatzes, uniqueness theorems, or self-citations are shown to carry the central determinism claim; the architecture is described modularly with experimental validation at room temperature. This satisfies the default non-circular expectation for an applied experimental paper whose core assertions are falsifiable via hardware trajectories rather than reducing to fitted parameters renamed as predictions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework depends on the domain assumption that language models can be specialized via fine-tuning to produce reliable control signals for physical instruments without additional ad-hoc parameters beyond standard training procedures.

axioms (1)

domain assumption: Small language models can be fine-tuned on domain data to achieve high command accuracy and deterministic behavior in scientific instrument control.
This is the core premise enabling the reported performance gains and real-time operation.

pith-pipeline@v0.9.0 · 5499 in / 1212 out tokens · 27900 ms · 2026-05-15T20:03:50.795927+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Passage: Fine-tuning reduces perplexity from 1.44 to 1.20 and improves reliability, with the adapted model reaching 99.3% and 95.2% command accuracy

IndisputableMonolith/Foundation/Atomicity.lean · atomic_tick
Relation between the paper passage and the cited Recognition theorem: unclear
Passage: Text parser … validates command completeness and correctness before issuing control signals

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Acknowledgments

Funding: This work was supported by Grants-in-Aid for Scientific Research (24K21716, 25K17654) from the Ministry of Education, Culture, Sports, Science and Technology of Japan. A part of MA work is supported by JKA and its promotion ...