pith. machine review for the scientific record.

arxiv: 2602.20669 · v2 · submitted 2026-02-24 · ⚛️ physics.app-ph · cond-mat.mtrl-sci

Recognition: 2 theorem links

· Lean Theorem

Integrating Domain-Specialized Language Models with AI Measurement Tools for Deterministic Atomic-Resolution Experimentation

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 20:03 UTC · model grok-4.3

classification ⚛️ physics.app-ph cond-mat.mtrl-sci
keywords scanning probe microscopy · language models · atomic resolution · self-driving laboratories · deterministic control · domain adaptation · fine-tuning · AI measurement tools

The pith

Fine-tuned small language models achieve deterministic real-time atomic-resolution scanning probe microscopy experiments at room temperature.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that specializing small language models through fine-tuning and pairing them with AI measurement tools enables instruction-level control and multi-step planning for atomic-resolution experiments. This setup works at room temperature while enforcing deterministic execution on consumer hardware, cutting computational costs compared to general large models. A reader would care because it turns probabilistic AI outputs into reliable physical control under strict precision constraints, opening a route to automated labs that do not need massive computing resources.

Core claim

By fine-tuning small language models for scanning probe microscopy tasks and integrating them with AI-driven measurement tools, the authors achieve real-time atomic-resolution experiments at room temperature with instruction-level control and multi-step experimental planning. The adapted models reduce perplexity from 1.44 to 1.20, reach command accuracies of 99.3% and 95.2%, and outperform OpenAI o4-mini on domain-specific tasks while maintaining lower computational cost and deterministic behavior suitable for consumer-grade hardware.
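The perplexity figures have a direct operational reading: perplexity is the exponential of the average per-token negative log-likelihood, so the reported drop from 1.44 to 1.20 corresponds to roughly halving the average per-token loss on the evaluation corpus. A minimal sketch with illustrative numbers (not the paper's actual corpus or model):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-mean log-probability) over the evaluated tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Illustrative values only: an average per-token log-probability of
# -0.365 nats gives perplexity ~1.44; improving it to -0.182 nats
# gives ~1.20, the size of the fine-tuning gain reported above.
print(round(perplexity([-0.365] * 100), 2))  # 1.44
print(round(perplexity([-0.182] * 100), 2))  # 1.2
```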

What carries the argument

A modular architecture that specializes small language models for SPM control by coordinating task-specific models with AI measurement tools to enforce deterministic execution.
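The coordination pattern described here, a small router model mapping instructions onto a fixed set of task-specific tools, can be sketched as whitelist dispatch: only schema-known commands ever reach the instrument. All names below are illustrative assumptions, not the paper's actual API:

```python
from typing import Callable

# Hypothetical command builders standing in for the task-specific models
# and AI measurement tools; the real system's interfaces are not shown here.
HANDLERS: dict[str, Callable[[dict], str]] = {
    "scan": lambda p: f"SCAN size_nm={p['size_nm']}",
    "move_tip": lambda p: f"MOVE x_nm={p['x_nm']} y_nm={p['y_nm']}",
}

def dispatch(intent: str, params: dict) -> str:
    """Deterministic hand-off: reject anything outside the whitelist, so a
    probabilistic router can never emit an unconstrained hardware command."""
    if intent not in HANDLERS:
        raise ValueError(f"unknown intent: {intent!r}")
    return HANDLERS[intent](params)

print(dispatch("scan", {"size_nm": 10}))  # SCAN size_nm=10
```

The design point is that the language model only chooses *which* vetted tool runs; the command string itself is built by deterministic code.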

Load-bearing premise

Fine-tuned small language models can reliably coordinate with AI measurement tools to enforce deterministic execution under the strict physical constraints of room-temperature atomic-resolution SPM without introducing control errors or requiring extensive post-hoc adjustments.
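One standard ingredient behind "deterministic behavior" from a language model is greedy (argmax) decoding with fixed weights: the same prompt always yields the same token sequence. Whether the paper relies on exactly this mechanism is an assumption; the sketch below only illustrates the general technique with a toy logit function:

```python
def greedy_decode(logits_fn, prompt, max_len=8, eos=0):
    """Argmax decoding: with fixed weights and no sampling, the same prompt
    reproduces the same command tokens on every run."""
    tokens = list(prompt)
    for _ in range(max_len):
        logits = logits_fn(tokens)
        nxt = max(range(len(logits)), key=lambda i: logits[i])
        if nxt == eos:
            break
        tokens.append(nxt)
    return tokens

# Toy stand-in for a model: always prefers (last token + 1) mod 5.
def toy_logits(tokens):
    v = [0.0] * 5
    v[(tokens[-1] + 1) % 5] = 1.0
    return v

print(greedy_decode(toy_logits, [1]))  # [1, 2, 3, 4]
```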

What would settle it

A multi-step SPM procedure in which the fine-tuned model issues a command sequence that produces non-atomic-resolution outcomes or requires manual correction to complete the experiment.

Figures

Figures reproduced from arXiv: 2602.20669 by Hayato Yamashita, Kouma Matsumoto, Linfeng Hou, Masahiro Ohara, Masayuki Abe, Zhuo Diao.

Figure 4. For the router SLM evaluation, we use 642 samples from … [figure: figures/full_fig_p009_4.png]
read the original abstract

Self-driving laboratories based on large language models promise to transform scientific discovery through general experimental automation. However, realizing this vision on precision platforms remains challenging, requiring deterministic execution and effective domain adaptation under strict physical constraints. We address these requirements through a framework that specializes small language models for autonomous control of scanning probe microscopy, coordinating task-specific models with AI-driven measurement tools. We demonstrate real-time, atomic-resolution SPM experiments at room temperature, achieving instruction-level control and multi-step experimental planning. Fine-tuning reduces perplexity from 1.44 to 1.20 and improves reliability, with the adapted model reaching 99.3% and 95.2% command accuracy, outperforming OpenAI o4-mini on domain-specific tasks. This architecture achieves lower computational cost while maintaining deterministic execution and enabling deployment on consumer-grade hardware. This work bridges probabilistic language models with deterministic experimental control through a modular, domain-specialized architecture, providing a generalizable pathway toward scalable and trustworthy self-driving laboratories across diverse scientific platforms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents a framework that integrates domain-specialized small language models with AI measurement tools to enable deterministic control of scanning probe microscopy (SPM). It claims real-time atomic-resolution experiments at room temperature with instruction-level control and multi-step planning. Fine-tuning reduces perplexity from 1.44 to 1.20, yielding 99.3% and 95.2% command accuracy while outperforming OpenAI o4-mini on domain tasks, at lower computational cost and with deterministic execution on consumer hardware.

Significance. If the determinism and reliability under physical constraints are substantiated, the work would be significant for self-driving laboratories in precision instrumentation. The modular use of small models for efficiency and the reported outperformance on domain tasks provide practical strengths. It offers a potential pathway for trustworthy automation across platforms, but the translation of accuracy metrics to error-free hardware trajectories requires explicit validation.

major comments (2)
  1. [Abstract] Abstract: The central determinism claim rests on 99.3% and 95.2% command accuracy, yet no experimental protocol, validation dataset size, trial count, error bars, or verification against physical constraints (thermal drift, piezo hysteresis, tip-sample forces) is supplied, leaving the load-bearing guarantee under-supported.
  2. [Abstract] Abstract and architecture description: The coordination of the fine-tuned model with AI measurement tools is asserted to enforce deterministic execution, but the text does not specify an explicit validator, rejection loop, or recovery protocol that would prevent residual probabilistic errors from producing control failures in multi-step room-temperature SPM runs.
minor comments (1)
  1. [Abstract] Abstract: The perplexity reduction (1.44 to 1.20) is reported without stating the evaluation corpus or baseline model details, which would aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed report. We agree that the determinism claims require more explicit supporting details on validation protocols and error-handling mechanisms. We will revise the manuscript accordingly to strengthen these aspects.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central determinism claim rests on 99.3% and 95.2% command accuracy, yet no experimental protocol, validation dataset size, trial count, error bars, or verification against physical constraints (thermal drift, piezo hysteresis, tip-sample forces) is supplied, leaving the load-bearing guarantee under-supported.

    Authors: We acknowledge that the abstract does not currently include these validation specifics. In the revised manuscript we will expand the abstract to reference the experimental protocol, including dataset size, trial counts, error bars from repeated runs, and explicit verification steps against physical constraints such as thermal drift and piezo hysteresis. These details will be drawn from the full experimental results already obtained and will be elaborated in the Methods and Results sections to better substantiate the determinism guarantee. revision: yes

  2. Referee: [Abstract] Abstract and architecture description: The coordination of the fine-tuned model with AI measurement tools is asserted to enforce deterministic execution, but the text does not specify an explicit validator, rejection loop, or recovery protocol that would prevent residual probabilistic errors from producing control failures in multi-step room-temperature SPM runs.

    Authors: We agree that the current description of the coordination mechanism is insufficiently explicit on error mitigation. We will revise the architecture section to describe the validator module that enforces physical constraints, the rejection loop for low-confidence outputs, and the recovery protocol that triggers safe-state fallback or replanning. These additions will clarify how the modular integration converts probabilistic model outputs into deterministic hardware trajectories and will be supported by a revised schematic figure. revision: yes
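The validator / rejection-loop / safe-state mechanism the rebuttal promises can be summarized in a few lines. This is a sketch of the general pattern under stated assumptions (the 100 nm range limit and command shape are invented for illustration), not the authors' actual implementation:

```python
def execute_with_validation(generate, validate, safe_state, max_retries=3):
    """Rejection loop: re-query the model until a candidate command passes
    the physical-constraint validator; if retries are exhausted, fall back
    to a safe state instead of sending an invalid command to hardware."""
    for _ in range(max_retries):
        candidate = generate()
        if validate(candidate):
            return candidate
    return safe_state

# Toy demonstration: the first "model output" violates a range constraint
# (a tip move far beyond the scanner's reach) and is rejected.
outputs = iter([{"x_nm": 1e6}, {"x_nm": 25.0}])
result = execute_with_validation(
    generate=lambda: next(outputs),
    validate=lambda cmd: abs(cmd["x_nm"]) <= 100.0,  # assumed 100 nm limit
    safe_state={"x_nm": 0.0},
)
print(result)  # {'x_nm': 25.0}
```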

Circularity Check

0 steps flagged

No circularity: claims rest on reported experimental metrics without self-referential derivations

full rationale

The manuscript reports empirical outcomes from fine-tuning small language models on domain-specific SPM data, including measured perplexity reduction (1.44 to 1.20) and command accuracies (99.3% / 95.2%). These are presented as direct results of adaptation and coordination with AI measurement tools, not as quantities derived from or equivalent to the inputs by construction. No equations, ansatzes, uniqueness theorems, or self-citations are shown to load-bear the central determinism claim; the architecture is described modularly with experimental validation at room temperature. This satisfies the default non-circular expectation for an applied experimental paper whose core assertions are falsifiable via hardware trajectories rather than reducing to fitted parameters renamed as predictions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework depends on the domain assumption that language models can be specialized via fine-tuning to produce reliable control signals for physical instruments without additional ad-hoc parameters beyond standard training procedures.

axioms (1)
  • domain assumption Small language models can be fine-tuned on domain data to achieve high command accuracy and deterministic behavior in scientific instrument control.
    This is the core premise enabling the reported performance gains and real-time operation.

pith-pipeline@v0.9.0 · 5499 in / 1212 out tokens · 27900 ms · 2026-05-15T20:03:50.795927+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 7 internal anchors

  1. [1]

    Liu, et al

Y. Liu, et al., Autonomous scanning probe microscopy with hypothesis learning: Exploring the physics of domain switching in ferroelectric materials. Patterns 4 (3), 100704 (2023), doi:10.1016/j.patter.2023.100704, https://www.sciencedirect.com/science/article/pii/S2666389923000417

  2. [2]

    Pratiush, H

U. Pratiush, H. Funakubo, R. Vasudevan, S. V. Kalinin, Y. Liu, Scientific exploration with expert knowledge (SEEK) in autonomous scanning probe microscopy with active learning. Digital Discovery 4, 252–263 (2025), doi:10.1039/D4DD00277F, http://dx.doi.org/10.1039/D4DD00277F

  3. [3]

S. B. Harris, R. Vasudevan, Y. Liu, Active oversight and quality control in standard Bayesian optimization for autonomous experiments. npj Computational Materials 11 (1), 23 (2025), doi:10.1038/s41524-024-01485-2, https://doi.org/10.1038/s41524-024-01485-2

  4. [4]

Diao, et al., AI-Equipped Scanning Probe Microscopy for Autonomous Site-Specific Atomic-Level Characterization at Room Temperature

    Z. Diao, et al., AI-Equipped Scanning Probe Microscopy for Autonomous Site-Specific Atomic-Level Characterization at Room Temperature. Small Methods 9 (1), 2400813 (2025), doi:10.1002/smtd.202400813, https://doi.org/10.1002/smtd.202400813

  5. [5]

    Sung, et al

J. Sung, et al., Autonomous AI-Driven Measurement and Characterization of 2D Materials Using Scanning Probe Microscopy. Small Structures 6 (12), e202500379 (2025), doi:10.1002/sstr.202500379, https://doi.org/10.1002/sstr.202500379

  6. [6]

    Diao, et al

Z. Diao, et al., Automatic drift compensation for nanoscale imaging using feature point matching. Applied Physics Letters 122 (12), 121601 (2023), doi:10.1063/5.0139330, https://doi.org/10.1063/5.0139330

  7. [7]

D. G. Deveci, et al., Comprehensive analysis and machine learning-based solutions for drift behavior in ambient Atomic Force Microscope conditions. Engineering Applications of Artificial Intelligence 159, 111678 (2025), doi:10.1016/j.engappai.2025.111678, https://www.sciencedirect.com/science/article/pii/S095219762501680X

  8. [8]

Z. Diao, L. Hou, M. Abe, Probe conditioning via convolution neural network for scanning probe microscopy automation. Applied Physics Express 16 (8), 085002 (2023), doi:10.35848/1882-0786/acecd6, https://doi.org/10.35848/1882-0786/acecd6

  9. [9]

    Krull, P

A. Krull, P. Hirsch, C. Rother, A. Schiffrin, C. Krull, Artificial-intelligence-driven scanning probe microscopy. Communications Physics 3 (1), 54 (2020), doi:10.1038/s42005-020-0317-3, https://doi.org/10.1038/s42005-020-0317-3

  10. [10]

A. M. Bran, et al., Augmenting large language models with chemistry tools. Nature Machine Intelligence 6 (5), 525–535 (2024), doi:10.1038/s42256-024-00832-8, https://doi.org/10.1038/s42256-024-00832-8

  11. [11]

Z. Liu, Y. Chai, J. Li, Toward Automated Simulation Research Workflow through LLM Prompt Engineering Design. Journal of Chemical Information and Modeling 65 (1), 114–124 (2025), doi:10.1021/acs.jcim.4c01653, https://doi.org/10.1021/acs.jcim.4c01653

  12. [12]

M. H. Prince, et al., Opportunities for retrieval and tool augmented large language models in scientific facilities. npj Computational Materials 10 (1), 251 (2024), doi:10.1038/s41524-024-01423-2, https://doi.org/10.1038/s41524-024-01423-2

  13. [13]

D. A. Boiko, R. MacKnight, B. Kline, G. Gomes, Autonomous chemical research with large language models. Nature 624 (7992), 570–578 (2023), doi:10.1038/s41586-023-06792-0, https://doi.org/10.1038/s41586-023-06792-0

  14. [14]

Y. Xie, K. He, A. Castellanos-Gomez, Toward Full Autonomous Laboratory Instrumentation Control with Large Language Models. Small Structures 6 (8), 2500173 (2025), doi:10.1002/sstr.202500173, https://doi.org/10.1002/sstr.202500173

  15. [15]

Y. Liu, M. Checa, R. K. Vasudevan, Synergizing human expertise and AI efficiency with language model for microscopy operation and automated experiment design. Machine Learning: Science and Technology 5 (2), 02LT01 (2024), doi:10.1088/2632-2153/ad52e9, https://doi.org/10.1088/2632-2153/ad52e9

  16. [16]

    Mandal, et al

I. Mandal, et al., Evaluating large language model agents for automation of atomic force microscopy. Nature Communications 16 (1), 9104 (2025), doi:10.1038/s41467-025-64105-7, https://doi.org/10.1038/s41467-025-64105-7

  17. [17]

Z. Diao, H. Yamashita, M. Abe, Leveraging large language model and social network service for automation in scanning probe microscopy. Measurement Science and Technology 36 (4), 047001 (2025), doi:10.1088/1361-6501/adbf3a, https://doi.org/10.1088/1361-6501/adbf3a

  18. [18]

Z. Xu, S. Jain, M. Kankanhalli, Hallucination is Inevitable: An Innate Limitation of Large Language Models (2025), https://arxiv.org/abs/2401.11817

  19. [19]

    Chen, et al., Precise atom manipulation through deep reinforcement learning

I.-J. Chen, et al., Precise atom manipulation through deep reinforcement learning. Nature Communications 13 (1), 7499 (2022), doi:10.1038/s41467-022-35149-w, https://doi.org/10.1038/s41467-022-35149-w

  20. [20]

    Okuyama, Z

J. Okuyama, Z. Diao, H. Yamashita, M. Abe, Integrated AI Framework for Room-Temperature Atom Manipulation in Scanning Probe Microscopy. Nano Letters 25 (51), 17771–17777 (2025), doi:10.1021/acs.nanolett.5c04982, https://doi.org/10.1021/acs.nanolett.5c04982

  21. [21]

    Su, et al

J. Su, et al., Intelligent synthesis of magnetic nanographenes via chemist-intuited atomic robotic probe. Nature Synthesis 3 (4), 466–476 (2024), doi:10.1038/s44160-024-00488-7, https://doi.org/10.1038/s44160-024-00488-7

  22. [22]

    Zhu, et al

Z. Zhu, et al., Deep learning drives autonomous molecular reactions with single-bond selectivity in tetra-brominated porphyrins on Au(111). Nature Communications (2026), doi:10.1038/s41467-026-69080-1, https://doi.org/10.1038/s41467-026-69080-1

  23. [23]

    Miret, N

S. Miret, N. M. A. Krishnan, Enabling large language models for real-world materials discovery. Nature Machine Intelligence 7 (7), 991–998 (2025), doi:10.1038/s42256-025-01058-y, https://doi.org/10.1038/s42256-025-01058-y

  24. [24]

    Alampara, et al

N. Alampara, et al., Probing the limitations of multimodal language models for chemistry and materials research. Nature Computational Science 5 (10), 952–961 (2025), doi:10.1038/s43588-025-00836-3, https://doi.org/10.1038/s43588-025-00836-3

  25. [25]

M. Shen, M. Umar, K. Maeng, G. E. Suh, U. Gupta, Towards Understanding Systems Trade-offs in Retrieval-Augmented Generation Model Inference (2024), https://arxiv.org/abs/2412.11854

  26. [26]

Zheng, et al., A Review on Edge Large Language Models: Design, Execution, and Applications

    Y. Zheng, et al., A Review on Edge Large Language Models: Design, Execution, and Applications. ACM Comput. Surv. 57 (8) (2025), doi:10.1145/3719664, https://doi.org/10.1145/3719664

  27. [27]

    Luccioni, Y

S. Luccioni, Y. Jernite, E. Strubell, Power Hungry Processing: Watts Driving the Cost of AI Deployment? (2024), doi:10.1145/3630106.3658542, https://doi.org/10.1145/3630106.3658542

  28. [28]

    LLaMA: Open and Efficient Foundation Language Models

H. Touvron, et al., LLaMA: Open and Efficient Foundation Language Models (2023), https://arxiv.org/abs/2302.13971

  29. [29]

    A. Q. Jiang, et al., Mistral 7B (2023), https://arxiv.org/abs/2310.06825

  30. [30]

    Phi-4 Technical Report

    M. Abdin, et al., Phi-4 Technical Report (2024), https://arxiv.org/abs/2412.08905

  31. [31]

Z. Diao, H. Yamashita, M. Abe, A metaverse laboratory setup for interactive atom visualization and manipulation with scanning probe microscopy. Scientific Reports 15 (1), 17490 (2025), doi:10.1038/s41598-025-01578-y, https://doi.org/10.1038/s41598-025-01578-y

  32. [32]

E. J. Hu, et al., LoRA: Low-Rank Adaptation of Large Language Models (2021), https://arxiv.org/abs/2106.09685

  33. [33]

    BERTScore: Evaluating Text Generation with BERT

T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, Y. Artzi, BERTScore: Evaluating Text Generation with BERT (2020), https://arxiv.org/abs/1904.09675

  34. [34]

    Liu, et al

Y. Liu, et al., G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment, in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, H. Bouamor, J. Pino, K. Bali, Eds. (Association for Computational Linguistics, Singapore) (2023), pp. 2511–2522, doi:10.18653/v1/2023.emnlp-main.153, https://aclanthology.org/2...

  35. [35]

S. V. Kalinin, et al., Machine learning for automated experimentation in scanning transmission electron microscopy. npj Computational Materials 9 (1), 227 (2023), doi:10.1038/s41524-023-01142-0, https://doi.org/10.1038/s41524-023-01142-0

  36. [36]

    Leitherer, B

A. Leitherer, B. C. Yeo, C. H. Liebscher, L. M. Ghiringhelli, Automatic identification of crystal structures and interfaces via artificial-intelligence-based electron microscopy. npj Computational Materials 9 (1), 179 (2023), doi:10.1038/s41524-023-01133-1, https://doi.org/10.1038/s41524-023-01133-1

  37. [37]

    Lannelongue, J

L. Lannelongue, J. Grealey, M. Inouye, Green Algorithms: Quantifying the Carbon Footprint of Computation. Advanced Science 8 (12), 2100707 (2021), doi:10.1002/advs.202100707, https://doi.org/10.1002/advs.202100707

  38. [38]

    Samsi, et al

S. Samsi, et al., From Words to Watts: Benchmarking the Energy Costs of Large Language Model Inference (2023), https://arxiv.org/abs/2310.03003

  39. [39]

    Decoupled Weight Decay Regularization

I. Loshchilov, F. Hutter, Decoupled Weight Decay Regularization (2019), https://arxiv.org/abs/1711.05101