STEMGym: Benchmarking Sequential Decision-Making under Dose Budgets in Autonomous Electron Microscopy
Pith reviewed 2026-06-30 07:19 UTC · model grok-4.3
The pith
The perception pipeline, not navigation strategy, determines dose efficiency in autonomous STEM imaging.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Across 33 agent configurations under realistic dose budgets, the dominant determinant of dose efficiency is the analyst pipeline, not the navigator: pairing a trained CNN analyst with naïve raster scanning raises DEC-AUC by 5.5x over a CNN-free raster baseline (0.287 vs. 0.052), while substituting Bayesian or adaptive finite-state-machine navigation for raster yields no statistically significant further gain.
What carries the argument
STEMGym benchmark of 15 physics-simulated worlds, scored by Dose-Efficiency Curve area (DEC-AUC), that decouples perception, navigation, and planning under a single dose budget.
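To make the scoring idea concrete, here is a minimal sketch of how an area-under-curve score over an information-vs-dose trajectory could be computed. The paper's exact DEC-AUC definition is not reproduced on this page, so the function name `dec_auc`, the normalisation of dose by the budget, and the flat extension of the curve to the full budget are all assumptions made for illustration.

```python
from typing import Sequence


def dec_auc(dose: Sequence[float], info: Sequence[float], budget: float) -> float:
    """Trapezoidal area under an information-vs-dose curve.

    Assumptions (not taken from the paper): dose is cumulative and is
    normalised by the budget, info is a score in [0, 1], and the last
    observed info value is held flat until the budget is reached.
    """
    if budget <= 0:
        raise ValueError("budget must be positive")
    if len(dose) != len(info) or len(dose) < 2:
        raise ValueError("need at least two (dose, info) points of equal length")
    if any(b < a for a, b in zip(dose, dose[1:])):
        raise ValueError("cumulative dose must be non-decreasing")
    if dose[-1] > budget:
        raise ValueError("curve extends past the dose budget")

    xs = [d / budget for d in dose]  # normalised dose in [0, 1]
    area = 0.0
    for i in range(1, len(xs)):
        # Trapezoid between consecutive measurements.
        area += (xs[i] - xs[i - 1]) * (info[i] + info[i - 1]) / 2.0
    # Hold the final information value flat over any unspent budget.
    area += (1.0 - xs[-1]) * info[-1]
    return area


# The headline ratio follows directly from the two reported scores.
reported_gain = round(0.287 / 0.052, 1)
```

Under these assumptions, an agent that gains information early in the budget scores higher than one that reaches the same final score late, which is the property a dose-efficiency metric needs.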
If this is right
- Research effort in autonomous electron microscopy should shift from navigation algorithms toward perception models.
- Once a capable analyst is present, naïve raster scanning becomes competitive with more complex navigators.
- Task-specific CNNs outperform general-purpose vision-language models by a factor of approximately 13 on crystallographic defect analysis.
- The benchmark supplies a shared testbed for measuring where additional ML investment yields the largest returns under dose constraints.
Where Pith is reading between the lines
- The same perception-dominant pattern may appear in other dose-limited modalities such as cryo-EM or X-ray imaging.
- Transfer experiments on real instruments would reveal whether simulation-to-real gaps alter the ranking of perception versus navigation.
- Future benchmarks could isolate whether joint training of analyst and navigator under a shared dose budget changes the observed dominance of perception.
Load-bearing premise
The 15 physics-simulated STEM worlds accurately represent real-world dose-damage tradeoffs and imaging physics across the five materials and four tasks.
What would settle it
Executing the identical 33 agent configurations on a physical STEM instrument and finding that Bayesian or adaptive navigation produces statistically significant DEC-AUC gains beyond those obtained from the CNN analyst alone.
Original abstract
A central premise of autonomous scientific imaging is that smarter navigation, whether Bayesian, RL-based, or otherwise adaptive, is the principal lever for sample-efficient acquisition. We present evidence to the contrary in scanning transmission electron microscopy (STEM), an atomic-resolution imaging modality whose every measurement deposits damaging electron dose. We introduce STEMGym, an open-source Gymnasium benchmark of 15 physics-simulated STEM worlds spanning five materials, three difficulty levels, and four characterisation tasks, scored by the Dose-Efficiency Curve area (DEC-AUC), a single scalar capturing the information-vs-dose Pareto frontier. Across 33 agent configurations under realistic dose budgets, the dominant determinant of dose efficiency is the analyst (perception) pipeline, not the navigator: pairing a trained CNN analyst with na\"ive raster scanning raises DEC-AUC by 5.5x over a CNN-free raster baseline (0.287 vs. 0.052), while substituting Bayesian or adaptive finite-state-machine navigation for raster yields no statistically significant further gain. Production-tier vision-language models further underperform task-specific CNNs by ~13x on crystallographic defect analysis. By decoupling perception, navigation, and planning under a unified dose budget, STEMGym reframes where ML effort should be invested in autonomous electron microscopy and provides the measurement infrastructure to test it.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces STEMGym, an open-source Gymnasium benchmark comprising 15 physics-simulated STEM worlds spanning five materials, three difficulty levels, and four characterisation tasks. Agents are evaluated under realistic dose budgets using the Dose-Efficiency Curve area (DEC-AUC) metric. Across 33 configurations, the central claim is that the analyst (perception) pipeline dominates dose efficiency: a trained CNN paired with naïve raster scanning yields a 5.5× DEC-AUC gain over a CNN-free raster baseline (0.287 vs. 0.052), while replacing raster with Bayesian or adaptive finite-state-machine navigation produces no statistically significant further improvement. Task-specific CNNs also outperform production vision-language models by ~13× on crystallographic defect analysis. The work decouples perception, navigation, and planning to argue for redirecting ML effort toward perception in autonomous electron microscopy.
Significance. If the simulated environments faithfully reproduce real dose-damage tradeoffs, imaging physics, and noise statistics, the result would usefully reorient research priorities in autonomous STEM by demonstrating that perception quality is the primary bottleneck under dose constraints. The open-source benchmark, unified Gymnasium interface, and explicit separation of analyst and navigator components are concrete strengths that enable reproducible comparisons. The provision of a single scalar (DEC-AUC) capturing the information-vs-dose Pareto frontier is a practical contribution for future benchmarking.
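The separation of analyst and navigator under one dose budget can be sketched with a toy episode loop. This is not STEMGym's API: the class names, the `run_episode` helper, the integer-grid "world", and the fixed dose per probe are all hypothetical stand-ins chosen so the example runs without any dependencies.

```python
from typing import Iterator, List, Protocol, Set, Tuple

Position = Tuple[int, int]


class Navigator(Protocol):
    def positions(self, shape: Tuple[int, int]) -> Iterator[Position]: ...


class Analyst(Protocol):
    def detect(self, pixel: int) -> bool: ...


class RasterNavigator:
    """Visits every position row by row (naive raster scan)."""

    def positions(self, shape: Tuple[int, int]) -> Iterator[Position]:
        rows, cols = shape
        for r in range(rows):
            for c in range(cols):
                yield (r, c)


class ThresholdAnalyst:
    """Toy stand-in for a trained perception model."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold

    def detect(self, pixel: int) -> bool:
        return pixel >= self.threshold


class BlindAnalyst:
    """Toy stand-in for an agent with no perception model."""

    def detect(self, pixel: int) -> bool:
        return False


def run_episode(
    world: List[List[int]],
    navigator: Navigator,
    analyst: Analyst,
    dose_budget: float,
    dose_per_probe: float = 1.0,
) -> Tuple[Set[Position], float]:
    """Probe positions until the next probe would exceed the dose budget."""
    found: Set[Position] = set()
    dose_used = 0.0
    for r, c in navigator.positions((len(world), len(world[0]))):
        if dose_used + dose_per_probe > dose_budget:
            break  # budget exhausted: the episode ends
        dose_used += dose_per_probe
        if analyst.detect(world[r][c]):
            found.add((r, c))
    return found, dose_used


# Same navigator, same budget, different analysts.
toy_world = [[0, 1, 0], [0, 0, 1]]
with_analyst, spent = run_episode(toy_world, RasterNavigator(), ThresholdAnalyst(1), 4)
without_analyst, _ = run_episode(toy_world, RasterNavigator(), BlindAnalyst(), 4)
```

Because the navigator and analyst share only the loop and the budget, either one can be swapped while the other is held fixed, which is the kind of comparison the benchmark's 33 configurations rely on.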
major comments (2)
- [Simulation Environments and Results] Simulation Environments and Results sections: All quantitative claims (5.5× DEC-AUC gain, null navigation effect, 13× VLM gap) rest on the 15 physics-simulated worlds. The manuscript does not report direct experimental validation or sensitivity analysis comparing simulated damage thresholds, contrast formation, or cumulative dose effects against real STEM instrument data for the five materials and four tasks; without such grounding, the transferability of the perception-dominance conclusion to physical autonomous microscopy cannot be assessed.
- [§4 (Agent Configurations)] §4 (Agent Configurations) and statistical reporting: The statement of “no statistically significant further gain” from Bayesian or adaptive FSM navigation requires explicit p-values, confidence intervals, or correction for multiple comparisons across the 33 configurations; the current description leaves the power and exact test procedure unspecified, which is load-bearing for the null-navigation claim.
minor comments (2)
- [Abstract] Abstract: the escaped quote in “na"ive” should be rendered consistently as “naïve” or “naive” for readability.
- [Figures and Tables] Figure captions and tables: ensure all DEC-AUC values are accompanied by the number of independent runs and error bars so that the 0.287 vs. 0.052 comparison can be directly interpreted.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which helps clarify the scope and evidentiary basis of our claims. We respond to each major comment below and indicate planned revisions where appropriate.
-
Referee: [Simulation Environments and Results] Simulation Environments and Results sections: All quantitative claims (5.5× DEC-AUC gain, null navigation effect, 13× VLM gap) rest on the 15 physics-simulated worlds. The manuscript does not report direct experimental validation or sensitivity analysis comparing simulated damage thresholds, contrast formation, or cumulative dose effects against real STEM instrument data for the five materials and four tasks; without such grounding, the transferability of the perception-dominance conclusion to physical autonomous microscopy cannot be assessed.
Authors: We agree this is a substantive limitation. The environments are constructed from established physics models of electron scattering, dose accumulation, and Poisson noise (detailed in the methods and supplementary information), but we do not provide direct side-by-side comparison with experimental STEM data for the specific materials and tasks. This restricts strong claims about immediate transfer to physical instruments. In revision we will add an explicit Limitations subsection that (i) states the simulation-to-real gap, (ii) reports the parameter ranges used for damage thresholds and contrast, and (iii) outlines a concrete experimental validation protocol for future work. We will not claim the current results directly predict instrument performance.
Revision: partial
-
Referee: [§4 (Agent Configurations)] §4 (Agent Configurations) and statistical reporting: The statement of “no statistically significant further gain” from Bayesian or adaptive FSM navigation requires explicit p-values, confidence intervals, or correction for multiple comparisons across the 33 configurations; the current description leaves the power and exact test procedure unspecified, which is load-bearing for the null-navigation claim.
Authors: We accept the criticism. The revised manuscript will report, for each analyst-navigator pair, (a) mean DEC-AUC with 95% bootstrap confidence intervals, (b) paired Wilcoxon signed-rank p-values comparing navigation variants against raster under the same analyst, and (c) Bonferroni-adjusted thresholds across the 33 configurations. The updated numbers continue to show no significant navigation effect once the analyst is fixed (adjusted p > 0.05), but the statistical details will now be fully specified.
Revision: yes
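The three statistics named in this response can be sketched in plain Python. This is an illustrative sketch only, not the authors' analysis code: the function names are hypothetical, the bootstrap uses the simple percentile method, and the Wilcoxon routine is an exact two-sided test that assumes no zero differences and no tied absolute differences (a library routine such as SciPy's would handle those cases).

```python
import random
from typing import List, Sequence, Tuple


def bootstrap_mean_ci(
    values: Sequence[float], n_boot: int = 2000, seed: int = 0
) -> Tuple[float, float]:
    """95% percentile-bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(
        sum(rng.choice(values) for _ in range(n)) / n for _ in range(n_boot)
    )
    return means[int(0.025 * n_boot)], means[int(0.975 * n_boot) - 1]


def wilcoxon_signed_rank_exact(
    x: Sequence[float], y: Sequence[float]
) -> Tuple[int, float]:
    """Exact two-sided paired Wilcoxon signed-rank test.

    Returns (W+, p). Assumes no zero and no tied |differences|.
    """
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    if any(d == 0 for d in diffs) or len({abs(d) for d in diffs}) != n:
        raise ValueError("zero or tied differences are not handled in this sketch")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    w_plus = sum(rank for rank, i in enumerate(order, start=1) if diffs[i] > 0)

    # Null distribution: counts[s] = number of subsets of ranks {1..n} summing to s.
    total = n * (n + 1) // 2
    counts: List[int] = [1] + [0] * total
    for r in range(1, n + 1):
        for s in range(total, r - 1, -1):
            counts[s] += counts[s - r]

    w = min(w_plus, total - w_plus)
    p = min(1.0, 2.0 * sum(counts[: w + 1]) / 2 ** n)
    return w_plus, p


def bonferroni_threshold(alpha: float, n_comparisons: int) -> float:
    """Per-comparison significance threshold under Bonferroni correction."""
    return alpha / n_comparisons
```

With 33 comparisons and a family-wise level of 0.05, the per-comparison threshold is 0.05 / 33, roughly 0.0015. Note that with very few paired runs the exact test cannot reach that threshold at all (five pairs give a minimum two-sided p of 0.0625), so the number of independent runs matters for interpreting a null result.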
- Performing new experimental STEM runs on physical instruments for all five materials, three difficulty levels, and four tasks to directly validate every simulated damage and contrast parameter is outside the scope of the present benchmark paper.
Circularity Check
No circularity; benchmark results are independent empirical comparisons
The paper's central claim rests on direct performance measurements (DEC-AUC scores) obtained by running 33 agent configurations on the 15 newly introduced physics-simulated STEM worlds. No equations, fitted parameters, or self-citations are invoked to derive the perception-vs-navigation ranking; the reported 5.5× gain and null navigation effect are outputs of the benchmark execution itself. The simulation physics and task definitions are external to the result and do not reduce the comparisons to tautologies. This is the normal case of a self-contained benchmark paper.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: physics simulations in the 15 worlds accurately capture real STEM dose-damage and imaging tradeoffs.