STEMGym: Benchmarking Sequential Decision-Making under Dose Budgets in Autonomous Electron Microscopy

Can Polat; Erchin Serpedin; Hasan Kurban; Mustafa Kurban

arXiv: 2606.29592 · v1 · pith:NOUE6WM4new · submitted 2026-06-28 · cs.LG · cond-mat.mtrl-sci · physics.atom-ph · physics.optics · quant-ph

Pith reviewed 2026-06-30 07:19 UTC · model grok-4.3

classification cs.LG · cond-mat.mtrl-sci · physics.atom-ph · physics.optics · quant-ph

keywords autonomous electron microscopy · dose efficiency · STEMGym · sequential decision-making · perception pipeline · navigation strategies · DEC-AUC

The pith

The perception pipeline, not navigation strategy, determines dose efficiency in autonomous STEM imaging.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests the common premise that adaptive navigation algorithms are the main route to efficient imaging when every measurement costs damaging electron dose. In a new benchmark of 15 simulated STEM environments, a trained CNN analyst combined with simple raster scanning already delivers a dose-efficiency score 5.5 times higher than that of a CNN-free raster baseline. Replacing raster with Bayesian or finite-state-machine navigation produces no further statistically significant improvement. The same benchmark shows production vision-language models trailing task-specific CNNs by a factor of roughly 13 on defect analysis tasks. The work therefore reframes research priorities around perception quality rather than navigation sophistication.

Core claim

Across 33 agent configurations under realistic dose budgets, the dominant determinant of dose efficiency is the analyst pipeline, not the navigator: pairing a trained CNN analyst with naïve raster scanning raises DEC-AUC by 5.5x over a CNN-free raster baseline (0.287 vs. 0.052), while substituting Bayesian or adaptive finite-state-machine navigation for raster yields no statistically significant further gain.

What carries the argument

STEMGym benchmark of 15 physics-simulated worlds, scored by Dose-Efficiency Curve area (DEC-AUC), that decouples perception, navigation, and planning under a single dose budget.
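To make the metric concrete, here is a minimal sketch of an area-under-curve score. It assumes DEC-AUC is the trapezoidal area under a task-score versus dose-fraction curve with both axes normalised to [0, 1]. The paper's exact definition is not reproduced on this page, so the function name `dec_auc`, the normalisation, and the toy curves are illustrative assumptions rather than the authors' implementation.

```python
from typing import Sequence


def dec_auc(dose_fraction: Sequence[float], score: Sequence[float]) -> float:
    """Trapezoidal area under a score-vs-dose-fraction curve.

    Assumption (not taken from the paper): both axes lie in [0, 1] and
    dose_fraction is sorted in non-decreasing order.
    """
    if len(dose_fraction) != len(score) or len(score) < 2:
        raise ValueError("need two equal-length sequences with >= 2 points")
    area = 0.0
    for i in range(1, len(score)):
        width = dose_fraction[i] - dose_fraction[i - 1]
        if width < 0:
            raise ValueError("dose_fraction must be non-decreasing")
        area += 0.5 * width * (score[i] + score[i - 1])
    return area


# Toy curves (made-up numbers): an agent that becomes accurate early
# accumulates more area than one that improves only late in the budget.
dose = [0.0, 0.25, 0.5, 0.75, 1.0]
early = [0.0, 0.6, 0.8, 0.9, 0.9]
late = [0.0, 0.1, 0.2, 0.5, 0.9]

early_auc = dec_auc(dose, early)
late_auc = dec_auc(dose, late)
print(f"early={early_auc:.3f} late={late_auc:.3f}")

# Ratio of the two DEC-AUC values reported in the core claim.
print(f"reported ratio: {0.287 / 0.052:.1f}x")
```

Both toy agents end at the same final score, yet the early learner earns more than twice the area, which is the behaviour a dose-efficiency scalar is meant to reward.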

If this is right

Research effort in autonomous electron microscopy should shift from navigation algorithms toward perception models.
Once a capable analyst is present, naïve raster scanning becomes competitive with more complex navigators.
Task-specific CNNs outperform general-purpose vision-language models by a factor of approximately 13 on crystallographic defect analysis.
The benchmark supplies a shared testbed for measuring where additional ML investment yields the largest returns under dose constraints.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same perception-dominant pattern may appear in other dose-limited modalities such as cryo-EM or X-ray imaging.
Transfer experiments on real instruments would reveal whether simulation-to-real gaps alter the ranking of perception versus navigation.
Future benchmarks could isolate whether joint training of analyst and navigator under a shared dose budget changes the observed dominance of perception.

Load-bearing premise

The 15 physics-simulated STEM worlds accurately represent real-world dose-damage tradeoffs and imaging physics across the five materials and four tasks.

What would settle it

Executing the identical 33 agent configurations on a physical STEM instrument and finding that Bayesian or adaptive navigation produces statistically significant DEC-AUC gains beyond those obtained from the CNN analyst alone.

Figures

Figures reproduced from arXiv: 2606.29592 by Can Polat, Erchin Serpedin, Hasan Kurban, Mustafa Kurban.

**Figure 1.** STEMGym overview and material systems. (a) Platform pipeline: crystallographic structures are simulated via PRISM multislice into tiled HDF5 worlds; a Gymnasium environment exposes Navigate/Acquire/Declare/Done actions. (b) DEC concept: an intelligent agent (blue) reaches accuracy threshold s* at lower dose fraction than raster (orange); shaded areas represent DEC-AUC. Bottom: representative HAADF-STEM ti…

**Figure 2.** Benchmark results. (a)–(c) DEC curves: equipped baselines separate sharply from naïve agents. (d) Budget sensitivity: rankings stable across 1000 to 10 000 e Å⁻². (e)–(g) Bottleneck decomposition (bootstrap 95% CIs): large perception gap, flat navigation trend. (h) Replay validation: Raster (0.792) > GP-UCB (0.676) > Random (0.646) > STEMAgent (0.434). (i)–(k) VLM comparison: CNN dominates VLMs on crysta…

**Figure 3.** The full STEMAgent architecture and planner FSM. (a) Architecture: a GP-UCB Navigator selecting x* = arg max (μ + βσ), with β ∈ [0.5, 3.0] set by the Planner; an Analyst (3×UNet → DefCNN → PhaseNet) with ensemble uncertainty estimation; a Planner FSM with 5 modes (panel b) driven by coverage/defect/budget triggers; an ExperimentState shared blackboard; and STEMEnv (Gymnasium), which receives actions and returns obs, info. (b) Planner FSM modes: SURVEY (β = 3.0), INVEST. (β = 2.0), CHAR. (β = 0.5), …

**Figure 4.** Quantitative analyses (relocated from main text for space).

**Figure 5.** Spatial exploration and acquisition trajectories.
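The navigator rule quoted in the Figure 3 caption, x* = arg max (μ + βσ), can be sketched in a few lines. This is a minimal illustration, assuming the posterior mean and standard deviation at each candidate tile are already available. The names `ucb_select` and `MODE_BETA`, the candidate values, and the per-mode lookup of β are assumptions for illustration; only the three β values come from the caption, which is truncated.

```python
from typing import Sequence

# Exploration weights read from the Figure 3 caption; the caption is
# truncated, so only the modes visible there are listed.
MODE_BETA = {"SURVEY": 3.0, "INVEST": 2.0, "CHAR": 0.5}


def ucb_select(mu: Sequence[float], sigma: Sequence[float], beta: float) -> int:
    """Return the index maximising mu[i] + beta * sigma[i]."""
    if len(mu) != len(sigma) or not mu:
        raise ValueError("mu and sigma must be non-empty and equal length")
    scores = [m + beta * s for m, s in zip(mu, sigma)]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical posterior over three candidate tiles.
mu = [0.50, 0.40, 0.10]      # predicted value of acquiring each tile
sigma = [0.01, 0.10, 0.30]   # posterior uncertainty for each tile

for mode, beta in MODE_BETA.items():
    print(mode, "-> tile", ucb_select(mu, sigma, beta))
```

With these toy numbers the larger β values pick the most uncertain tile, while the smallest β picks the tile with the highest predicted value, which is the usual exploration versus exploitation trade-off in upper-confidence-bound selection.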

Original abstract

A central premise of autonomous scientific imaging is that smarter navigation, whether Bayesian, RL-based, or otherwise adaptive, is the principal lever for sample-efficient acquisition. We present evidence to the contrary in scanning transmission electron microscopy (STEM), an atomic-resolution imaging modality whose every measurement deposits damaging electron dose. We introduce STEMGym, an open-source Gymnasium benchmark of 15 physics-simulated STEM worlds spanning five materials, three difficulty levels, and four characterisation tasks, scored by the Dose-Efficiency Curve area (DEC-AUC), a single scalar capturing the information-vs-dose Pareto frontier. Across 33 agent configurations under realistic dose budgets, the dominant determinant of dose efficiency is the analyst (perception) pipeline, not the navigator: pairing a trained CNN analyst with naïve raster scanning raises DEC-AUC by 5.5x over a CNN-free raster baseline (0.287 vs. 0.052), while substituting Bayesian or adaptive finite-state-machine navigation for raster yields no statistically significant further gain. Production-tier vision-language models further underperform task-specific CNNs by ~13x on crystallographic defect analysis. By decoupling perception, navigation, and planning under a unified dose budget, STEMGym reframes where ML effort should be invested in autonomous electron microscopy and provides the measurement infrastructure to test it.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

STEMGym gives a clean benchmark separating perception from navigation in dose-limited STEM, but the big perception-dominates claim rests on unvalidated simulations.

The paper's main point is straightforward: in their STEMGym setup, swapping in a trained CNN analyst lifts dose efficiency a lot more than swapping raster for Bayesian or adaptive navigation. The 5.5x DEC-AUC jump from CNN+raster over plain raster, and the lack of further gain from smarter navigation, is the result they want people to notice. They also flag that production VLMs lag task-specific CNNs by a wide margin on defect analysis.

What is new is the benchmark itself—15 physics-simulated worlds, four tasks, three difficulty levels, and the DEC-AUC scalar that collapses the information-versus-dose curve into one number. Decoupling the analyst, navigator, and planner under a shared dose budget is useful infrastructure. The open-source Gymnasium framing makes it straightforward for others to plug in their own agents or perception modules.
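As a rough picture of what plugging an agent into a dose-budgeted environment might look like, the sketch below runs a raster policy against a toy environment. It does not use the STEMGym or Gymnasium packages. The class `ToyDoseEnv`, its `reset`/`step` signatures, the grid size, and every numeric value are hypothetical stand-ins; only the idea of Acquire and Done actions under one shared dose budget is taken from the paper's description.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDoseEnv:
    """Hypothetical tile world where every acquisition spends dose."""
    n_tiles: int = 16
    budget: float = 1000.0           # total dose budget (illustrative units)
    dose_per_acquire: float = 100.0  # assumed cost of one acquisition
    spent: float = 0.0
    seen: set = field(default_factory=set)

    def reset(self) -> dict:
        self.spent = 0.0
        self.seen = set()
        return {"dose_fraction": 0.0, "coverage": 0.0}

    def step(self, action: str, tile: int = 0):
        """Apply 'acquire' or 'done'; return (observation, terminated)."""
        if action == "acquire":
            self.spent += self.dose_per_acquire
            self.seen.add(tile)
        obs = {
            "dose_fraction": self.spent / self.budget,
            "coverage": len(self.seen) / self.n_tiles,
        }
        # Stop when one more acquisition would exceed the budget.
        out_of_dose = self.spent + self.dose_per_acquire > self.budget
        return obs, action == "done" or out_of_dose


def run_raster(env: ToyDoseEnv) -> list:
    """Visit tiles in index order until the budget runs out."""
    history = [env.reset()]
    for tile in range(env.n_tiles):
        obs, terminated = env.step("acquire", tile)
        history.append(obs)
        if terminated:
            break
    return history


history = run_raster(ToyDoseEnv())
print(len(history) - 1, "acquisitions, final coverage", history[-1]["coverage"])
```

Because the budget lives in the environment rather than in the policy, any other navigator could replace `run_raster` and be charged for dose in exactly the same way, which is the kind of decoupling the benchmark is built around.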

The soft spot is the simulation layer. The abstract and claim treat the 15 worlds as representative of real dose-damage tradeoffs across five materials, yet nothing in the provided text shows direct comparison to instrument data, noise statistics, or damage thresholds. If the simulated contrast, cumulative damage, or electron-sample interactions deviate from physical STEM, both the large perception effect and the null navigation result become harder to trust for real labs. The statistical tests on the 33 configurations also need the full methods section to judge power and multiple-comparison handling.

This is for groups building autonomous microscopy pipelines or anyone who needs a controlled testbed for perception-versus-planning questions in dose-sensitive settings. A reader who wants to try new agents or perception models on a standardized dose budget will get immediate value from the environment and metric.

It deserves a serious referee. The benchmark and metric are concrete contributions even if the transfer story needs more evidence.

Referee Report

2 major / 2 minor

Summary. The paper introduces STEMGym, an open-source Gymnasium benchmark comprising 15 physics-simulated STEM worlds spanning five materials, three difficulty levels, and four characterisation tasks. Agents are evaluated under realistic dose budgets using the Dose-Efficiency Curve area (DEC-AUC) metric. Across 33 configurations, the central claim is that the analyst (perception) pipeline dominates dose efficiency: a trained CNN paired with naïve raster scanning yields a 5.5× DEC-AUC gain over a CNN-free raster baseline (0.287 vs. 0.052), while replacing raster with Bayesian or adaptive finite-state-machine navigation produces no statistically significant further improvement. Task-specific CNNs also outperform production vision-language models by ~13× on crystallographic defect analysis. The work decouples perception, navigation, and planning to argue for redirecting ML effort toward perception in autonomous electron microscopy.

Significance. If the simulated environments faithfully reproduce real dose-damage tradeoffs, imaging physics, and noise statistics, the result would usefully reorient research priorities in autonomous STEM by demonstrating that perception quality is the primary bottleneck under dose constraints. The open-source benchmark, unified Gymnasium interface, and explicit separation of analyst and navigator components are concrete strengths that enable reproducible comparisons. The provision of a single scalar (DEC-AUC) capturing the information-vs-dose Pareto frontier is a practical contribution for future benchmarking.

major comments (2)

[Simulation Environments and Results] Simulation Environments and Results sections: All quantitative claims (5.5× DEC-AUC gain, null navigation effect, 13× VLM gap) rest on the 15 physics-simulated worlds. The manuscript does not report direct experimental validation or sensitivity analysis comparing simulated damage thresholds, contrast formation, or cumulative dose effects against real STEM instrument data for the five materials and four tasks; without such grounding, the transferability of the perception-dominance conclusion to physical autonomous microscopy cannot be assessed.
[§4 (Agent Configurations)] §4 (Agent Configurations) and statistical reporting: The statement of “no statistically significant further gain” from Bayesian or adaptive FSM navigation requires explicit p-values, confidence intervals, or correction for multiple comparisons across the 33 configurations; the current description leaves the power and exact test procedure unspecified, which is load-bearing for the null-navigation claim.

minor comments (2)

[Abstract] Abstract: the escaped quote in “na"ive” should be rendered consistently as “naïve” or “naive” for readability.
[Figures and Tables] Figure captions and tables: ensure all DEC-AUC values are accompanied by the number of independent runs and error bars so that the 0.287 vs. 0.052 comparison can be directly interpreted.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback, which helps clarify the scope and evidentiary basis of our claims. We respond to each major comment below and indicate planned revisions where appropriate.

Referee: [Simulation Environments and Results] Simulation Environments and Results sections: All quantitative claims (5.5× DEC-AUC gain, null navigation effect, 13× VLM gap) rest on the 15 physics-simulated worlds. The manuscript does not report direct experimental validation or sensitivity analysis comparing simulated damage thresholds, contrast formation, or cumulative dose effects against real STEM instrument data for the five materials and four tasks; without such grounding, the transferability of the perception-dominance conclusion to physical autonomous microscopy cannot be assessed.

Authors: We agree this is a substantive limitation. The environments are constructed from established physics models of electron scattering, dose accumulation, and Poisson noise (detailed in the methods and supplementary information), but we do not provide direct side-by-side comparison with experimental STEM data for the specific materials and tasks. This restricts strong claims about immediate transfer to physical instruments. In revision we will add an explicit Limitations subsection that (i) states the simulation-to-real gap, (ii) reports the parameter ranges used for damage thresholds and contrast, and (iii) outlines a concrete experimental validation protocol for future work. We will not claim the current results directly predict instrument performance. revision: partial
Referee: [§4 (Agent Configurations)] §4 (Agent Configurations) and statistical reporting: The statement of “no statistically significant further gain” from Bayesian or adaptive FSM navigation requires explicit p-values, confidence intervals, or correction for multiple comparisons across the 33 configurations; the current description leaves the power and exact test procedure unspecified, which is load-bearing for the null-navigation claim.

Authors: We accept the criticism. The revised manuscript will report, for each analyst-navigator pair, (a) mean DEC-AUC with 95% bootstrap confidence intervals, (b) paired Wilcoxon signed-rank p-values comparing navigation variants against raster under the same analyst, and (c) Bonferroni-adjusted thresholds across the 33 configurations. The updated numbers continue to show no significant navigation effect once the analyst is fixed (adjusted p > 0.05), but the statistical details will now be fully specified. revision: yes
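The reporting plan in this response can be sketched as follows. This is a minimal illustration using only the Python standard library, assuming per-world DEC-AUC scores for two navigators under the same analyst are available as paired lists. The numbers are made up, the helper names are hypothetical, and the Wilcoxon signed-rank test itself is left out; in practice a library routine such as `scipy.stats.wilcoxon` would supply the p-value.

```python
import random
import statistics
from typing import Sequence, Tuple


def bootstrap_ci(diffs: Sequence[float], n_boot: int = 5000,
                 level: float = 0.95, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap CI for the mean of paired differences."""
    rng = random.Random(seed)
    n = len(diffs)
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=n)) for _ in range(n_boot)
    )
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = int((1 + level) / 2 * n_boot) - 1
    return means[lo_idx], means[hi_idx]


def bonferroni_alpha(alpha: float, n_comparisons: int) -> float:
    """Per-comparison significance threshold after Bonferroni correction."""
    return alpha / n_comparisons


# Made-up paired DEC-AUC scores for one analyst under two navigators.
raster = [0.29, 0.31, 0.27, 0.30, 0.28, 0.26, 0.32, 0.29]
gp_ucb = [0.30, 0.29, 0.28, 0.31, 0.27, 0.27, 0.30, 0.30]
diffs = [g - r for g, r in zip(gp_ucb, raster)]

low, high = bootstrap_ci(diffs)
print(f"95% bootstrap CI for mean difference: [{low:+.4f}, {high:+.4f}]")
print(f"Bonferroni threshold for 33 comparisons: {bonferroni_alpha(0.05, 33):.5f}")
```

An interval that straddles zero is consistent with no navigation effect, but it does not by itself establish that the navigators are equivalent.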

standing simulated objections not resolved

Performing new experimental STEM runs on physical instruments for all five materials, three difficulty levels, and four tasks to directly validate every simulated damage and contrast parameter is outside the scope of the present benchmark paper.

Circularity Check

0 steps flagged

No circularity; benchmark results are independent empirical comparisons

The paper's central claim rests on direct performance measurements (DEC-AUC scores) obtained by running 33 agent configurations on the 15 newly introduced physics-simulated STEM worlds. No equations, fitted parameters, or self-citations are invoked to derive the perception-vs-navigation ranking; the reported 5.5× gain and null navigation effect are outputs of the benchmark execution itself. The simulation physics and task definitions are external to the result and do not reduce the comparisons to tautologies. This is the normal case of a self-contained benchmark paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper introduces a benchmark rather than a derivation; the central claim rests on the domain assumption that simulations match reality.

axioms (1)

domain assumption: Physics simulations in the 15 worlds accurately capture real STEM dose-damage and imaging tradeoffs.
Required for benchmark results to inform physical experiments.

