Offline Reinforcement Learning for Rotation Profile Control in Tokamaks

Andrew Rothstein; Egemen Kolemen; Hiro Josep Farre Kaga; Ian Char; Jeff Schneider; Jiayu Chen; Ricardo Shousha; Rohit Sonker

arxiv: 2605.05857 · v2 · pith:7ZABN6EFnew · submitted 2026-05-07 · 💻 cs.LG

Offline Reinforcement Learning for Rotation Profile Control in Tokamaks

Rohit Sonker , Hiro Josep Farre Kaga , Jiayu Chen , Andrew Rothstein , Ian Char , Ricardo Shousha , Egemen Kolemen , Jeff Schneider This is my paper

Pith reviewed 2026-06-30 23:44 UTC · model grok-4.3

classification 💻 cs.LG

keywords offline reinforcement learningtokamak controlplasma rotation profileDIII-Dmodel-based RLfusion plasma control

0 comments

The pith

Offline reinforcement learning trained only on past DIII-D shots can control the full plasma rotation profile in real time on the tokamak.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether offline RL and model-based offline RL can produce usable control policies for the high-dimensional plasma rotation profile when no accurate simulator exists. Training occurs exclusively on historical experimental data from the DIII-D device, with probabilistic dynamics models used to generate synthetic rollouts for policy improvement. The resulting policy is then executed on the live tokamak, yielding promising performance. A sympathetic reader cares because this route could bypass the simulator bottleneck that blocks conventional control design for many tokamak problems.

Core claim

We investigate the use of offline RL and offline model-based RL algorithms for rotation profile control, training them solely on historical data from the DIII-D tokamak. Our final method uses probabilistic models of plasma dynamics to generate rollouts for RL training. We deploy this policy on the DIII-D Tokamak and observe promising real-world results.

What carries the argument

Probabilistic models of plasma dynamics that generate rollouts for training offline RL policies solely from historical DIII-D shots.

If this is right

Multi-input multi-output rotation-profile control becomes feasible without an accurate first-principles simulator.
Data collected from past shots can be reused to train policies that are then run in closed loop on the device.
The same offline-model-based workflow may extend to other high-dimensional tokamak control tasks once sufficient historical coverage exists.
Real-time deployment reveals concrete challenges in safety, actuator limits, and data coverage that must be addressed for reliable operation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the approach scales, offline RL could serve as a general template for learning control laws on other fusion devices where simulators lag behind experimental data.
Iterative data collection—running the policy, logging new trajectories, and retraining—could steadily enlarge the covered state space.
The method implicitly assumes that the probabilistic model captures enough of the dynamics to avoid harmful extrapolation; violations would appear as unsafe actions on the device.

Load-bearing premise

Historical DIII-D shots contain sufficient coverage of the relevant state-action space and plasma conditions so that a policy learned from them will generalize safely and effectively when executed in real time on the live device.

What would settle it

Execution of the learned policy on a new set of DIII-D discharges that produces large deviations from target rotation profiles or triggers instabilities would falsify the claim of effective real-world control.

Figures

Figures reproduced from arXiv: 2605.05857 by Andrew Rothstein, Egemen Kolemen, Hiro Josep Farre Kaga, Ian Char, Jeff Schneider, Jiayu Chen, Ricardo Shousha, Rohit Sonker.

**Figure 1.** Figure 1: RL Policy Training using trajectories generated autoregressively from the RPNN dynam view at source ↗

**Figure 2.** Figure 2: RL Policy Deployment on the DIII-D Tokamak. The policy is uploaded onto the Plasma view at source ↗

**Figure 3.** Figure 3: Real tokamak results for Shot 203022 from DIII-D. The first plot shows value of rotation view at source ↗

**Figure 4.** Figure 4: Simulated results for shot 203022 using the dynamics-model environment. Two target view at source ↗

**Figure 5.** Figure 5: Full RTCAKENN Rotation profile at various time slices during the first target change. view at source ↗

**Figure 6.** Figure 6: Full Rotation RTCAKENN profiles at various time slices during the second target change. view at source ↗

**Figure 7.** Figure 7: Gas Voltage and subsequently the gas flow rate increases at approx t=3s, corresponding view at source ↗

read the original abstract

Tokamaks remain leading candidates for achieving practical fusion energy, yet many important control problems inside these devices are still difficult or unsolved. One such challenge is controlling the plasma rotation profile, which strongly influences stability, confinement, and transport. While the average rotation can be controlled, controlling the full profile is challenging due to high dimensionality, response to multiple actuators and dependence on plasma condition. Learning-based control methods, such as reinforcement learning (RL), provide a potential solution to this challenging problem with ability to model complex interactions leading to effective multi-input multi-output control. However, learning such policies is challenging due to the lack of accurate simulators that can model the rotation profile dynamics. In this work, we investigate the use of offline RL and offline model-based RL algorithms for rotation profile control, training them solely on historical data from the DIII-D tokamak. Our final method uses probabilistic models of plasma dynamics to generate rollouts for RL training. We deploy this policy on the DIII-D Tokamak and observe promising real-world results. We conclude by highlighting key challenges and insights from training and deploying an RL policy on a complex physical device while using only limited past data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

They report training offline RL on historical DIII-D shots and deploying a policy for rotation profile control on the real machine, but the abstract supplies no metrics or validation details.

read the letter

The core claim is a real-world deployment of an offline RL policy for tokamak rotation profile control, trained only on past DIII-D data with probabilistic models for rollouts. That combination is new enough to note, since most prior work stays in simulation.

They handle a genuine control problem where accurate simulators are missing and multiple actuators affect a high-dimensional profile. Using historical shots avoids the simulator gap, and the focus on model-based offline RL fits the data constraints. The closing section on challenges from limited data and physical deployment could be useful for others trying similar transfers.

The soft spot is exactly the one in the stress-test note. The abstract asserts deployment and promising results but gives no coverage statistics, out-of-distribution checks, safety filters, baselines, or quantitative outcomes. Without those, it is impossible to judge whether the historical data actually supported safe generalization or whether the policy stayed inside the training distribution during live runs. If the full paper contains those numbers and they hold up, the result strengthens; on the abstract alone the evidence is too thin to evaluate the central claim.

This paper is for fusion control researchers and applied RL groups working on physical systems with offline data. A reader already familiar with DIII-D experiments might extract practical lessons from the deployment section, but most others will need the missing metrics before treating the result as established.

I would send it to peer review. Real hardware attempts in this domain are uncommon and worth referee scrutiny even if the current version needs substantial additions on validation and safety.

Referee Report

2 major / 0 minor

Summary. The manuscript investigates offline RL and offline model-based RL for plasma rotation profile control in tokamaks. Policies are trained exclusively on historical DIII-D data; probabilistic models of plasma dynamics are used to generate rollouts during training. The central claim is that the resulting policy was deployed on the DIII-D tokamak and produced promising real-world results, with the work concluding by discussing challenges of real-world RL deployment from limited offline data.

Significance. Demonstrating safe, effective real-world deployment of an offline RL policy for a high-dimensional, safety-critical control task in a fusion device would be a notable contribution, as it addresses the absence of accurate simulators and shows practical generalization from historical data. The work also surfaces concrete deployment challenges that are valuable for the community. However, the absence of quantitative metrics, baselines, coverage analysis, or safety validation in support of the deployment claim substantially limits the assessed significance at present.

major comments (2)

[Abstract] Abstract: The claim that 'we deploy this policy on the DIII-D Tokamak and observe promising real-world results' is presented without any quantitative performance metrics, comparison to existing controllers or baselines, safety-constraint satisfaction rates, or validation details. This evidence is load-bearing for the paper's primary contribution and must be supplied to substantiate the generalization from offline data to live execution.
[Deployment / Results] Deployment / Results section: No analysis is provided of state-action space coverage in the historical DIII-D dataset, out-of-distribution detection during rollouts or execution, or mechanisms to prevent unsafe extrapolation. The skeptic concern that historical shots may not cover the relevant plasma conditions therefore remains unaddressed, directly affecting the reliability of the reported real-time deployment.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of substantiating the real-world deployment claim, which we address point by point below. We will revise the manuscript accordingly where feasible.

read point-by-point responses

Referee: [Abstract] Abstract: The claim that 'we deploy this policy on the DIII-D Tokamak and observe promising real-world results' is presented without any quantitative performance metrics, comparison to existing controllers or baselines, safety-constraint satisfaction rates, or validation details. This evidence is load-bearing for the paper's primary contribution and must be supplied to substantiate the generalization from offline data to live execution.

Authors: We agree that the abstract claim would be strengthened by additional detail. The deployment section of the manuscript describes the real-time execution on DIII-D and characterizes the outcomes as promising based on the observed alignment of the rotation profile with targets during the limited experimental shots. However, the study was exploratory with a small number of deployments, and no formal quantitative metrics (e.g., integrated error or direct controller comparisons) were recorded due to operational constraints. We will revise the abstract to more precisely describe the results as 'we deploy this policy on the DIII-D Tokamak under supervised conditions and observe profile evolution consistent with the learned behavior' and include any available time-series plots from the deployment in the revised manuscript. revision: partial
Referee: [Deployment / Results] Deployment / Results section: No analysis is provided of state-action space coverage in the historical DIII-D dataset, out-of-distribution detection during rollouts or execution, or mechanisms to prevent unsafe extrapolation. The skeptic concern that historical shots may not cover the relevant plasma conditions therefore remains unaddressed, directly affecting the reliability of the reported real-time deployment.

Authors: The referee correctly notes the absence of explicit coverage analysis. The training dataset consists of all qualifying historical DIII-D discharges containing rotation profile measurements, and deployment occurred on plasmas with parameters within the observed historical ranges. No dedicated out-of-distribution detection module was implemented; instead, the existing tokamak safety systems and operator oversight served as safeguards against unsafe actions. We will add a new subsection in the revised manuscript that summarizes key statistics of the historical dataset (e.g., ranges of plasma current, density, and actuator commands) and explicitly describes the operational safeguards used during real-time execution. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation or claims

full rationale

The paper trains offline RL and model-based RL policies solely on historical DIII-D data, uses probabilistic models for rollouts during training, and reports real-world deployment results. No equations, parameter fits, or self-citations are described that would make any reported result equivalent to its inputs by construction. The central claims rest on external historical data and physical deployment, with no self-definitional, fitted-input-renamed-as-prediction, or uniqueness-via-self-citation patterns present.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; all modeling choices remain implicit.

pith-pipeline@v0.9.1-grok · 5756 in / 1087 out tokens · 21207 ms · 2026-06-30T23:44:28.313649+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Reference graph

Works this paper leans on

4 extracted references · 1 canonical work pages

[1]

Niannian Wu, Zongyu Yang, Rongpeng Li, Ning Wei, Yihang Chen, Qianyun Dong, Jiyuan Li, Guohui Zheng, Xinwen Gong, Feng Gao, et al

doi: 10.1109/ICRA.2017.7989202. Niannian Wu, Zongyu Yang, Rongpeng Li, Ning Wei, Yihang Chen, Qianyun Dong, Jiyuan Li, Guohui Zheng, Xinwen Gong, Feng Gao, et al. High-fidelity data-driven dynamics model for reinforcement learning-based control in hl-3 tokamak.Communications Physics, 8(1):393, 2025. 15 SONKERKAGACHENROTHSTEINCHARSHOUSHAKOLEMENSCHNEIDER Ti...

work page doi:10.1109/icra.2017.7989202 2017
[2]

RPNN Training 6.1.1

Appendix 6.1. RPNN Training 6.1.1. STATE ANDACTUATORSPACE- The dynamics model (an ensemble of RPNN networks) models the transitions on a larger state and action space, in comparison to the policy. This is shown in table 2. Variables marked under Actuators and States are used in dynamics modeling. Only a subset of the states is provided as observations to ...
[3]

Fig.??shows the profile variation when the first target change occurs

Profile Views In this section, we present full rotation profile views across time. Fig.??shows the profile variation when the first target change occurs. This target change corresponds to a decrease in profile at t = 3s. Fig.??shows the profile changes that occur during the second target change at t = 4.5 s. At this point, more variations occur in the pla...
[4]

These plots show the actual amount of gas flowing into the plasma, corresponding well to the gas valve voltage, which the policy controls

Gas Flow In this section, plots showing the gas flow rate into the plasma are shown in Fig.??. These plots show the actual amount of gas flowing into the plasma, corresponding well to the gas valve voltage, which the policy controls. 19 SONKERKAGACHENROTHSTEINCHARSHOUSHAKOLEMENSCHNEIDER 0.0 0.5 1.0 n 0 50 100 150 200Rotation (km/s) t = 4400 ms 0.0 0.5 1.0...

[1] [1]

Niannian Wu, Zongyu Yang, Rongpeng Li, Ning Wei, Yihang Chen, Qianyun Dong, Jiyuan Li, Guohui Zheng, Xinwen Gong, Feng Gao, et al

doi: 10.1109/ICRA.2017.7989202. Niannian Wu, Zongyu Yang, Rongpeng Li, Ning Wei, Yihang Chen, Qianyun Dong, Jiyuan Li, Guohui Zheng, Xinwen Gong, Feng Gao, et al. High-fidelity data-driven dynamics model for reinforcement learning-based control in hl-3 tokamak.Communications Physics, 8(1):393, 2025. 15 SONKERKAGACHENROTHSTEINCHARSHOUSHAKOLEMENSCHNEIDER Ti...

work page doi:10.1109/icra.2017.7989202 2017

[2] [2]

RPNN Training 6.1.1

Appendix 6.1. RPNN Training 6.1.1. STATE ANDACTUATORSPACE- The dynamics model (an ensemble of RPNN networks) models the transitions on a larger state and action space, in comparison to the policy. This is shown in table 2. Variables marked under Actuators and States are used in dynamics modeling. Only a subset of the states is provided as observations to ...

[3] [3]

Fig.??shows the profile variation when the first target change occurs

Profile Views In this section, we present full rotation profile views across time. Fig.??shows the profile variation when the first target change occurs. This target change corresponds to a decrease in profile at t = 3s. Fig.??shows the profile changes that occur during the second target change at t = 4.5 s. At this point, more variations occur in the pla...

[4] [4]

These plots show the actual amount of gas flowing into the plasma, corresponding well to the gas valve voltage, which the policy controls

Gas Flow In this section, plots showing the gas flow rate into the plasma are shown in Fig.??. These plots show the actual amount of gas flowing into the plasma, corresponding well to the gas valve voltage, which the policy controls. 19 SONKERKAGACHENROTHSTEINCHARSHOUSHAKOLEMENSCHNEIDER 0.0 0.5 1.0 n 0 50 100 150 200Rotation (km/s) t = 4400 ms 0.0 0.5 1.0...