Reinforcement Learning for Optimal Experiment Design in Parameter Identification of Mechatronic Systems

Georg Schaefer; Jakob Rehrl; Julian Langschwert; Simon Hirlaender; Stefan Huber

arxiv: 2606.00059 · v1 · pith:MSMLCKDAnew · submitted 2026-05-19 · 💻 cs.RO · cs.LG

Pith reviewed 2026-06-30 18:14 UTC · model grok-4.3

keywords reinforcement learning · system identification · optimal experiment design · mechatronic systems · safety constraints · parameter identification · excitation signals

The pith

A reinforcement learning agent can design excitation signals for mechatronic parameter identification that match classical accuracy while keeping safety violations near 0.75 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Classical system identification requires experts to hand-craft input signals that stay inside hardware limits, which restricts how widely the method can be used. The paper trains a reinforcement learning agent to generate those signals automatically for a Quanser Aero 2 testbed. The agent receives shaped rewards that discourage unsafe actions, allowing it to learn inputs that produce accurate estimates of the three system parameters. If the result holds, non-experts could run reliable identification experiments on physical hardware without manual signal design or separate safety layers.
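
To make the idea of an informative excitation signal concrete, the following minimal sketch compares two inputs on a hypothetical two-parameter discrete-time model y[k+1] = a*y[k] + b*u[k]. It is not the Quanser Aero 2 model or the paper's estimator, neither of which is described in the abstract. The model, the parameter values, the least-squares fit, and the determinant-based information score are all assumptions chosen for illustration.

```python
# Illustrative sketch only: a hypothetical first-order model, not the Aero 2.

def simulate(a, b, u, y0=0.0):
    """Simulate y[k+1] = a*y[k] + b*u[k]; returns outputs y[0..N]."""
    y = [y0]
    for uk in u:
        y.append(a * y[-1] + b * uk)
    return y

def information_terms(y, u):
    """Entries of the 2x2 matrix sum(phi*phi^T) with phi[k] = (y[k], u[k])."""
    syy = sum(yk * yk for yk in y[:-1])
    syu = sum(yk * uk for yk, uk in zip(y[:-1], u))
    suu = sum(uk * uk for uk in u)
    return syy, syu, suu

def d_criterion(y, u):
    """Determinant of the information matrix (larger means more informative)."""
    syy, syu, suu = information_terms(y, u)
    return syy * suu - syu * syu

def least_squares(y, u):
    """Solve the 2x2 normal equations for (a, b) in closed form."""
    syy, syu, suu = information_terms(y, u)
    ry = sum(yk * yn for yk, yn in zip(y[:-1], y[1:]))
    ru = sum(uk * yn for uk, yn in zip(u, y[1:]))
    det = syy * suu - syu * syu
    a_hat = (suu * ry - syu * ru) / det
    b_hat = (syy * ru - syu * ry) / det
    return a_hat, b_hat

A_TRUE, B_TRUE, N = 0.9, 0.5, 200                 # assumed values
u_const = [1.0] * N                               # step input
u_square = [1.0 if (k // 10) % 2 == 0 else -1.0   # square wave, period 20
            for k in range(N)]

y_const = simulate(A_TRUE, B_TRUE, u_const)
y_square = simulate(A_TRUE, B_TRUE, u_square)

a_hat, b_hat = least_squares(y_square, u_square)
print("estimate from square wave:", a_hat, b_hat)
print("D-criterion, step input:  ", d_criterion(y_const, u_const))
print("D-criterion, square wave: ", d_criterion(y_square, u_square))
```

In this toy setting the square wave keeps the output moving, so its regressors are less collinear than those of the step input and the information score is larger. An agent designing excitation signals is, loosely, searching for inputs of this kind while also respecting hardware limits.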

Core claim

The reinforcement learning agent learns excitation signals that achieve competitive estimation accuracy across all three identified parameters on the Quanser Aero 2, outperforming classical baselines while incurring only 0.75 percent safety violations across ten independent training seeds.

What carries the argument

Reinforcement learning agent trained with reward shaping to produce safe, informative excitation signals for system identification.
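
The abstract does not state the reward function, so the following is only a sketch of what reward shaping for safety can look like in general. The function name `shaped_reward`, the pitch limit, the margin, and the penalty weight are hypothetical and are not taken from the paper.

```python
# Illustrative sketch only: all names and numbers below are assumptions.

def shaped_reward(info_gain, pitch_rad,
                  pitch_limit_rad=0.6, margin_rad=0.1, penalty_weight=10.0):
    """Information reward minus a soft penalty that starts inside the limit.

    The penalty grows linearly once |pitch| enters the margin band before
    the hard limit, so the agent is discouraged before a violation occurs.
    """
    soft_limit = pitch_limit_rad - margin_rad
    excess = max(0.0, abs(pitch_rad) - soft_limit)
    return info_gain - penalty_weight * excess

def is_violation(pitch_rad, pitch_limit_rad=0.6):
    """True when the hard limit itself is exceeded."""
    return abs(pitch_rad) > pitch_limit_rad

safe_step = shaped_reward(info_gain=1.0, pitch_rad=0.2)
unsafe_step = shaped_reward(info_gain=1.0, pitch_rad=0.7)
tempting_step = shaped_reward(info_gain=5.0, pitch_rad=0.7)
print(safe_step, unsafe_step, tempting_step)
```

The third call shows why a shaped penalty is a soft constraint: with a finite weight, a large enough information term can still make a limit-exceeding step net positive, which is one way a small but nonzero violation rate can arise.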

If this is right

Signal design for system identification no longer requires expert hand-crafting to meet safety limits.
The same reward-shaping approach can keep violation rates low during both learning and deployment phases.
The agent delivers usable parameter estimates for at least the three parameters of the tested device.
The method removes the generalizability barrier that comes from reliance on manually designed signals.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Retraining the agent on other mechatronic platforms could extend the approach beyond the single testbed studied.
Adding explicit constraint solvers on top of the shaped rewards might drive violations even lower if needed.
The 0.75 percent violation figure suggests the shaping works in practice, but repeating the experiment on hardware with tighter safety margins would test robustness.

Load-bearing premise

Reward shaping by itself is enough to keep the physical hardware safe throughout both training and later use.

What would settle it

Deploy the trained agent on the physical Quanser Aero 2 and measure either large errors in the recovered parameter values or a safety-violation rate above a few percent.
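
A check of this kind could be scripted roughly as follows. This is a sketch under stated assumptions: the per-step violation flags, the reference parameter values, and both thresholds (3 percent for violations, 10 percent for relative error) are hypothetical stand-ins, since the abstract does not define how violations are logged or what counts as a large error.

```python
# Illustrative sketch only: thresholds and data are assumptions.

def violation_rate(violation_flags):
    """Fraction of logged steps flagged as safety violations."""
    flags = list(violation_flags)
    if not flags:
        raise ValueError("need at least one logged step")
    return sum(1 for f in flags if f) / len(flags)

def max_relative_error(estimates, references):
    """Largest relative error over the identified parameters."""
    return max(abs(e - r) / abs(r) for e, r in zip(estimates, references))

def claim_survives(violation_flags, estimates, references,
                   max_rate=0.03, max_rel_err=0.10):
    """True if both the safety and the accuracy checks pass."""
    return (violation_rate(violation_flags) <= max_rate
            and max_relative_error(estimates, references) <= max_rel_err)

# Made-up deployment log: 3 flagged steps out of 400, three parameters.
flags = [False] * 397 + [True] * 3
references = [1.0, 0.5, 2.0]
good_estimates = [1.02, 0.49, 2.1]
bad_estimates = [1.5, 0.5, 2.0]

print(violation_rate(flags))
print(claim_survives(flags, good_estimates, references))
print(claim_survives(flags, bad_estimates, references))
```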

Original abstract

Informative excitation signals are critical for accurate system identification of mechatronic systems, yet classical system identification (SI) approaches require expert knowledge and hand-crafted signal design to respect hardware safety constraints, limiting their generalizability. We propose a reinforcement learning (RL) agent that learns optimal excitation signals for a Quanser Aero 2 testbed while autonomously enforcing safety constraints through reward shaping. Evaluated across 10 independent training seeds, our comprehensive agent achieves competitive estimation accuracy across all three identified parameters, outperforming classical baselines while incurring only 0.75% safety violations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

RL with reward shaping for excitation design on Quanser Aero 2 gives real-hardware numbers but leaves state, reward, and baseline details unspecified.

The main point is that an RL agent learns excitation signals for identifying three parameters on the Quanser Aero 2, reports competitive accuracy against classical baselines, and keeps safety violations at 0.75% over ten seeds by using reward shaping.

The concrete hardware run is the useful piece. Most RL work stays simulated; this one closes the loop on physical equipment and shows the outcome across multiple seeds. The reward-shaping approach for safety is a direct fit for the problem and avoids needing an extra safety layer in the reported setup.

The soft spots sit in the missing pieces. The abstract gives no state representation, no explicit reward function, no description of how the classical baselines were implemented, and no statistical tests. Reward shaping supplies a soft penalty, so the stress-test concern about possible violations under noise or model mismatch is reasonable; nothing shown rules it out. If the full paper supplies the equations, the shaped reward weights, and verification that the 0.75% figure survives deployment, the result becomes easier to trust.

This is for control and robotics researchers who already do system identification on similar testbeds and want to reduce hand-crafted signal design. A reader working with RL in mechatronics or with Quanser hardware would see the most direct value.

I would not cite it in the next year because the scope stays narrow and the verification details are absent from the provided text. It still deserves peer review: the real-hardware evaluation and the safety angle are enough to justify sending it to referees who can check the implementation.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes a reinforcement learning agent to generate optimal excitation signals for parameter identification of mechatronic systems on the Quanser Aero 2 testbed. Safety constraints are enforced solely via reward shaping. Across 10 independent training seeds, the agent is claimed to achieve competitive accuracy on three identified parameters while outperforming classical baselines and incurring only 0.75% safety violations.

Significance. If the empirical results prove robust and reproducible, the approach could reduce reliance on expert-designed signals in safe system identification. The work does not ship machine-checked proofs, reproducible code, or parameter-free derivations, so its significance rests entirely on the strength of the reported experiments.

major comments (2)

[Abstract] The central claim of outperformance with only 0.75% safety violations cannot be evaluated because the text provides no description of the state representation, reward function, baseline implementations, statistical tests, or the precise definition and measurement of safety violations.
[Abstract] The assertion that reward shaping alone enforces hardware safety is load-bearing for the safety claim, yet no analysis, bound, or verification is supplied showing that finite penalty weights prevent violations under model mismatch, sensor noise, or policy deployment on physical hardware.
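
One way to attach uncertainty to a reported violation percentage is a binomial confidence interval. The sketch below uses the standard Wilson score interval. The counts are hypothetical: the abstract reports only the 0.75% figure, not the number of steps or episodes behind it, so 3 violations in 400 steps is an assumed example that happens to give the same percentage. It also treats steps as independent, which real rollouts may not satisfy.

```python
# Illustrative sketch only: the counts below are assumptions.
import math

def wilson_interval(violations, steps, z=1.96):
    """Wilson score interval for a violation proportion (z=1.96 ~ 95%)."""
    if steps <= 0:
        raise ValueError("steps must be positive")
    p = violations / steps
    denom = 1.0 + z * z / steps
    center = (p + z * z / (2 * steps)) / denom
    half = z * math.sqrt(p * (1 - p) / steps
                         + z * z / (4 * steps * steps)) / denom
    return max(0.0, center - half), min(1.0, center + half)

low, high = wilson_interval(violations=3, steps=400)
print(f"point estimate: {3 / 400:.4f}")
print(f"95% interval:   [{low:.4f}, {high:.4f}]")
```

With few observed violations the interval is wide relative to the point estimate, so the sample size behind the percentage matters as much as the percentage itself.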

minor comments (1)

[Abstract] The term 'comprehensive agent' is used without definition or comparison to ablated variants.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments below and will revise the manuscript accordingly to improve clarity and completeness.

Referee: [Abstract] The central claim of outperformance with only 0.75% safety violations cannot be evaluated because the text provides no description of the state representation, reward function, baseline implementations, statistical tests, or the precise definition and measurement of safety violations.

Authors: The abstract is intentionally concise, but the requested details are present in the main text: state representation and reward function (including safety penalties) appear in Section 3, baseline implementations and statistical tests (across 10 seeds) in Section 5, and the precise definition and measurement of safety violations (hardware limit exceedances) in Sections 4 and 5. To make the abstract self-contained as requested, we will revise it to include brief parenthetical summaries of these elements without exceeding length limits.

Revision: yes

Referee: [Abstract] The assertion that reward shaping alone enforces hardware safety is load-bearing for the safety claim, yet no analysis, bound, or verification is supplied showing that finite penalty weights prevent violations under model mismatch, sensor noise, or policy deployment on physical hardware.

Authors: We agree that the safety claim rests on empirical results rather than theoretical guarantees. The 0.75% violation rate is measured directly from physical hardware rollouts under the learned policy (reported in Section 5). The manuscript does not supply formal bounds or proofs that finite penalties guarantee zero violations under arbitrary model mismatch or noise; such analysis is outside the paper's empirical scope. We will add an explicit limitations paragraph in Section 6 discussing the empirical nature of the safety enforcement and observed violation statistics under the tested conditions.

Revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical RL training results with no derivation chain

The manuscript reports experimental outcomes from training an RL agent on the Quanser Aero 2 hardware, measuring parameter estimation accuracy and safety-violation percentages across seeds. No equations, uniqueness theorems, fitted parameters renamed as predictions, or self-citation load-bearing steps are present. The central performance numbers are direct measurements from simulation and hardware runs rather than any algebraic reduction to the training inputs. This matches the default expectation of a non-circular empirical paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; all such elements remain unknown.

