Bi-Level Reinforcement Learning Control for an Underactuated Blimp via Center-of-Mass Reconfiguration
Pith reviewed 2026-05-09 14:56 UTC · model grok-4.3
The pith
Bi-level reinforcement learning decouples center-of-mass planning from thrust control to enable accurate tracking in underactuated blimps.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a bi-level controller overcomes the strong nonlinear coupling and underactuation inherent in a compact two-thruster blimp. An outer reinforcement learning policy selects a target-dependent center-of-mass (CoM) configuration, an inner policy produces thrust commands, and the two are trained with a two-stage learning strategy. The paper reports higher tracking accuracy, robustness, and sim-to-real transfer than fixed-CoM baselines and PID controllers across a 27-goal test set in both simulation and hardware.
What carries the argument
The bi-level policy architecture that explicitly separates task-level CoM configuration planning from continuous thrust generation, supported by a two-stage training process.
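The separation described above can be pictured as a controller that queries the outer policy once per flight and the inner policy at every control step. The Python sketch below is illustrative only: `BiLevelController`, `ToyEnv`, the two placeholder policies, and every numeric value are hypothetical and are not taken from the paper, which uses learned policies and a blimp model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Goal = Tuple[float, float, float]
Thrust = Tuple[float, float]


class ToyEnv:
    """Hypothetical stand-in for a blimp simulator; it only counts steps."""

    def __init__(self, horizon: int = 10) -> None:
        self.horizon = horizon
        self.t = 0
        self.slider = 0.0

    def reset(self, goal: Goal, slider: float) -> List[float]:
        # The CoM slider position is fixed here, before the flight starts.
        self.t = 0
        self.slider = slider
        return [0.0, 0.0, 0.0, *goal]

    def step(self, thrust: Thrust) -> Tuple[List[float], bool]:
        # Placeholder dynamics: no physics, just advance the clock.
        self.t += 1
        return [float(self.t), 0.0, 0.0], self.t >= self.horizon


@dataclass
class BiLevelController:
    outer: Callable[[Goal], float]                 # goal -> slider position
    inner: Callable[[List[float], float], Thrust]  # (obs, slider) -> thrusts

    def run_episode(self, env: ToyEnv, goal: Goal, max_steps: int = 200):
        slider = self.outer(goal)             # outer decision: once, pre-flight
        obs = env.reset(goal, slider)
        inner_calls = 0
        for _ in range(max_steps):
            thrust = self.inner(obs, slider)  # inner decision: every step
            obs, done = env.step(thrust)
            inner_calls += 1
            if done:
                break
        return slider, inner_calls


def placeholder_outer(goal: Goal) -> float:
    # Assumed rule: shift the slider in proportion to goal height, clamped.
    return max(-0.05, min(0.05, 0.01 * goal[2]))


def placeholder_inner(obs: List[float], slider: float) -> Thrust:
    # Assumed constant, equal thrust on both thrusters.
    return (0.5, 0.5)


controller = BiLevelController(placeholder_outer, placeholder_inner)
slider, inner_calls = controller.run_episode(ToyEnv(horizon=10), (4.5, 0.0, 2.0))
```

The point of the structure is the call pattern: `outer` runs once per episode, while `inner` runs at every step and receives the fixed slider position as part of its input.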
If this is right
- Straight-line tracking accuracy improves without adding more thrusters or complex actuators.
- Robustness increases relative to both fixed center-of-mass and classical PID controllers.
- Sim-to-real transfer succeeds for this class of underactuated aerial systems.
- Compact blimp designs become viable for applications that value payload capacity and energy use.
- The explicit decoupling reduces the effect of strong nonlinear dynamics on closed-loop performance.
Where Pith is reading between the lines
- The same outer-inner split could be tested on other underactuated vehicles such as certain quadrotors or underwater gliders that allow internal mass shifting.
- Allowing the outer policy to adjust the CoM continuously during flight rather than only before takeoff might extend the method to more agile maneuvers.
- The convergence analysis of the bi-level process offers a template for designing stable hierarchical controllers in other nonlinear robotic systems.
- Longer endurance missions become feasible if the reduced actuator count lowers power draw while maintaining path accuracy.
Load-bearing premise
That splitting control into a pre-flight CoM planning policy and a separate thrust policy, trained in two stages, is sufficient to manage the nonlinear coupling and underactuation without extra actuators or detailed models.
What would settle it
If real-world trials on the 27-goal set showed no improvement in tracking accuracy or robustness over fixed-CoM or PID controllers, or if sim-to-real transfer collapsed, the central claim would be falsified.
Original abstract
This paper investigates goal-directed tracking control of underactuated blimps with center-of-mass (CoM) reconfiguration. Unlike conventional overactuated blimp designs that rely on redundant actuation for simplified control, this paper focuses on a compact architecture consisting of two thrusters and a movable internal slider, aiming to improve energy efficiency and payload capacity. This hardware-efficient configuration introduces significant underactuation and strong nonlinear coupling between CoM dynamics and vehicle motion. To address these challenges, this paper proposes a bi-level reinforcement learning framework that explicitly decouples task-level CoM planning from continuous thrust control. The outer policy determines a target-dependent CoM configuration prior to flight, while the inner policy generates thrust commands to track straight-line references. To ensure stable learning, this paper introduces a two-stage learning strategy, supported by a convergence analysis of the resulting bi-level process. Extensive simulations and real-world experiments on a 27-goal evaluation set demonstrate that the proposed method consistently outperforms fixed-CoM baselines and PID-based controllers, achieving higher tracking accuracy, enhanced robustness, and reliable sim-to-real transfer.
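Tracking accuracy against a straight-line reference is commonly quantified by cross-track error, the perpendicular distance from the vehicle position to the reference line. The sketch below assumes that metric and an infinite line through a start point and a goal; the paper's exact error definition is not reproduced here, and the function name and arguments are hypothetical.

```python
import math
from typing import Sequence


def cross_track_error(
    p: Sequence[float], start: Sequence[float], goal: Sequence[float]
) -> float:
    """Perpendicular distance from p to the line through start and goal."""
    d = [g - s for g, s in zip(goal, start)]     # line direction
    r = [x - s for x, s in zip(p, start)]        # vector from start to p
    dd = sum(c * c for c in d)
    if dd == 0.0:
        raise ValueError("start and goal must differ")
    t = sum(a * b for a, b in zip(r, d)) / dd    # projection parameter
    foot = [s + t * c for s, c in zip(start, d)]  # closest point on the line
    return math.dist(p, foot)


# A point 1 m to the side of a line running along the x-axis.
example_error = cross_track_error((1.0, 1.0, 0.0), (0.0, 0.0, 0.0), (2.0, 0.0, 0.0))
```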
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper addresses goal-directed tracking control for underactuated blimps using a bi-level reinforcement learning framework. It decouples CoM reconfiguration planning (an outer policy that selects a target-dependent CoM pose before flight) from inner thrust control for straight-line tracking. A two-stage learning strategy with a convergence analysis is introduced, and simulation and real-world experiments on 27 goals are reported to outperform fixed-CoM baselines and PID controllers in tracking accuracy, robustness, and sim-to-real transfer.
Significance. If the central claims hold, this work could significantly advance control strategies for underactuated aerial vehicles by leveraging CoM reconfiguration and RL to handle nonlinear couplings without additional actuators. The two-stage learning and convergence analysis provide a structured approach to stable policy training. Strengths include the hardware-efficient design focus and comprehensive experimental evaluation demonstrating practical sim-to-real applicability. This could influence designs for energy-efficient blimps in applications like surveillance or delivery.
major comments (3)
- Convergence analysis section: the analysis of the bi-level process relies on stability margins or Lipschitz conditions that are not independently verified against the real blimp's aerodynamic effects and slider friction. If those margins are violated even on a subset of the 27 goals, the reported tracking accuracy and sim-to-real transfer would not generalize.
- Bi-level framework (outer/inner policy decoupling): the assumption that a single pre-flight CoM pose selected by the outer policy renders the underactuated 2-thruster dynamics sufficiently controllable for the inner policy to reject all disturbances without further CoM motion requires stronger justification, as residual nonlinear couplings may persist.
- Experimental results on 27-goal set: the claim of consistent outperformance lacks reported quantitative metrics, error bars, or statistical tests in the evaluation details, making it difficult to assess the magnitude and reliability of improvements over fixed-CoM and PID baselines.
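The second comment turns on whether the dynamics stay sufficiently controllable once the CoM pose is fixed. One standard way to probe this on a linearized model is the Kalman rank test, which checks whether the matrix [B, AB, ..., A^(n-1)B] has full rank n. The sketch below uses only the standard library; the matrices are textbook double-integrator examples chosen for illustration, not a blimp model, and the helper names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matmul(X: Matrix, Y: Matrix) -> Matrix:
    return [
        [sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
        for i in range(len(X))
    ]


def rank(M: Matrix, tol: float = 1e-9) -> int:
    """Rank by Gaussian elimination with partial pivoting (absolute tolerance)."""
    M = [row[:] for row in M]
    rows, cols = len(M), len(M[0])
    r = 0
    for c in range(cols):
        if r == rows:
            break
        pivot = max(range(r, rows), key=lambda i: abs(M[i][c]))
        if abs(M[pivot][c]) < tol:
            continue  # no usable pivot in this column
        M[r], M[pivot] = M[pivot], M[r]
        for i in range(r + 1, rows):
            f = M[i][c] / M[r][c]
            for j in range(c, cols):
                M[i][j] -= f * M[r][j]
        r += 1
    return r


def controllability_matrix(A: Matrix, B: Matrix) -> Matrix:
    """Stack [B, AB, ..., A^(n-1)B] side by side."""
    n = len(A)
    blocks = [B]
    for _ in range(n - 1):
        blocks.append(matmul(A, blocks[-1]))
    return [sum((blk[i] for blk in blocks), []) for i in range(n)]


def is_controllable(A: Matrix, B: Matrix) -> bool:
    return rank(controllability_matrix(A, B)) == len(A)


# Double integrator: input enters through acceleration (controllable)
# versus through position only (not controllable).
A = [[0.0, 1.0], [0.0, 0.0]]
controllable_case = is_controllable(A, [[0.0], [1.0]])
uncontrollable_case = is_controllable(A, [[1.0], [0.0]])
```

A full-rank result on a linearization is only a local statement; it would not by itself settle the referee's concern about residual nonlinear coupling.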
minor comments (3)
- Abstract: the mention of 'convergence analysis' should briefly note the key assumptions or theorems to aid readers.
- Notation throughout: ensure consistent symbols for outer CoM policy versus inner thrust policy to avoid ambiguity in the method description.
- Figures in experiments: trajectory plots should include error bands or results from multiple runs to improve clarity of robustness claims.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments on our manuscript. We address each major comment point by point below, indicating the revisions we will incorporate.
- Referee: Convergence analysis section: the analysis of the bi-level process relies on stability margins or Lipschitz conditions that are not independently verified against the real blimp's aerodynamic effects and slider friction. If those margins are violated even on a subset of the 27 goals, the reported tracking accuracy and sim-to-real transfer would not generalize.
  Authors: We acknowledge that the convergence analysis is derived under modeling assumptions and would be strengthened by explicit validation against hardware-specific effects. In the revised manuscript, we will add a sensitivity analysis that perturbs the Lipschitz constants and stability margins using experimentally measured ranges for aerodynamic drag and slider friction. We will also report the percentage of the 27 goals for which the assumptions hold and include a discussion of any observed performance degradation when margins are approached.
  Revision: yes
- Referee: Bi-level framework (outer/inner policy decoupling): the assumption that a single pre-flight CoM pose selected by the outer policy renders the underactuated 2-thruster dynamics sufficiently controllable for the inner policy to reject all disturbances without further CoM motion requires stronger justification, as residual nonlinear couplings may persist.
  Authors: The single pre-flight CoM selection is intended to shift the system equilibrium so that straight-line tracking becomes feasible with thrust alone, exploiting the underactuation structure. We agree that residual couplings warrant further justification. The revision will include an expanded controllability analysis (rank conditions on the linearized dynamics after CoM shift) and an ablation experiment comparing fixed-pose versus continuous-CoM policies to quantify the impact of any remaining nonlinearities.
  Revision: partial
- Referee: Experimental results on 27-goal set: the claim of consistent outperformance lacks reported quantitative metrics, error bars, or statistical tests in the evaluation details, making it difficult to assess the magnitude and reliability of improvements over fixed-CoM and PID baselines.
  Authors: We agree that the current presentation of results would benefit from greater quantitative detail. The revised experimental section will include tables with mean tracking errors and standard deviations for position and orientation across all 27 goals, error bars on all performance plots, and statistical significance tests (paired t-tests) comparing the proposed method against the fixed-CoM and PID baselines in both simulation and real-world trials.
  Revision: yes
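The paired t-test named in the last response compares two controllers goal by goal, using the per-goal difference in tracking error. The sketch below computes the test statistic with the standard library only. The function name is hypothetical, and the four-element lists are made-up placeholder numbers for illustration, not results from the paper.

```python
import math
import statistics
from typing import Sequence, Tuple


def paired_t_statistic(
    errors_a: Sequence[float], errors_b: Sequence[float]
) -> Tuple[float, int]:
    """Return (t, degrees of freedom) for paired samples a and b."""
    if len(errors_a) != len(errors_b) or len(errors_a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd == 0.0:
        raise ValueError("differences are constant; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Placeholder per-goal errors (illustrative values only).
baseline_errors = [1.0, 2.0, 3.0, 4.0]
proposed_errors = [0.5, 1.0, 2.5, 3.0]
t_stat, dof = paired_t_statistic(baseline_errors, proposed_errors)
```

With one error value per goal on the 27-goal set, the test would have 26 degrees of freedom, for which the two-sided 5% critical value of Student's t is about 2.056. A positive t above that threshold would indicate that the first controller's errors are significantly larger than the second's.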
Circularity Check
No circularity: empirical validation independent of derivations
The paper presents a bi-level RL control method for an underactuated blimp, with an outer policy selecting CoM configuration and an inner policy handling thrust tracking, plus a two-stage training strategy and convergence analysis. All load-bearing claims of superior tracking accuracy, robustness, and sim-to-real transfer are supported by direct experimental comparisons on a 27-goal set against fixed-CoM baselines and PID controllers, rather than any first-principles derivation, fitted parameter renamed as prediction, or self-citation chain. No equations or steps reduce by construction to their inputs; the framework is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- RL training hyperparameters
axioms (1)
- Domain assumption: the bi-level process converges under the proposed two-stage learning strategy.