Bi-Level Reinforcement Learning Control for an Underactuated Blimp via Center-of-Mass Reconfiguration

Feitian Zhang; Hao Cheng; Hongwu Wang; Xiaorui Wang; Yue Fan

arxiv: 2605.01289 · v1 · submitted 2026-05-02 · 💻 cs.RO

Xiaorui Wang, Hongwu Wang, Yue Fan, Hao Cheng, Feitian Zhang

Pith reviewed 2026-05-09 14:56 UTC · model grok-4.3

classification 💻 cs.RO

keywords bi-level reinforcement learning, underactuated blimp, center-of-mass reconfiguration, goal-directed tracking, thrust control, aerial robotics, sim-to-real transfer, nonlinear coupling

The pith

Bi-level reinforcement learning decouples center-of-mass planning from thrust control to enable accurate tracking in underactuated blimps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper sets out to demonstrate that a bi-level reinforcement learning method can achieve goal-directed tracking for a blimp that has only two thrusters and one movable internal slider to shift its center of mass. The outer policy chooses a fixed CoM location suited to each target before takeoff, while the inner policy learns continuous thrust commands to follow straight-line paths. A two-stage training procedure stabilizes the overall process and comes with a convergence argument. Readers should care because the approach shows that hardware with fewer actuators can still deliver reliable performance if the control task is split this way, which opens the door to lighter, more energy-efficient aerial vehicles.

Core claim

The central claim is that an outer reinforcement learning policy that selects a target-dependent center-of-mass configuration, paired with an inner policy that produces thrust commands, together with a two-stage learning strategy, overcomes the strong nonlinear coupling and underactuation inherent in a compact two-thruster blimp, delivering higher tracking accuracy, robustness, and sim-to-real transfer than either fixed-CoM baselines or PID controllers across a 27-goal test set in both simulation and hardware.

What carries the argument

The bi-level policy architecture that explicitly separates task-level CoM configuration planning from continuous thrust generation, supported by a two-stage training process.
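
A minimal Python sketch of this separation may help fix ideas. It assumes a generic rollout interface: the names `BiLevelController`, `toy_outer`, `toy_inner`, and `toy_step`, the thresholds, and the toy kinematics are all illustrative assumptions, not the paper's learned policies or blimp model. Only the structure follows the description above: one slider choice per target before flight, then thrust commands at every step.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Vec3 = Tuple[float, float, float]
Thrust = Tuple[float, float]  # (left, right) thruster commands


@dataclass
class BiLevelController:
    # Outer level: target -> slider position (cm), queried once per flight.
    outer: Callable[[Vec3], float]
    # Inner level: (position, target, slider) -> thrust, queried every step.
    inner: Callable[[Vec3, Vec3, float], Thrust]

    def run_episode(self, target: Vec3, step, start: Vec3, horizon: int):
        slider = self.outer(target)  # chosen before takeoff, never changed below
        pos: Vec3 = start
        log: List[Tuple[Vec3, Thrust]] = []
        for _ in range(horizon):
            thrust = self.inner(pos, target, slider)
            pos = step(pos, thrust, slider)
            log.append((pos, thrust))
        return slider, log


# --- Hypothetical stand-ins so the sketch runs; NOT the learned policies. ---
def toy_outer(target: Vec3) -> float:
    # Placeholder rule: pick one of three candidate slider positions by target
    # height. Sign convention and thresholds are arbitrary assumptions.
    z = target[2]
    return -5.0 if z > 0.5 else (5.0 if z < -0.5 else 0.0)


def toy_inner(pos: Vec3, target: Vec3, slider: float) -> Thrust:
    # Placeholder: equal thrust proportional to remaining forward distance,
    # clipped to [0, 1]. The slider argument is unused in this stub.
    u = max(0.0, min(1.0, 0.5 * (target[0] - pos[0])))
    return (u, u)


def toy_step(pos: Vec3, thrust: Thrust, slider: float, dt: float = 0.1) -> Vec3:
    # Placeholder kinematics: forward speed follows mean thrust; no blimp physics.
    return (pos[0] + dt * 0.5 * (thrust[0] + thrust[1]), pos[1], pos[2])


controller = BiLevelController(outer=toy_outer, inner=toy_inner)
slider, log = controller.run_episode((4.5, 0.0, 1.0), toy_step, (0.0, 0.0, 0.0), horizon=50)
```

The design point the sketch tries to capture is that `slider` is an episode-level decision while `thrust` is a step-level one, so the two policies can be trained and evaluated on different time scales.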

If this is right

Straight-line tracking accuracy improves without adding more thrusters or complex actuators.
Robustness increases relative to both fixed center-of-mass and classical PID controllers.
Sim-to-real transfer succeeds for this class of underactuated aerial systems.
Compact blimp designs become viable for applications that value payload capacity and energy use.
The explicit decoupling reduces the effect of strong nonlinear dynamics on closed-loop performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same outer-inner split could be tested on other underactuated vehicles such as certain quadrotors or underwater gliders that allow internal mass shifting.
Allowing the outer policy to adjust the CoM continuously during flight rather than only before takeoff might extend the method to more agile maneuvers.
The convergence analysis of the bi-level process offers a template for designing stable hierarchical controllers in other nonlinear robotic systems.
Longer endurance missions become feasible if the reduced actuator count lowers power draw while maintaining path accuracy.

Load-bearing premise

That splitting control into a pre-flight CoM planning policy and a separate thrust policy, trained in two stages, is sufficient to manage the nonlinear coupling and underactuation without extra actuators or detailed models.

What would settle it

If real-world trials on the 27-goal set showed no improvement in tracking accuracy or robustness over fixed-CoM or PID controllers, or if sim-to-real transfer collapsed, the central claim would be falsified.

Figures

Figures reproduced from arXiv: 2605.01289 by Feitian Zhang, Hao Cheng, Hongwu Wang, Xiaorui Wang, Yue Fan.

**Figure 1.** Problem setting, key insight, and proposed bi-level RL solution for goal-directed tracking control of RGBlimp. The platform is severely underactuated, …

**Figure 2.** RGBlimp prototype design includes an envelope, a pair of main wings, …

**Figure 3.** Overview of the proposed Bi-Level RL framework for RGBlimp.

**Figure 4.** Learned outer-level slider policy π_ϕc(c | ζ). Each voxel represents a target ζ = [ζ_x, ζ_y, ζ_z]⊤, with color indicating the selected slider configuration c (cm). The policy shows strong symmetry with respect to ζ_y and varies primarily with target height, moving the slider backward for higher targets and forward for lower targets. denotes the mechanical zero of the slider rather than a physically critical …

**Figure 5.** Over all 27 targets, PID-SPG yields consistently larger …

**Figure 5.** 3D cross-track RMSE for flights from the origin to 27 target points. (a) Task setup of goal-directed tracking from a common start position to different …

**Figure 6.** Flight snapshots of Bi-Level RL for three representative targets. The slider position is adaptively adjusted according to the target, after which the blimp …

**Figure 7.** Time histories of cross-track error e_trk for nine representative targets on the ζ_x = 4.5 m slice. (a)–(i) correspond to different combinations of ζ_y ∈ {−2, 0, 2} m and ζ_z ∈ {−1, 0, 1} m. The dashed curves denote the fixed-slider baselines SAC-Fixed (−5), SAC-Fixed (0), and SAC-Fixed (5), while the solid curve denotes Bi-Level RL

**Figure 8.** Comparison of trajectories between the learned inner SAC controller (Bi-Level RL) and the PID inner controller with the learned slider policy (PID-SPG).
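
The cross-track error and RMSE reported in Figures 5 and 7 can be illustrated with a short Python sketch. It assumes the error is the perpendicular distance from the blimp position to the infinite straight line through the start and the target; the paper's exact definition (for example, whether the line is clamped to a segment) is not given here, and the function names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def cross_track_error(p: Vec3, start: Vec3, goal: Vec3) -> float:
    """Perpendicular distance from point p to the infinite line start -> goal."""
    d = tuple(g - s for g, s in zip(goal, start))   # line direction
    r = tuple(pi - s for pi, s in zip(p, start))    # position relative to start
    d_norm2 = sum(di * di for di in d)
    if d_norm2 == 0.0:
        raise ValueError("start and goal must differ")
    t = sum(ri * di for ri, di in zip(r, d)) / d_norm2  # along-track fraction
    perp = tuple(ri - t * di for ri, di in zip(r, d))   # remove along-track part
    return math.sqrt(sum(c * c for c in perp))


def cross_track_rmse(path: Sequence[Vec3], start: Vec3, goal: Vec3) -> float:
    """Root-mean-square cross-track error over a sampled trajectory."""
    if not path:
        raise ValueError("path must contain at least one sample")
    errs = [cross_track_error(p, start, goal) for p in path]
    return math.sqrt(sum(e * e for e in errs) / len(errs))


# Illustrative samples only; these are not measurements from the paper.
origin = (0.0, 0.0, 0.0)
target = (4.5, 0.0, 0.0)
samples = [(1.0, 0.3, 0.4), (2.0, 0.0, 0.0)]
rmse = cross_track_rmse(samples, origin, target)
```

Because the error is measured against the line rather than the goal point, a flight can have low cross-track RMSE and still stop short of the target, so this metric is usually read alongside a terminal-distance measure.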

Original abstract

This paper investigates goal-directed tracking control of underactuated blimps with center-of-mass (CoM) reconfiguration. Unlike conventional overactuated blimp designs that rely on redundant actuation for simplified control, this paper focuses on a compact architecture consisting of two thrusters and a movable internal slider, aiming to improve energy efficiency and payload capacity. This hardware-efficient configuration introduces significant underactuation and strong nonlinear coupling between CoM dynamics and vehicle motion. To address these challenges, this paper proposes a bi-level reinforcement learning framework that explicitly decouples task-level CoM planning from continuous thrust control. The outer policy determines a target-dependent CoM configuration prior to flight, while the inner policy generates thrust commands to track straight-line references. To ensure stable learning, this paper introduces a two-stage learning strategy, supported by a convergence analysis of the resulting bi-level process. Extensive simulations and real-world experiments on a 27-goal evaluation set demonstrate that the proposed method consistently outperforms fixed-CoM baselines and PID-based controllers, achieving higher tracking accuracy, enhanced robustness, and reliable sim-to-real transfer.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

A private letter to a colleague.

The bi-level RL approach picks a fixed CoM position via the slider before flight, then uses thrust for tracking; it outperforms fixed-CoM and PID baselines on 27 goals in both simulation and hardware.

The main takeaway is that this paper shows a practical way to handle underactuation in a compact two-thruster blimp by moving an internal slider to set the center of mass once before each flight, then learning a separate policy for thrust commands to follow straight-line paths. The outer level chooses the CoM target based on the goal, and the inner level handles continuous control, with a two-stage training process to stabilize learning plus a claimed convergence analysis for the overall setup. Their experiments on a 27-goal set report better tracking accuracy, robustness, and sim-to-real transfer than the baselines they compare against. That experimental coverage on real hardware is the strongest part and gives the work concrete value for energy-efficient designs that avoid extra actuators. The new element is the explicit bi-level split applied to this specific slider-equipped platform, which is not just a rehash of prior RL blimp work.

The soft spots center on the fixed-CoM assumption after the initial choice. If the chosen position leaves residual nonlinear couplings or unmodeled effects like slider friction and aerodynamics unaddressed during flight, performance could degrade on some goals even if the average looks good. The convergence analysis would need to be checked against those real dynamics rather than just simulation conditions, and the abstract gives no architecture details, exact metrics, or error bars to judge the size of the gains.

This paper is aimed at robotics researchers working on underactuated aerial vehicles and RL for minimal-actuation systems. Readers interested in hardware-efficient blimp control or sim-to-real RL would get useful ideas from the setup and results. It has enough real-world grounding to deserve a serious referee, even though revisions will probably be needed for more detailed analysis and ablations. I would send it out for peer review.

Referee Report

3 major / 3 minor

Summary. The paper addresses goal-directed tracking control for underactuated blimps with a bi-level reinforcement learning framework. It decouples CoM reconfiguration planning (an outer policy that selects a target-dependent CoM configuration before flight) from inner thrust control for straight-line tracking. The paper introduces a two-stage learning strategy with a convergence analysis, and reports simulation and real-world experiments on 27 goals in which the method outperforms fixed-CoM baselines and PID controllers in tracking accuracy, robustness, and sim-to-real transfer.

Significance. If the central claims hold, this work could significantly advance control strategies for underactuated aerial vehicles by leveraging CoM reconfiguration and RL to handle nonlinear couplings without additional actuators. The two-stage learning and convergence analysis provide a structured approach to stable policy training. Strengths include the hardware-efficient design focus and comprehensive experimental evaluation demonstrating practical sim-to-real applicability. This could influence designs for energy-efficient blimps in applications like surveillance or delivery.

major comments (3)

Convergence analysis section: the analysis of the bi-level process relies on stability margins or Lipschitz conditions that are not independently verified against the real blimp's aerodynamic effects and slider friction. If those margins are violated even on a subset of the 27 goals, the reported tracking accuracy and sim-to-real transfer would not generalize.
Bi-level framework (outer/inner policy decoupling): the assumption that a single pre-flight CoM pose selected by the outer policy renders the underactuated 2-thruster dynamics sufficiently controllable for the inner policy to reject all disturbances without further CoM motion requires stronger justification, as residual nonlinear couplings may persist.
Experimental results on 27-goal set: the claim of consistent outperformance lacks reported quantitative metrics, error bars, or statistical tests in the evaluation details, making it difficult to assess the magnitude and reliability of improvements over fixed-CoM and PID baselines.
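
The kind of reporting the last comment asks for can be sketched in a few lines of Python using only the standard library. This is an illustration under stated assumptions: the per-goal error lists below are made-up placeholder values, not results from the paper, and `paired_t_statistic` and `summarize` are hypothetical helper names. With the real data, each list would hold one tracking error per goal (27 entries), paired by goal.

```python
import math
import statistics
from typing import Sequence


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """t statistic for the mean of paired differences a[i] - b[i] (df = n - 1)."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with n >= 2")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0.0:
        raise ValueError("all paired differences are identical")
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))


def summarize(name: str, errors: Sequence[float]) -> str:
    """Mean and sample standard deviation of per-goal errors, in metres."""
    return (f"{name}: mean={statistics.mean(errors):.3f} m, "
            f"std={statistics.stdev(errors):.3f} m")


# PLACEHOLDER per-goal RMSE values (metres), invented purely for illustration.
baseline_rmse = [0.30, 0.25, 0.40, 0.35]
proposed_rmse = [0.20, 0.20, 0.25, 0.30]

print(summarize("baseline", baseline_rmse))
print(summarize("proposed", proposed_rmse))
t_stat = paired_t_statistic(baseline_rmse, proposed_rmse)
print(f"paired t statistic (df={len(baseline_rmse) - 1}): {t_stat:.3f}")
```

Pairing by goal matters because some targets are intrinsically harder than others; the paired test compares methods goal by goal rather than pooling all errors. Converting the statistic to a p-value requires the Student t distribution, which the standard library does not provide.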

minor comments (3)

Abstract: the mention of 'convergence analysis' should briefly note the key assumptions or theorems to aid readers.
Notation throughout: ensure consistent symbols for outer CoM policy versus inner thrust policy to avoid ambiguity in the method description.
Figures in experiments: trajectory plots should include error bands or results from multiple runs to improve clarity of robustness claims.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. We address each major comment point by point below, indicating the revisions we will incorporate.

Referee: Convergence analysis section: the analysis of the bi-level process relies on stability margins or Lipschitz conditions that are not independently verified against the real blimp's aerodynamic effects and slider friction. If those margins are violated even on a subset of the 27 goals, the reported tracking accuracy and sim-to-real transfer would not generalize.

Authors: We acknowledge that the convergence analysis is derived under modeling assumptions and would be strengthened by explicit validation against hardware-specific effects. In the revised manuscript, we will add a sensitivity analysis that perturbs the Lipschitz constants and stability margins using experimentally measured ranges for aerodynamic drag and slider friction. We will also report the percentage of the 27 goals for which the assumptions hold and include a discussion of any observed performance degradation when margins are approached.

Revision: yes

Referee: Bi-level framework (outer/inner policy decoupling): the assumption that a single pre-flight CoM pose selected by the outer policy renders the underactuated 2-thruster dynamics sufficiently controllable for the inner policy to reject all disturbances without further CoM motion requires stronger justification, as residual nonlinear couplings may persist.

Authors: The single pre-flight CoM selection is intended to shift the system equilibrium so that straight-line tracking becomes feasible with thrust alone, exploiting the underactuation structure. We agree that residual couplings warrant further justification. The revision will include an expanded controllability analysis (rank conditions on the linearized dynamics after CoM shift) and an ablation experiment comparing fixed-pose versus continuous-CoM policies to quantify the impact of any remaining nonlinearities.

Revision: partial

Referee: Experimental results on 27-goal set: the claim of consistent outperformance lacks reported quantitative metrics, error bars, or statistical tests in the evaluation details, making it difficult to assess the magnitude and reliability of improvements over fixed-CoM and PID baselines.

Authors: We agree that the current presentation of results would benefit from greater quantitative detail. The revised experimental section will include tables with mean tracking errors and standard deviations for position and orientation across all 27 goals, error bars on all performance plots, and statistical significance tests (paired t-tests) comparing the proposed method against the fixed-CoM and PID baselines in both simulation and real-world trials.

Revision: yes
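
The rank condition mentioned in the second response is, for a linearized time-invariant model x' = Ax + Bu, usually the Kalman test: the pair (A, B) is controllable when [B, AB, ..., A^(n-1)B] has rank n. The Python sketch below shows that test on a toy double integrator. The matrices are an assumption chosen for illustration and are not the blimp's linearized dynamics; the helper names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (m x k) and b (k x n)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def rank(m: Matrix, tol: float = 1e-9) -> int:
    """Rank via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in m]  # work on a copy
    rows, cols, r = len(m), len(m[0]), 0
    for c in range(cols):
        if r == rows:
            break
        pivot = max(range(r, rows), key=lambda i: abs(m[i][c]))
        if abs(m[pivot][c]) <= tol:
            continue  # no usable pivot in this column
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, rows):
            f = m[i][c] / m[r][c]
            for j in range(c, cols):
                m[i][j] -= f * m[r][j]
        r += 1
    return r


def controllability_matrix(a: Matrix, b: Matrix) -> Matrix:
    """Stack [B, AB, ..., A^(n-1) B] side by side."""
    n = len(a)
    blocks, block = [], b
    for _ in range(n):
        blocks.append(block)
        block = matmul(a, block)
    return [sum((blk[i] for blk in blocks), []) for i in range(n)]


def is_controllable(a: Matrix, b: Matrix) -> bool:
    return rank(controllability_matrix(a, b)) == len(a)


# Toy double integrator (position, velocity); NOT the blimp model.
A = [[0.0, 1.0], [0.0, 0.0]]
B_force = [[0.0], [1.0]]     # input acts on acceleration
B_position = [[1.0], [0.0]]  # input acts on position rate only
```

In this toy case `is_controllable(A, B_force)` holds while `is_controllable(A, B_position)` does not, which mirrors the question the referee raises: whether the inputs that remain after the slider is fixed still reach every state that matters. A linearized test of this kind is only local, so it would complement rather than replace the proposed ablation.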

Circularity Check

0 steps flagged

No circularity: empirical validation independent of derivations

The paper presents a bi-level RL control method for an underactuated blimp, with an outer policy selecting CoM configuration and an inner policy handling thrust tracking, plus a two-stage training strategy and convergence analysis. All load-bearing claims of superior tracking accuracy, robustness, and sim-to-real transfer are supported by direct experimental comparisons on a 27-goal set against fixed-CoM baselines and PID controllers, rather than any first-principles derivation, fitted parameter renamed as prediction, or self-citation chain. No equations or steps reduce by construction to their inputs; the framework is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard RL convergence properties and the assumption that hardware CoM reconfiguration can be effectively planned separately from dynamics; no new physical entities are introduced.

free parameters (1)

RL training hyperparameters
Learning rates, network sizes, and reward weights are typically fitted during the two-stage process but not detailed in the abstract.

axioms (1)

Domain assumption: the bi-level process converges under the proposed two-stage learning strategy.
Invoked via the mentioned convergence analysis to ensure stable learning of the decoupled policies.

[1] [1]

Adaptive output feedback trajectory tracking control of an indoor blimp: Controller design and experiment validation,

J. Dong, H. Yu, B. Lu, H. Liu, and Y . Fang, “Adaptive output feedback trajectory tracking control of an indoor blimp: Controller design and experiment validation,”IEEE Transactions on Industrial Electronics, vol. 72, no. 4, pp. 3960–3971, 2025

work page 2025

[2] [2]

Past, present, and future of aerial robotic manipulators,

A. Ollero, M. Tognon, A. Suarez, D. Lee, and A. Franchi, “Past, present, and future of aerial robotic manipulators,”IEEE Transactions on Robotics, vol. 38, no. 1, pp. 626–645, 2022. Fig. 7. Time histories of cross-track errore trk for nine representative targets on theζ x = 4.5 mslice. (a)–(i) correspond to different combinations of ζy ∈ {−2,0,2}mandζ z ∈...

work page 2022

[3] [3]

Prototype, modeling, and control of aerial robots with physical interaction: A review,

H. Zhong, J. Liang, Y . Chen, H. Zhang, J. Mao, and Y . Wang, “Prototype, modeling, and control of aerial robots with physical interaction: A review,”IEEE Transactions on Automation Science and Engineering, vol. 22, pp. 3528–3542, 2025

work page 2025

[4] [4]

A morphing quadrotor-blimp with balloon failure resilience for mobile ecological sensing,

S. Sharma, M. Verhoeff, F. Joosen, R. Venkatesha Prasad, and S. Hamaza, “A morphing quadrotor-blimp with balloon failure resilience for mobile ecological sensing,”IEEE Robotics and Automation Letters, vol. 9, no. 7, pp. 6408–6415, 2024

work page 2024

[5] [5]

Monocular vision-based human following on miniature robotic blimp,

N. Yao, E. Anaya, Q. Tao, S. Cho, H. Zheng, and F. Zhang, “Monocular vision-based human following on miniature robotic blimp,” in2017 IEEE International Conference on Robotics and Automation (ICRA), 2017, pp. 3244–3249

work page 2017

[6] [6]

Review of autonomous outdoor blimps and their applications,

S. S. Bhat, S. G. Anavatti, M. Garratt, and S. Ravi, “Review of autonomous outdoor blimps and their applications,”Drone Systems and Applications, vol. 12, pp. 1–21, 2024

work page 2024

[7] [7]

Local positioning system using uwb range measurements for an unmanned blimp,

V . Mai, M. Kamel, M. Krebs, A. Schaffner, D. Meier, L. Paull, and R. Siegwart, “Local positioning system using uwb range measurements for an unmanned blimp,”IEEE Robotics and Automation Letters, vol. 3, no. 4, pp. 2971–2978, 2018

work page 2018

[8] [8]

Human pointing motion during interaction with an autonomous blimp,

M. Hou and F. Zhang, “Human pointing motion during interaction with an autonomous blimp,”Scientific Reports, vol. 12, p. 11402, 07 2022

work page 2022

[9] [9]

Sblimp: Design, model, and translational motion control for a swing-blimp,

J. Xu, D. S. D’Antonio, D. J. Ammirato, and D. Salda ˜na, “Sblimp: Design, model, and translational motion control for a swing-blimp,” in2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2023, pp. 6977–6982

work page 2023

[10] [10]

Swing- reducing flight control system for an underactuated indoor miniature autonomous blimp,

Q. Tao, J. Wang, Z. Xu, T. X. Lin, Y . Yuan, and F. Zhang, “Swing- reducing flight control system for an underactuated indoor miniature autonomous blimp,”IEEE/ASME Transactions on Mechatronics, vol. 26, no. 4, pp. 1895–1904, 2021

work page 1904

[11] [11]

Rgblimp-q: Robotic gliding blimp with moving mass control based on a bird-inspired continuum arm,

H. Cheng and F. Zhang, “Rgblimp-q: Robotic gliding blimp with moving mass control based on a bird-inspired continuum arm,”IEEE Transactions on Robotics, vol. 41, pp. 5097–5116, 2025

work page 2025

[12] [12]

Feedback linearization of an underactuated miniature blimp with zero dynamics mitigation using high order control barrier functions,

M. Kasmalkar, L. Baird, and S. Coogan, “Feedback linearization of an underactuated miniature blimp with zero dynamics mitigation using high order control barrier functions,” IEEE Control Systems Letters, vol. 8, pp. 2589–2594, 2024

work page 2024

[13] [13]

Bioinspired intermittent control of a miniature autonomous blimp for tracking a moving target,

R. J. Suitor, D. Sofge, and D. A. Paley, “Bioinspired intermittent control of a miniature autonomous blimp for tracking a moving target,” in OCEANS 2024 - Halifax, 2024, pp. 1–9

work page 2024

[14] [14]

Design and autonomous control of a solar-power blimp,

C. Wan, N. Kingry, and R. Dai, “Design and autonomous control of a solar-power blimp,” Jan. 2018

work page 2018

[15] [15]

Developing a low-cost autonomous indoor blimp,

J. López, G. P, R. Sanz, and W. Burgard, “Developing a low-cost autonomous indoor blimp,” Journal of Physical Agents, vol. 3, Jan. 2009

work page 2009

[16] [16]

Mission analysis, dynamics and robust control of an indoor blimp in a cern detector magnetic environment,

F. Mazzei, L. Teofili, F. Curti, and C. Gargiulo, “Mission analysis, dynamics and robust control of an indoor blimp in a cern detector magnetic environment,” Frontiers in Robotics and AI, vol. 10, Oct. 2023

work page 2023

[17] [17]

An underactuated control system design for adaptive autopilot of fixed-wing drones,

S. Baldi, S. Roy, K. Yang, and D. Liu, “An underactuated control system design for adaptive autopilot of fixed-wing drones,” IEEE/ASME Transactions on Mechatronics, vol. 27, no. 5, pp. 4045–4056, 2022

work page 2022

[18] [18]

Action-based contrastive unsupervised representations for reinforcement learning toward robotic manipulation,

Q. Chen, C. Ye, W. Lin, Z. Liu, X. Yu, J. Qiu, and H. Gao, “Action-based contrastive unsupervised representations for reinforcement learning toward robotic manipulation,” IEEE Transactions on Industrial Electronics, vol. 73, no. 2, pp. 3104–3113, 2026

work page 2026

[19] [19]

A novel robotic skill learning approach for assembly task with dynamical system and broad learning,

J. Zhang, Z. Jin, Z. Zhao, and C. Yang, “A novel robotic skill learning approach for assembly task with dynamical system and broad learning,” IEEE Transactions on Industrial Electronics, vol. 72, no. 9, pp. 9304–9313, 2025

work page 2025

[20] [20]

Broad reinforcement learning for adaptive control of a 2-dof helicopter system with unknown dead zone,

Z. Zhao, Y. Weng, Z. Liu, C. Yang, and C. L. P. Chen, “Broad reinforcement learning for adaptive control of a 2-dof helicopter system with unknown dead zone,” IEEE Transactions on Industrial Electronics, vol. 72, no. 4, pp. 3984–3993, 2025

work page 2025

[21] [21]

Autonomous blimp control via H∞ robust deep residual reinforcement learning,

Y. Zuo, Y. T. Liu, and A. Ahmad, “Autonomous blimp control via H∞ robust deep residual reinforcement learning,” in 2023 IEEE 19th International Conference on Automation Science and Engineering (CASE), 2023, pp. 1–8

work page 2023

[22] [22]

Gthsl: A goal-task-driven hierarchical sharing learning method to learn long-horizon tasks autonomously,

R. Jiang, X. Cheng, H. Sang, Z. Wang, Y. Zhou, and B. He, “Gthsl: A goal-task-driven hierarchical sharing learning method to learn long-horizon tasks autonomously,” IEEE Transactions on Industrial Electronics, vol. 72, no. 4, pp. 3994–4005, 2025

work page 2025

[23] [23]

Hierarchical reinforcement learning with universal policies for multi-step robotic manipulation,

X. Yang, Z. Ji, J. Wu, Y.-K. Lai, C. Wei, G. Liu, and R. Setchi, “Hierarchical reinforcement learning with universal policies for multi-step robotic manipulation,” IEEE Transactions on Neural Networks and Learning Systems, vol. 33, no. 9, pp. 4727–4741, 2022

work page 2022

[24] [24]

A hierarchical deep reinforcement learning framework with high efficiency and generalization for fast and safe navigation,

W. Zhu and M. Hayashibe, “A hierarchical deep reinforcement learning framework with high efficiency and generalization for fast and safe navigation,” IEEE Transactions on Industrial Electronics, vol. 70, no. 5, pp. 4962–4971, 2023

work page 2023

[25] [25]

Planning-augmented hierarchical reinforcement learning,

R. Gieselmann and F. T. Pokorny, “Planning-augmented hierarchical reinforcement learning,” IEEE Robotics and Automation Letters, vol. 6, no. 3, pp. 5097–5104, 2021

work page 2021

[26] [26]

Rgblimp: Robotic gliding blimp - design, modeling, development, and aerodynamics analysis,

H. Cheng, Z. Sha, Y. Zhu, and F. Zhang, “Rgblimp: Robotic gliding blimp - design, modeling, development, and aerodynamics analysis,” IEEE Robotics and Automation Letters, vol. 8, no. 11, pp. 7273–7280, 2023

work page 2023

[27] [27]

Real-world learning control for autonomous exploration of a biomimetic robotic shark,

S. Yan, Z. Wu, J. Wang, Y. Huang, M. Tan, and J. Yu, “Real-world learning control for autonomous exploration of a biomimetic robotic shark,” IEEE Transactions on Industrial Electronics, vol. 70, no. 4, pp. 3966–3974, 2023

work page 2023

[28] [28]

Chaos-augmented reinforcement learning with diffusion models for robust legged robot locomotion,

H. Zhang, C. Hua, J. Chen, X. Luo, and J. Wei, “Chaos-augmented reinforcement learning with diffusion models for robust legged robot locomotion,” IEEE Transactions on Industrial Electronics, vol. 73, no. 2, pp. 2600–2609, 2026

work page 2026

[29] [29]

Task and domain adaptive reinforcement learning for robot control,

Y. T. Liu, N. Singh, and A. Ahmad, “Task and domain adaptive reinforcement learning for robot control,” in 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2024, pp. 656–663

work page 2024

[30] [30]

Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,

T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” in Proceedings of the 35th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, J. Dy and A. Krause, Eds., vol. 80. PMLR, 10–15 Jul 2018, pp. 1861–1870

work page 2018

[31] [31]

Provably convergent two-timescale off-policy actor-critic with function approximation,

S. Zhang, B. Liu, H. Yao, and S. Whiteson, “Provably convergent two-timescale off-policy actor-critic with function approximation,” in Proceedings of the 37th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, H. Daumé III and A. Singh, Eds., vol. 119. PMLR, 13–18 Jul 2020, pp. 11204–11213

work page 2020

[32] [32]

On actor-critic algorithms,

V. R. Konda and J. N. Tsitsiklis, “On actor-critic algorithms,” SIAM J. Control Optim., vol. 42, no. 4, pp. 1143–1166, Apr. 2003

work page 2003

[33] [33]

A stochastic approximation method,

H. Robbins and S. Monro, “A stochastic approximation method,” The annals of mathematical statistics, pp. 400–407, 1951

Xiaorui Wang (Student Member, IEEE) received the bachelor’s degree in robotics engineering in 2025 from Peking University, Beijing, China, where he is currently working toward the Ph.D. degree in general mechanics and foundation of mechani...

work page 1951

He is currently an Associate Professor of Robotics Engineering with Peking University, Beijing, China. His research interests include mechatronics systems, robotics and controls, aerial vehicles, and underwater vehicles