
arxiv: 2605.19592 · v1 · pith:M6BJLCDKnew · submitted 2026-05-19 · 💻 cs.RO · cs.AI

Implicit Action Chunking for Smooth Continuous Control

Pith reviewed 2026-05-20 05:03 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords reinforcement learning · continuous control · action chunking · smooth control · dual-window smoothing · DeepMind Control Suite · autonomous driving

The pith

Dual-Window Smoothing produces smooth continuous control in reinforcement learning by implicitly chunking actions without expanding the output space.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Dual-Window Smoothing as a method to reduce high-frequency jitter in reinforcement learning policies for physical systems. It achieves this through an execution window that modulates actions deterministically for smoothness and a value window that aligns temporal-difference targets to remove bias from open-loop execution. A lightweight regularizer on action differences further promotes continuity. Sympathetic readers would care because oscillatory controls undermine safety and efficiency in robotics, energy systems, and autonomous driving, while explicit chunking methods complicate optimization by growing the action dimension with the horizon.

Core claim

Dual-Window Smoothing is an implicit action chunking framework that enforces temporal coherence in continuous control without expanding the policy's action space. It relies on a dual-window design—an execution window for deterministic modulation that guarantees physical smoothness and a value window that aligns temporal-difference targets over the horizon to correct critic bias induced by open-loop execution—plus a first-order action-difference regularizer on the actor to encourage global continuity.
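To make the execution window concrete, here is a minimal Python sketch. It assumes, purely for illustration, that the deterministic modulation F_h is a linear ramp from the last executed action to the new reference action; this review does not state the paper's actual F_h, and the names `execution_window`, `prev_u`, and `a_ref` are hypothetical.

```python
from typing import List, Sequence


def execution_window(prev_u: Sequence[float], a_ref: Sequence[float], h: int) -> List[List[float]]:
    """Return h executed actions that ramp linearly from prev_u to a_ref.

    ASSUMPTION: linear interpolation stands in for the paper's F_h,
    whose exact form is not given in this review.
    """
    if h < 1:
        raise ValueError("h must be at least 1")
    if len(prev_u) != len(a_ref):
        raise ValueError("prev_u and a_ref must have the same dimension")
    window = []
    for k in range(1, h + 1):
        w = k / h  # interpolation weight, reaches 1.0 on the last step
        window.append([(1.0 - w) * p + w * a for p, a in zip(prev_u, a_ref)])
    return window


# One-dimensional example: the policy emits a single reference action,
# and the window expands it into h = 4 step-wise executed actions.
example_window = execution_window(prev_u=[0.0], a_ref=[1.0], h=4)
print(example_window)
```

In this sketch the policy output stays d-dimensional (here d = 1) while the environment still receives one action per step, and every per-step change equals (a_ref - prev_u) / h under the assumed ramp.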

What carries the argument

Dual-window design consisting of an execution window for deterministic action modulation and a value window for temporal-difference target alignment, augmented by an actor-side first-order action-difference regularizer.
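The actor-side regularizer can be illustrated with a small Python sketch. It assumes the penalty is the mean squared first-order difference between consecutive actions, scaled by a hypothetical `weight`; the review only says the regularizer is based on first-order action differences, so the norm, the averaging, and the function name are assumptions.

```python
from typing import Sequence


def action_difference_penalty(actions: Sequence[Sequence[float]], weight: float = 1.0) -> float:
    """Mean squared difference between consecutive actions, times weight.

    ASSUMPTION: squared L2 differences averaged over the sequence; the
    paper's exact regularizer may differ.
    """
    if len(actions) < 2:
        return 0.0  # a single action has no first-order difference
    total = 0.0
    for prev, curr in zip(actions, actions[1:]):
        total += sum((c - p) ** 2 for p, c in zip(prev, curr))
    return weight * total / (len(actions) - 1)


# Illustrative sequences (not data from the paper).
smooth_actions = [[0.0], [0.1], [0.2], [0.3]]
jittery_actions = [[1.0], [-1.0], [1.0], [-1.0]]
print(action_difference_penalty(smooth_actions))
print(action_difference_penalty(jittery_actions))
```

A bang-bang sequence is penalized far more heavily than a slow ramp, which is the behavior such a term is meant to discourage when added to the actor loss.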

If this is right

  • Outperforms state-of-the-art baselines on the DeepMind Control Suite and industrial energy management tasks.
  • Produces smoother control signals and safer behavior with reduced jitter in vision-based autonomous driving.
  • Achieves a 100 percent success rate on complex vision-based autonomous driving tasks.
  • Bridges temporal abstraction with standard step-wise reactive control without changing the interaction interface.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The dual-window separation could be tested in other continuous domains such as legged locomotion where jitter directly affects energy use.
  • Removing the value window in ablation studies would isolate whether bias correction or modulation alone drives most of the reported stability.
  • The first-order regularizer might combine with higher-order penalties to further reduce acceleration in hardware deployments.

Load-bearing premise

The value window correctly aligns temporal-difference targets over the horizon without introducing compensating errors that cancel the smoothness gains.
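The alignment this premise depends on can be sketched in Python. The Figure 3 caption describes a critic target that interpolates between the one-step TD target and an h-step windowed target through a segment-valid gate z_t. The sketch below assumes a hard 0/1 gate and a plain discounted h-step return with a bootstrap value at the end of the window; neither detail is confirmed by the text available here, and all function and argument names are hypothetical.

```python
from typing import Sequence


def one_step_target(reward: float, gamma: float, q_next: float) -> float:
    """Standard one-step TD target: r_t + gamma * Q(s_{t+1}, .)."""
    return reward + gamma * q_next


def windowed_target(rewards: Sequence[float], gamma: float, q_after_window: float) -> float:
    """Discounted sum of the h rewards in the window plus a bootstrap term."""
    h = len(rewards)
    discounted = sum((gamma ** k) * r for k, r in enumerate(rewards))
    return discounted + (gamma ** h) * q_after_window


def critic_target(rewards: Sequence[float], gamma: float, q_next: float,
                  q_after_window: float, segment_valid: bool) -> float:
    """Gate between the one-step and the windowed target.

    ASSUMPTION: z is 1 when the length-h segment is contiguous and free of
    terminations, else 0, and the target is (1 - z) * y1 + z * G_h.
    """
    z = 1.0 if segment_valid else 0.0
    y1 = one_step_target(rewards[0], gamma, q_next)
    g_h = windowed_target(rewards, gamma, q_after_window)
    return (1.0 - z) * y1 + z * g_h


# Illustrative numbers only: h = 4 rewards of 1.0, gamma = 0.5.
rewards = [1.0, 1.0, 1.0, 1.0]
print(critic_target(rewards, 0.5, q_next=2.0, q_after_window=8.0, segment_valid=True))
print(critic_target(rewards, 0.5, q_next=2.0, q_after_window=8.0, segment_valid=False))
```

When the segment is invalid, for example because an episode ended inside the window, the sketch falls back to the ordinary one-step target, so the critic is never supervised across a termination boundary.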

What would settle it

Running the same tasks with the value window disabled or randomly offset and checking whether the reported reductions in jitter and performance gains disappear.
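A comparison of this kind needs a jitter number to compare. The Python sketch below uses the mean absolute change between consecutive actions as a stand-in; it is not the AFR metric named in the Figure 6 caption, whose definition is not given in this review. The function names and the two example action traces are hypothetical and are not results from the paper.

```python
from typing import Sequence


def mean_abs_action_change(actions: Sequence[float]) -> float:
    """Average absolute step-to-step change of a scalar action trace."""
    if len(actions) < 2:
        return 0.0
    diffs = [abs(b - a) for a, b in zip(actions, actions[1:])]
    return sum(diffs) / len(diffs)


def relative_jitter_reduction(baseline: Sequence[float], candidate: Sequence[float]) -> float:
    """Fraction by which candidate reduces jitter relative to baseline."""
    base = mean_abs_action_change(baseline)
    if base == 0.0:
        raise ValueError("baseline trace has no jitter to reduce")
    return 1.0 - mean_abs_action_change(candidate) / base


# Made-up traces: a switching signal versus a slow triangle wave.
ablated_trace = [1.0, -1.0, 1.0, -1.0, 1.0]
full_method_trace = [0.0, 0.5, 1.0, 0.5, 0.0]
print(relative_jitter_reduction(ablated_trace, full_method_trace))
```

If disabling or offsetting the value window left this reduction essentially unchanged across seeds, that would suggest the execution window and regularizer, rather than the target alignment, carry the smoothness gains.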

Figures

Figures reproduced from arXiv: 2605.19592 by Bosun Liang, Chen Sun, Chuanzhi Fan, Huachun Tan, Shuo Pei, Yong Wang, Yuankai Wu, Zirui Chen.

Figure 1
Explicit vs. implicit action chunking. Explicit chunking outputs an h-step action sequence in one decision (dimension hd) and executes it open-loop. DWS keeps a d-dimensional policy output a and induces chunk-like temporal coherence by executing u = F_h(a) step-wise.
Figure 3
Value Window. A contiguous length-h executed segment is sliced from the ordered buffer W to form an h-step windowed target G_t^(h). The critic supervision target interpolates between the one-step TD target and the windowed target using the segment-valid gate z_t.
Figure 4
Performance Comparison. Radar charts showing total returns across five DMC tasks for TD3 (Left) and SAC (Right) backbones. DWS (Blue) achieves the largest coverage area, indicating that it attains SOTA returns without the performance degradation observed in constrained smoothing policies (LipsNet++) or explicit chunking methods (ActionChunk). Full quantitative tables are provided in Section D.
Figure 5
Figure 5 presents a microscopic view of action trajectories in Reacher-Hard. Standard RL baselines tend to degenerate into saturation-level switching behaviors to minimize instantaneous tracking error, resulting in pronounced high-frequency oscillations. In contrast, DWS produces very smooth and stable control actions by generating locally coherent signals through the execution window design. This design uses a d…
Figure 6
Relative smoothness improvement. Action smoothness improvements, measured by AFR reduction, of DWS-TD3 and DWS-SAC relative to their Vanilla counterparts across five DeepMind benchmarks.
Figure 7
Figure 9
Qualitative comparison on the LCO task (example episode). Trajectory and steering profiles of HG-TD3 (blue) and the proposed DWS (orange).
Figure 10
Figure 10 summarizes how DWS augments a standard off-policy actor–critic loop with a shared horizon h. At each window boundary, the actor outputs a reference action a_t^RL and the Execution Window deterministically produces step-wise executed actions {u_t, ..., u_{t+h−1}}. Transitions are stored in replay D and also appended to the ordered Window Buffer W, from which contiguous length-h segments enable terminal-safe…
Figure 11
Visualizations of DMC tasks. These environments act as proxies for core autonomous driving challenges: (a) Reacher for precision tracking; (b) Ball-in-Cup for impulse handling and dynamic recovery; (c) Cart-pole for stabilization of unstable equilibria; and (d) Point Mass for inertial control and oscillation suppression.
Figure 13
(a) Cheetah (b) Walker
Figure 14
Training dynamics on the FCEV energy management task. The curves display the mean evaluation return ± standard deviation over 1,000 episodes. The zoomed-in inset (bottom right) highlights the asymptotic performance in the final phase, where DWS demonstrates superior stability and highest converged return compared to baselines.
Figure 15
LCO scenario in CARLA. We refer to this common training pipeline as HG-TD3. All CARLA methods share the same HG-TD3 backbone and differ only by the added action-smoothing component. Baselines. We compare against representative smoothing approaches integrated into the same backbone: L2C2, ActionChunk, LipsNet++, and SmODE, as well as the plain b…
Figure 16
LCO: robustness to NPC speed. Success rate (%) across NPC speeds {0, 1, 2, 3, 4, 5} m/s.
Figure 17
Pairwise steering trajectories in LCO (appendix). For each baseline, we show steering angle (top) and steering increment ∆steer (bottom) from representative successful episodes under identical plotting/smoothing settings. Compared to each baseline, DWS consistently suppresses high-frequency oscillations in ∆steer, which reduces lateral “wobbling” and helps avoid boundary-violation terminations in the narr…
Figure 18
AEB scenario in CARLA. Representative camera views. From Section F.3.1 (Scenario and Termination) of Section F.3, CARLA: AEB (Autonomous Emergency Braking): We specify the AEB scenario in CARLA Town01 as follows. The ego vehicle starts from a fixed spawn point (x, y) = (338.5, 190.0) and cruises at a target speed of 60 km/h (16.7 m/s). A pedestrian is spawned ahead at (x, y) = (331.0, 260.0) and walks laterally with a constant spe…
Figure 19
AEB training-time evaluation curves (every 5 episodes).
Figure 20
AEB learning curves. Noise-free evaluation every 5 episodes; shaded regions show variability.

Original abstract

Reinforcement learning often produces high-frequency oscillatory control signals that undermine the safety and stability required for physical deployment. Explicit action chunking addresses this by predicting fixed-horizon trajectories but scales the policy output dimension proportionally with the horizon length, leading to optimization difficulties and incompatibility with standard step-wise interaction. To overcome these challenges, this paper proposes Dual-Window Smoothing (DWS), an implicit action chunking framework for smooth continuous control. Unlike explicit methods, DWS enforces temporal coherence without expanding the action space. It uses a dual-window design: an execution window that ensures physical smoothness through deterministic modulation, and a value window that aligns temporal-difference targets over the horizon to correct critic bias caused by open-loop execution. DWS also includes a lightweight actor-side temporal regularizer based on first-order action differences to promote global continuity. This design effectively bridges the gap between temporal abstraction and reactive step-wise control. Experiments on benchmarks including the DeepMind Control Suite and industrial energy management tasks show that DWS outperforms state-of-the-art (SOTA) baselines. In complex vision-based autonomous driving tasks, DWS achieves smoother control, safer behavior with reduced jitter, and attains a 100% success rate.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes Dual-Window Smoothing (DWS), an implicit action chunking framework for smooth continuous control in reinforcement learning. Unlike explicit chunking, DWS avoids expanding the policy output dimension by using a dual-window design: an execution window applies deterministic modulation for physical smoothness, while a value window aligns multi-step TD targets to correct critic bias induced by open-loop execution. A lightweight first-order action-difference regularizer is added on the actor side. Experiments on the DeepMind Control Suite, industrial energy management tasks, and vision-based autonomous driving report outperformance over SOTA baselines, smoother control, reduced jitter, and a 100% success rate in driving.

Significance. If the bias-correction claim and empirical gains hold after rigorous validation, the work could meaningfully improve the deployability of RL policies in safety-critical continuous-control domains such as robotics and autonomous driving by providing a scalable alternative to explicit temporal abstraction.

major comments (2)
  1. [Abstract (dual-window design)] The value window is described as aligning 'temporal-difference targets over the horizon to correct critic bias caused by open-loop execution' via deterministic modulation plus alignment, yet no equation or derivation shows that the resulting target equals the unbiased multi-step return under the execution policy. Without this, residual bias or new temporal inconsistencies cannot be ruled out, and performance improvements could be artifacts of the first-order regularizer rather than the dual-window construction.
  2. [Experiments] The reported 100% success rate and outperformance on DeepMind Control Suite and driving tasks are presented without error bars, ablation results isolating the value window's contribution, or statistical tests. This weakens the ability to attribute gains specifically to bias correction.
minor comments (1)
  1. [Method] Notation for the execution and value windows could be introduced with explicit symbols and a small diagram to clarify the temporal offset between the two windows.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below with clarifications and indicate where revisions will be made to strengthen the presentation.

Point-by-point responses
  1. Referee: [Abstract (dual-window design)] The value window is described as aligning 'temporal-difference targets over the horizon to correct critic bias caused by open-loop execution' via deterministic modulation plus alignment, yet no equation or derivation shows that the resulting target equals the unbiased multi-step return under the execution policy. Without this, residual bias or new temporal inconsistencies cannot be ruled out, and performance improvements could be artifacts of the first-order regularizer rather than the dual-window construction.

    Authors: We agree that an explicit derivation would strengthen the bias-correction claim. In the revised manuscript we will add a step-by-step derivation (in the main text or appendix) showing that the value-window target equals the unbiased multi-step return under the deterministic execution policy. The alignment step recomputes the TD targets using the modulated actions that are actually executed, thereby removing the distribution shift that otherwise biases the critic; the first-order regularizer is a separate, lightweight term whose isolated effect is quantified in the ablations. revision: yes

  2. Referee: [Experiments] The reported 100% success rate and outperformance on DeepMind Control Suite and driving tasks are presented without error bars, ablation results isolating the value window's contribution, or statistical tests. This weakens the ability to attribute gains specifically to bias correction.

    Authors: We accept that the current experimental section would benefit from additional statistical rigor. The revised version will report mean and standard deviation over at least five random seeds with error bars, include a dedicated ablation table that removes the value window while keeping the execution window and regularizer, and add paired statistical tests (e.g., Wilcoxon signed-rank) to assess significance of the reported gains. These changes will make the attribution to the dual-window design clearer. revision: yes
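The reporting promised here can be sketched with the Python standard library alone. The example computes mean and sample standard deviation over seeds and an exact two-sided Wilcoxon signed-rank p-value by enumerating all sign assignments. The per-seed returns are made-up placeholders, not results from the paper, and the function name `wilcoxon_signed_rank_exact` is hypothetical.

```python
from itertools import product
from statistics import mean, stdev
from typing import Sequence, Tuple


def wilcoxon_signed_rank_exact(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (W_plus, exact two-sided p-value) for paired samples x and y."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences are dropped
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    # Rank |d| in ascending order, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    center = sum(ranks) / 2  # expected W_plus under the null hypothesis
    observed = abs(w_plus - center)
    # Enumerate all 2^n sign patterns; feasible only for small n.
    extreme = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - center) >= observed - 1e-12:
            extreme += 1
    return w_plus, extreme / 2 ** n


# Placeholder per-seed returns for two methods (illustrative only).
method_returns = [10.0, 12.0, 11.0, 13.0, 14.0]
baseline_returns = [9.0, 10.0, 8.0, 9.0, 9.0]
print(mean(method_returns), stdev(method_returns))
print(wilcoxon_signed_rank_exact(method_returns, baseline_returns))
```

Even when one method wins on every seed, five paired observations cannot yield a two-sided exact p-value below 2 / 2^5 = 0.0625, so a test at the 0.05 level would need more pairs, for example more seeds or pairing across tasks.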

Circularity Check

0 steps flagged

No significant circularity; method defined procedurally without reduction to fitted inputs or self-citations

Full rationale

The paper introduces Dual-Window Smoothing (DWS) through an explicit dual-window construction consisting of an execution window for deterministic modulation and a value window for TD target alignment, plus a first-order action regularizer. These elements are presented as design choices that enforce temporal coherence without expanding the action space or relying on any fitted parameter that is then renamed as a prediction. No equations appear in the provided text that equate a claimed performance gain (such as smoothness or success rate) back to the same data or a self-referential definition. The central claims rest on the procedural definition and experimental outcomes rather than any load-bearing self-citation chain or ansatz smuggled from prior work by the same authors. The derivation is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on standard RL assumptions plus the unproven claim that the specific dual-window alignment removes critic bias without side effects. No new physical constants or particles are introduced.

axioms (1)
  • [domain assumption] Standard Markov decision process formulation and temporal-difference learning remain valid when actions are deterministically modulated over a short execution window.
    Invoked in the description of how the execution window interacts with the critic.





  67. [67]

    2024 , issn =

    LearningEMS: A Unified Framework and Open-source Benchmark for Learning-based Energy Management of Electric Vehicles , journal =. 2024 , issn =

  68. [68]

    IEEE Transactions on Industrial Informatics , volume=

    Hybrid electric vehicle energy management with computer vision and deep reinforcement learning , author=. IEEE Transactions on Industrial Informatics , volume=. 2020 , publisher=

  69. [69]

    Automotive Innovation , pages=

    Safe Reinforcement Learning-Based Eco-driving Strategy for Connected Electric Vehicles at Signalized Intersection , author=. Automotive Innovation , pages=. 2025 , publisher=

  70. [70]

    IEEE Sensors Journal , volume=

    Using a diffusion model for pedestrian trajectory prediction in semi-open autonomous driving environments , author=. IEEE Sensors Journal , volume=. 2024 , publisher=

  71. [71]

    IEEE Transactions on Intelligent Transportation Systems , year=

    Eliminating uncertainty of driver’s social preferences for lane change decision-making in realistic simulation environment , author=. IEEE Transactions on Intelligent Transportation Systems , year=

  72. [72]

    IEEE Transactions on Intelligent Transportation Systems , year=

    Toward human-vehicle collaboration for automated vehicles: A review and perspective , author=. IEEE Transactions on Intelligent Transportation Systems , year=

  73. [73]

    IEEE Internet of Things Journal , year=

    Trust-Calibrated Human-in-the-Loop Reinforcement Learning for Safe and Efficient Autonomous Navigation , author=. IEEE Internet of Things Journal , year=

  74. [74]

    Proceedings of the 40th International Conference on Machine Learning , series =

    LipsNet: A Smooth and Robust Neural Network with Adaptive Lipschitz Constant for High Accuracy Optimal Control , author =. Proceedings of the 40th International Conference on Machine Learning , series =. 2023 , publisher =

  75. [75]

    Proceedings of the 37th International Conference on Machine Learning , series =

    Deep Reinforcement Learning with Robust and Smooth Policy , author =. Proceedings of the 37th International Conference on Machine Learning , series =. 2020 , publisher =

  76. [76]

    arXiv preprint arXiv:2012.06644 , year =

    Regularizing Action Policies for Smooth Control with Reinforcement Learning , author =. arXiv preprint arXiv:2012.06644 , year =

  77. [77]

    arXiv preprint arXiv:2512.10926 , year =

    Decoupled Q-Chunking , author =. arXiv preprint arXiv:2512.10926 , year =

  78. [78]

    and Precup, Doina and Singh, Satinder , journal =

    Sutton, Richard S. and Precup, Doina and Singh, Satinder , journal =. Between. 1999 , doi =

  79. [79]

    Proceedings of the

    The Option-Critic Architecture , author =. Proceedings of the. 2017 , url =

  80. [80]

    Proceedings of the 36th International Conference on Machine Learning , series =

    On the Spectral Bias of Neural Networks , author =. Proceedings of the 36th International Conference on Machine Learning , series =. 2019 , publisher =
