Adaptive Action Chunking via Multi-Chunk Q Value Estimation
Pith reviewed 2026-05-12 02:24 UTC · model grok-4.3
The pith
ACH lets RL agents dynamically choose action sequence lengths by estimating values for all candidates in one Transformer forward pass.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ACH dynamically modulates chunk length by simultaneously estimating action-values for all candidate chunk lengths in a single forward pass of a Transformer-based architecture, allowing the agent to select the most effective chunk length adaptively based on the current state.
What carries the argument
Multi-chunk Q-value estimation, in which one Transformer forward pass produces separate Q-value estimates for every candidate chunk length so the policy can choose the best length per state.
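The paper's architecture is not specified here beyond "Transformer-based," so the following is only a minimal sketch of the single-pass, multi-head pattern the claim describes. A plain linear map stands in for the Transformer trunk, and all names, dimensions, and candidate lengths are our own assumptions, not the authors':

```python
import numpy as np

rng = np.random.default_rng(0)

STATE_DIM = 8
CHUNK_LENGTHS = [1, 2, 4, 8]  # hypothetical candidate set; not from the paper

# Shared trunk: one map standing in for the Transformer encoder.
# Every candidate chunk length reads off this single forward pass.
W_enc = rng.normal(size=(STATE_DIM, 16))
# One scalar Q-head per candidate chunk length.
W_heads = rng.normal(size=(len(CHUNK_LENGTHS), 16))

def multi_chunk_q(state):
    """One forward pass -> one Q-value per candidate chunk length."""
    features = np.tanh(state @ W_enc)   # shared representation, computed once
    return W_heads @ features            # shape: (len(CHUNK_LENGTHS),)

def select_chunk_length(state):
    """Adaptive selection: commit to the length with the highest Q-value."""
    q_values = multi_chunk_q(state)
    return CHUNK_LENGTHS[int(np.argmax(q_values))]

state = rng.normal(size=STATE_DIM)
print(select_chunk_length(state))  # one of 1, 2, 4, 8
```

Because the heads share one trunk evaluation, scoring all lengths costs roughly the same as scoring one — which is what makes the "comparable cost" claim below plausible.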
If this is right
- Agents can improve behavioral consistency and reduce bootstrapping errors by adapting chunk length to the current state.
- The same architecture supports both offline pre-training and online fine-tuning without extra training signals.
- Performance improves on 34 diverse tasks, indicating better generalization across environments with varying optimal horizons.
- Training and inference cost remain comparable to a single fixed-length model because all estimates share one forward pass.
Where Pith is reading between the lines
- The method could lower the cost of hyperparameter search by removing the need to tune chunk length in advance.
- The single-pass multi-scale estimation pattern may transfer to other sequence-decision problems such as option discovery or hierarchical RL.
- In real-time control settings the adaptive choice might reduce unnecessary commitment to long plans when the environment changes rapidly.
Load-bearing premise
A single Transformer forward pass can produce accurate and non-interfering Q-value estimates for every candidate chunk length at once.
What would settle it
An ablation that replaces the shared Transformer head with independent Q-heads for each chunk length and measures whether the performance gap over fixed-length baselines disappears.
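The proposed ablation can be made concrete with a toy contrast between the two conditions; this is our own illustrative setup, not the paper's experiment. Condition A shares one trunk across all length heads (the design under test); condition B gives each length a fully independent network, so cross-length interference is impossible by construction:

```python
import numpy as np

rng = np.random.default_rng(1)
STATE_DIM, HID = 8, 16
LENGTHS = [1, 2, 4, 8]  # hypothetical candidate chunk lengths

# Condition A (the paper's design): one shared trunk, one head per length.
W_shared = rng.normal(size=(STATE_DIM, HID))
heads = rng.normal(size=(len(LENGTHS), HID))

def q_shared(s):
    feats = np.tanh(s @ W_shared)  # single shared representation
    return heads @ feats

# Condition B (ablation control): an independent trunk per length,
# so no representation is shared across chunk lengths.
W_indep = rng.normal(size=(len(LENGTHS), STATE_DIM, HID))

def q_independent(s):
    return np.array(
        [heads[i] @ np.tanh(s @ W_indep[i]) for i in range(len(LENGTHS))]
    )

s = rng.normal(size=STATE_DIM)
print(q_shared(s), q_independent(s))
```

Training both conditions and comparing their gains over the best fixed-length baseline is the settling experiment: if the gap over fixed-length baselines survives under condition B, the advantage comes from adaptive selection itself rather than from the shared trunk.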
Original abstract
Action chunking emerged as a pivotal technique in imitation learning, enabling policies to predict cohesive action sequences rather than single actions. Recently, this approach has expanded to reinforcement learning (RL), enhancing behavioral consistency and reducing bootstrapping errors in value function estimation. However, existing methods rely on a fixed chunk length, creating a performance bottleneck as the optimal length varies across states and tasks. In this paper, we propose Adaptive Action CHunking (ACH), a novel offline-to-online RL algorithm that dynamically modulates chunk length during both training and inference. To find the optimal chunk length for a dynamically varying current state, we simultaneously estimate action-values for all candidate chunk lengths in a single forward pass, using a Transformer-based architecture. Our mechanism allows the agent to select the most effective chunk length adaptively based on the current state. Evaluated on 34 challenging tasks, ACH consistently outperforms fixed-length baselines, demonstrating superior generalization and learning efficiency in complex environments.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Adaptive Action Chunking (ACH), an offline-to-online RL algorithm that dynamically selects action chunk length at each state by estimating Q-values for all candidate lengths simultaneously via a single Transformer forward pass and choosing the length with the highest Q-value. It claims this adaptive mechanism yields consistent outperformance over fixed-length chunking baselines on 34 challenging tasks, with gains in generalization and learning efficiency.
Significance. If the multi-chunk Q-estimation mechanism proves reliable, the approach could meaningfully advance action chunking in RL by removing the fixed-length bottleneck that limits prior methods. The single-pass architecture offers computational efficiency, and the scale of evaluation (34 tasks) provides a reasonable testbed for generalization claims.
major comments (2)
- [Method (multi-chunk Q estimation) and Experiments] The core technical claim (abstract and method section) rests on the Transformer producing accurate, comparable, and non-interfering Q-value estimates for multiple chunk lengths in one forward pass. No ablations, separate-head controls, auxiliary disentanglement losses, or per-length calibration experiments are described to isolate whether representational interference occurs or whether selection is driven by the adaptive mechanism rather than other architectural factors.
- [Experiments and Results] The experimental results (abstract) report consistent outperformance on 34 tasks but provide no details on run-to-run variance, statistical significance tests, or the exact composition of the fixed-length baselines (e.g., which lengths were tested and how they were chosen). This makes it difficult to assess whether the reported gains are robust or sensitive to hyperparameter choices.
minor comments (2)
- [Abstract] The abstract would benefit from explicitly stating the set of candidate chunk lengths considered and the precise Transformer architecture (number of layers, attention heads, output heads) to aid reproducibility.
- [Method] Notation for the multi-chunk Q-function and the selection rule should be formalized with equations in the method section rather than left at a high-level description.
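One way that formalization might look; the notation below is our own sketch, since the paper's symbols are not reproduced here:

```latex
% Candidate chunk lengths and a chunk-indexed action-value function
% (hypothetical notation), where a_{1:k} is the policy's proposed
% action chunk of length k:
\mathcal{K} = \{k_1, \dots, k_m\}, \qquad
Q_\theta\!\left(s, a_{1:k}, k\right), \quad k \in \mathcal{K}

% One forward pass yields all values
% \{Q_\theta(s, a_{1:k}, k)\}_{k \in \mathcal{K}};
% the agent commits to the length with the highest value:
k^*(s) = \operatorname*{arg\,max}_{k \in \mathcal{K}}
         \; Q_\theta\!\left(s, a_{1:k}, k\right)
```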
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of the multi-chunk Q estimation and experimental details.
Point-by-point responses
- Referee: [Method (multi-chunk Q estimation) and Experiments] The core technical claim (abstract and method section) rests on the Transformer producing accurate, comparable, and non-interfering Q-value estimates for multiple chunk lengths in one forward pass. No ablations, separate-head controls, auxiliary disentanglement losses, or per-length calibration experiments are described to isolate whether representational interference occurs or whether selection is driven by the adaptive mechanism rather than other architectural factors.
  Authors: We agree that additional controls would better isolate the contribution of the joint estimation mechanism. In the revised manuscript we will add an ablation comparing the shared Transformer against a multi-head architecture with separate Q-heads per chunk length, along with an analysis of Q-value correlations across lengths to check for interference. These results will clarify whether adaptive selection is driven by the proposed mechanism. revision: yes
- Referee: [Experiments and Results] The experimental results (abstract) report consistent outperformance on 34 tasks but provide no details on run-to-run variance, statistical significance tests, or the exact composition of the fixed-length baselines (e.g., which lengths were tested and how they were chosen). This makes it difficult to assess whether the reported gains are robust or sensitive to hyperparameter choices.
  Authors: We acknowledge the need for greater experimental transparency. The revised version will report run-to-run variance with standard deviations across multiple seeds, include statistical significance tests comparing ACH to the fixed-length baselines, and explicitly describe the baseline chunk lengths tested together with the selection criteria used. revision: yes
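The kind of reporting the referee asks for is straightforward to sketch. The per-seed returns below are invented purely for illustration (the paper reports none here); the percentile bootstrap is one standard way to attach a confidence interval to the mean difference across seeds:

```python
import random
import statistics

random.seed(0)

# Hypothetical per-seed returns, made up solely to illustrate the format.
ach      = [0.82, 0.79, 0.85, 0.80, 0.83]
baseline = [0.74, 0.77, 0.72, 0.78, 0.75]

def mean_std(xs):
    """Mean and sample standard deviation across seeds."""
    return statistics.mean(xs), statistics.stdev(xs)

def bootstrap_diff_ci(a, b, iters=10_000, alpha=0.05):
    """Percentile-bootstrap CI for the difference in mean return."""
    diffs = []
    for _ in range(iters):
        ra = [random.choice(a) for _ in a]   # resample seeds with replacement
        rb = [random.choice(b) for _ in b]
        diffs.append(statistics.mean(ra) - statistics.mean(rb))
    diffs.sort()
    return (diffs[int(alpha / 2 * iters)],
            diffs[int((1 - alpha / 2) * iters) - 1])

print(mean_std(ach), mean_std(baseline))
lo, hi = bootstrap_diff_ci(ach, baseline)
print(f"95% CI for mean difference: [{lo:.3f}, {hi:.3f}]")
```

A CI that excludes zero would support the robustness claim; reporting the interval alongside per-length baseline results would address both halves of the comment.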
Circularity Check
No circularity: new algorithmic proposal with no self-referential derivations or fitted predictions.
full rationale
The paper introduces ACH as a novel offline-to-online RL algorithm that uses a Transformer to simultaneously estimate Q-values for multiple chunk lengths in one forward pass, then selects the best length adaptively. No equations, derivations, or parameter-fitting steps are described that reduce to prior quantities by construction. The method is presented as an empirical algorithm evaluated on 34 tasks, with no load-bearing self-citations, uniqueness theorems, or ansatzes imported from the authors' prior work. The central claim (adaptive chunking via multi-chunk Q estimation) is independent of its own outputs and does not rename known results or call fitted inputs predictions. This is a standard non-circular algorithmic contribution.
Reference graph
Works this paper leans on
- [1] Michael Albergo and Eric Vanden-Eijnden. 2023. Building Normalizing Flows with Stochastic Interpolants. In The Eleventh International Conference on Learning Representations.
- [2] Philip J Ball, Laura Smith, Ilya Kostrikov, and Sergey Levine. 2023. Efficient online reinforcement learning with offline data. In International Conference on Machine Learning. PMLR, 1577–1594.
- [3] Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. 2025. GR00T N1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734.
- [4] Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. 2024. π0: A Vision-Language-Action Flow Model for General Robot Control. arXiv preprint arXiv:2410.24164.
- [5] Kevin Black, Manuel Y Galliker, and Sergey Levine. 2025. Real-time execution of action chunking flow policies. arXiv preprint arXiv:2506.07339.
- [6] Jongseong Chae, Jongeui Park, Yongjae Shin, Gyeongmin Kim, Seungyul Han, and Youngchul Sung. Flow Actor-Critic for Offline Reinforcement Learning. In The Fourteenth International Conference on Learning Representations.
- [7] Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. 2025. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research 44, 10-11, 1684–1704.
- [8] Perry Dong, Qiyang Li, Dorsa Sadigh, and Chelsea Finn. 2025. EXPO: Stable reinforcement learning with expressive policies. arXiv preprint arXiv:2507.07986.
- [9] Philippe Hansen-Estruch, Ilya Kostrikov, Michael Janner, Jakub Grudzien Kuba, and Sergey Levine. 2023. IDQL: Implicit Q-learning as an actor-critic method with diffusion policies. arXiv preprint arXiv:2304.10573.
- [10] Dan Hendrycks and Kevin Gimpel. 2016. Gaussian Error Linear Units (GELUs). arXiv preprint arXiv:1606.08415.
- [11] Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems 33, 6840–6851.
- [12] Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. 2025. π0.5: a Vision-Language-Action Model with Open-World Generalization. arXiv preprint arXiv:2504.16054.
- [13] Changyeon Kim, Haeone Lee, Younggyo Seo, Kimin Lee, and Yuke Zhu. 2026. DEAS: DEtached value learning with Action Sequence for Scalable Offline RL. In The Fourteenth International Conference on Learning Representations. https://openreview.net/forum?id=bVTaAXeBmE
- [14] Moo Jin Kim, Chelsea Finn, and Percy Liang. 2025. Fine-tuning vision-language-action models: Optimizing speed and success. arXiv preprint arXiv:2502.19645.
- [15] Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In 3rd International Conference on Learning Representations, ICLR 2015.
- [16] Ilya Kostrikov, Ashvin Nair, and Sergey Levine. 2022. Offline Reinforcement Learning with Implicit Q-Learning. In International Conference on Learning Representations.
- [17] Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. 2020. Conservative Q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems 33, 1179–1191.
- [18] Seunghyun Lee, Younggyo Seo, Kimin Lee, Pieter Abbeel, and Jinwoo Shin. 2022. Offline-to-online reinforcement learning via balanced replay and pessimistic Q-ensemble. In Conference on Robot Learning. PMLR, 1702–1712.
- [19] Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. 2020. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643.
- [20] Ge Li, Dong Tian, Hongyi Zhou, Xinkai Jiang, Rudolf Lioutikov, and Gerhard Neumann. 2025. TOP-ERL: Transformer-based Off-Policy Episodic Reinforcement Learning. In The Thirteenth International Conference on Learning Representations.
- [21] Qiyang Li, Seohong Park, and Sergey Levine. 2026. Decoupled Q-Chunking. In The Fourteenth International Conference on Learning Representations. https://openreview.net/forum?id=aqGNdZQL9l
- [22] Qiyang Li, Zhiyuan Zhou, and Sergey Levine. 2025. Reinforcement Learning with Action Chunking. In The Thirty-ninth Annual Conference on Neural Information Processing Systems.
- [23] Yuanchang Liang, Shikai Li, Shiji Song, and Gao Huang. 2026. Adaptive Action Chunking at Inference-time for Vision-Language-Action Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). arXiv preprint arXiv:2604.04161.
- [24] Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. 2023. Flow Matching for Generative Modeling. In The Eleventh International Conference on Learning Representations.
- [25] Qin-Wen Luo, Ming-Kun Xie, Yewen Wang, and Sheng-Jun Huang. 2024. Optimistic critic reconstruction and constrained fine-tuning for general offline-to-online RL. Advances in Neural Information Processing Systems 37, 108167–108207.
- [26] Ajay Mandlekar, Danfei Xu, Josiah Wong, Soroush Nasiriany, Chen Wang, Rohun Kulkarni, Li Fei-Fei, Silvio Savarese, Yuke Zhu, and Roberto Martín-Martín. 2021. What Matters in Learning from Offline Human Demonstrations for Robot Manipulation. In 5th Annual Conference on Robot Learning.
- [28] Mitsuhiko Nakamoto, Simon Zhai, Anikait Singh, Max Sobol Mark, Yi Ma, Chelsea Finn, Aviral Kumar, and Sergey Levine. 2023. Cal-QL: Calibrated offline RL pre-training for efficient online fine-tuning. Advances in Neural Information Processing Systems 36, 62244–62269.
- [29] Kwanyoung Park, Seohong Park, Youngwoon Lee, and Sergey Levine. 2026. Scalable Offline Model-Based RL with Action Chunks. In The Fourteenth International Conference on Learning Representations. https://openreview.net/forum?id=WXGb9unEHo
- [30] Seohong Park, Kevin Frans, Benjamin Eysenbach, and Sergey Levine. 2025. OGBench: Benchmarking Offline Goal-Conditioned RL. In The Thirteenth International Conference on Learning Representations.
- [31] Seohong Park, Kevin Frans, Deepinder Mann, Benjamin Eysenbach, Aviral Kumar, and Sergey Levine. 2025. Horizon Reduction Makes RL Scalable. In The Thirty-ninth Annual Conference on Neural Information Processing Systems.
- [32] Seohong Park, Qiyang Li, and Sergey Levine. 2025. Flow Q-Learning. In International Conference on Machine Learning (ICML).
- [33] Kohei Sendai, Maxime Alvarez, Tatsuya Matsushima, Yutaka Matsuo, and Yusuke Iwasawa.
- [35] Younggyo Seo and Pieter Abbeel. 2025. Coarse-to-fine Q-Network with Action Sequence for Data-Efficient Reinforcement Learning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems. https://openreview.net/forum?id=VoFXUNc9Zh
- [36] Yongjae Shin, Jongseong Chae, Jongeui Park, and Youngchul Sung. 2026. Flow Matching with Injected Noise for Offline-to-Online Reinforcement Learning. In The Fourteenth International Conference on Learning Representations. https://openreview.net/forum?id=6wd38R8L0Z
- [37] Yongjae Shin, Jeonghye Kim, Whiyoung Jung, Sunghoon Hong, Deunsol Yoon, Youngsoo Jang, Geon-Hyeong Kim, Jongseong Chae, Youngchul Sung, Kanghoon Lee, and Woohyung Lim. 2025. Online Pre-Training for Offline-to-Online Reinforcement Learning. In Forty-second International Conference on Machine Learning.
- [38] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. 2015. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning. PMLR, 2256–2265.
- [40] Richard S Sutton, Andrew G Barto, et al. 1998. Reinforcement learning: An introduction. Vol. 1. MIT Press, Cambridge.
- [41] Denis Tarasov, Vladislav Kurenkov, Alexander Nikulin, and Sergey Kolesnikov. 2023. Revisiting the minimalist approach to offline reinforcement learning. Advances in Neural Information Processing Systems 36, 11592–11620.
- [42] Dong Tian, Onur Celik, and Gerhard Neumann. 2026. Chunking the Critic: A Transformer-based Soft Actor-Critic with N-Step Returns. In The Fourteenth International Conference on Learning Representations. https://openreview.net/forum?id=rb5eTktqbc
- [43] Shenzhi Wang, Qisen Yang, Jiawei Gao, Matthieu Lin, Hao Chen, Liwei Wu, Ning Jia, Shiji Song, and Gao Huang. 2023. Train once, get a family: State-adaptive balances for offline-to-online reinforcement learning. Advances in Neural Information Processing Systems 36, 47081–47104.
- [44] Jialong Wu, Haixu Wu, Zihan Qiu, Jianmin Wang, and Mingsheng Long. 2022. Supported policy optimization for offline reinforcement learning. Advances in Neural Information Processing Systems 35, 31278–31291.
- [45] Jiarui Yang, Bin Zhu, Jingjing Chen, and Yu-Gang Jiang. 2026. Actor-critic for continuous action chunks: A reinforcement learning framework for long-horizon robotic manipulation with sparse reward. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 40. 18692–18700.
- [46] Zishun Yu and Xinhua Zhang. 2023. Actor-critic alignment for offline-to-online reinforcement learning. In International Conference on Machine Learning. PMLR, 40452–40474.
- [47] Haichao Zhang, Wei Xu, and Haonan Yu. 2023. Policy Expansion for Bridging Offline-to-Online Reinforcement Learning. In The Eleventh International Conference on Learning Representations.
- [48] Yinmin Zhang, Jie Liu, Chuming Li, Yazhe Niu, Yaodong Yang, Yu Liu, and Wanli Ouyang.
- [49] A Perspective of Q-value Estimation on Offline-to-Online Reinforcement Learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 16908–16916.
- [50] Tony Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. 2023. Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware. Robotics: Science and Systems XIX.
- [51] Zhiyuan Zhou, Andy Peng, Qiyang Li, Sergey Levine, and Aviral Kumar. 2024. Efficient Online Reinforcement Learning Fine-Tuning Need Not Retain Offline Data. In The Thirteenth International Conference on Learning Representations.