Preserve Support, Not Correspondence: Dynamic Routing for Offline Reinforcement Learning
Pith reviewed 2026-05-08 12:21 UTC · model grok-4.3
The pith
Dynamic routing of dataset actions to multiple latent candidates lets a one-step offline RL actor improve locally on supported actions instead of being tied to a single fixed correspondence per sample.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DROL trains a one-step actor by sampling K candidate actions from a bounded latent prior for each state, assigning each dataset action to its nearest candidate via dynamic top-1 routing, and updating only that winner with a combination of behavior cloning and critic guidance. Because the assignments are recomputed from the evolving candidate geometry, ownership of supported regions can shift across candidates, giving the actor room for local improvements that fixed pointwise extraction cannot capture.
What carries the argument
Top-1 dynamic routing, which reassigns each dataset action to the nearest of K latent candidates at every update using the current candidate geometry.
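To make the mechanism concrete, here is a minimal sketch of one routing-and-update step. This is an illustrative reconstruction, not the paper's implementation: the L2 distance, the uniform latent prior, the latent dimension `z_dim`, the candidate count `K=8`, and the BC-versus-critic weight `alpha` are all assumptions.

```python
# Minimal sketch of one top-1 dynamic-routing update (illustrative).
# Assumes a PyTorch actor pi(s, z) -> action that broadcasts over leading
# batch dimensions, and a critic Q(s, a) -> scalar per sample.
import torch

def drol_update(actor, critic, states, data_actions, z_dim, K=8, alpha=1.0):
    B = states.shape[0]
    # Sample K candidate actions per state from a bounded latent prior
    # (uniform on [-1, 1]^z_dim here; the exact prior is an assumption).
    z = 2 * torch.rand(B, K, z_dim, device=states.device) - 1
    s_rep = states.unsqueeze(1).expand(-1, K, -1)
    candidates = actor(s_rep, z)                               # (B, K, act_dim)

    # Top-1 dynamic routing: assign each dataset action to its nearest
    # candidate, recomputed from the *current* candidate geometry.
    dists = (candidates - data_actions.unsqueeze(1)).pow(2).sum(-1)   # (B, K)
    winner_idx = dists.argmin(dim=1)                                  # (B,)
    winners = candidates[torch.arange(B, device=states.device), winner_idx]

    # Only the winner appears in the loss: BC toward its assigned dataset
    # action plus critic ascent. Losing candidates get no gradient this step,
    # which is what lets ownership of a region migrate between candidates.
    bc_loss = (winners - data_actions).pow(2).sum(-1).mean()
    q_term = critic(states, winners).mean()
    return bc_loss - alpha * q_term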
If this is right
- DROL remains competitive with the one-step FQL baseline on OGBench and D4RL while improving several OGBench task groups.
- The method keeps single-pass inference at test time.
- Ownership of supported regions can migrate between candidates as learning proceeds.
- Local improvements become possible even when the critic direction and the nearest data point disagree on a given sample.
Where Pith is reading between the lines
- The same reassignment idea could be tested in other offline settings that have multimodal action support per state.
- Dynamic routing might reduce reliance on a strong iterative teacher during extraction.
- On tasks with changing data support over training, the method could maintain coverage better than static pairings.
Load-bearing premise
Recomputing nearest-candidate assignments from the current latent geometry will reliably let ownership of supported regions shift and produce local improvements without the actor drifting away from dataset-supported actions.
What would settle it
Measure whether DROL obtains higher returns than a pointwise baseline on tasks whose action distributions contain multiple supported modes per state and where the critic prefers an action different from the nearest fixed match.
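As a purely illustrative toy of this test (invented for this review, not an experiment from the paper), consider a single state with two supported 1-D action modes and a critic peaking inside the mode far from the dataset mean; all constants below are made up, and selecting the final action by Q is a stand-in for however the latent is chosen at test time:

```python
# Toy 1-D illustration: two supported action modes, a critic peaking at the
# right-hand mode, and a comparison of a pointwise compromise against top-1
# routed updates with K = 2 candidates.
import numpy as np

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-0.8, 0.05, 50),    # left supported mode
                       rng.normal(0.7, 0.05, 50)])    # right supported mode
dq = lambda a: -2.0 * (a - 0.7)                       # grad of Q(a) = -(a-0.7)^2

# Pointwise-extraction caricature: one output is pulled toward both the
# dataset mean (~ -0.05, unsupported) and the critic's peak, and settles on
# a compromise in the unsupported gap between the modes.
a = 0.0
for _ in range(2000):
    a -= 0.02 * (2 * (a - data.mean()) - dq(a))
print(f"pointwise compromise: {a:.3f}")               # ~0.33, between the modes

# Top-1 routing: each data point pulls only its current nearest candidate,
# so one candidate can own the right-hand mode and follow the critic there.
c = np.array([-0.1, 0.1])
for _ in range(2000):
    own = np.abs(data[:, None] - c[None, :]).argmin(axis=1)
    for k in range(2):
        if np.any(own == k):
            bc = 2 * (c[k] - data[own == k].mean())
            c[k] -= 0.02 * (bc - dq(c[k]))
best = c[np.argmax(-(c - 0.7) ** 2)]
print(f"routed, Q-preferred candidate: {best:.3f}")   # ~0.70, inside a mode
```

The pointwise output lands in the unsupported gap, while the routed actor's Q-preferred candidate sits inside a supported mode; a real version of the settling experiment would measure the analogous return gap on benchmark tasks.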
Original abstract
One-step offline RL actors are attractive because they avoid backpropagating through long iterative samplers and keep inference cheap, but they still have to improve under a critic without drifting away from actions that the dataset can support. In recent one-step extraction pipelines, a strong iterative teacher provides one target action for each latent draw, and the same student output is asked to do both jobs: move toward higher Q and stay near that paired endpoint. If those two directions disagree, the loss resolves them as a compromise on that same sample, even when a nearby better action remains locally supported by the data. We propose DROL, a latent-conditioned one-step actor trained with top-1 dynamic routing. For each state, the actor samples $K$ candidate actions from a bounded latent prior, assigns each dataset action to its nearest candidate, and updates only that winner with Behavior Cloning and critic guidance. Because the routing is recomputed from the current candidate geometry, ownership of a supported region can shift across candidates over the course of learning. This gives a one-step actor room to make local improvements that pointwise extraction struggles to capture, while retaining single-pass inference at test time. On OGBench and D4RL, DROL is competitive with the one-step FQL baseline, improving many OGBench task groups while remaining strong on both AntMaze and Adroit. Project page: https://muzhancun.github.io/preprints/DROL.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces DROL, a latent-conditioned one-step actor for offline RL that uses top-1 dynamic routing: for each state, K candidate actions are sampled from a bounded latent prior, each dataset action is assigned to its nearest candidate, and only the winner receives Behavior Cloning plus critic guidance. Routing is recomputed from the current candidate geometry at each step so that ownership of supported regions can shift. The method is positioned as allowing local improvements that fixed pointwise extraction cannot capture while retaining single-pass test-time inference. Experiments report competitiveness with the FQL baseline on OGBench (improving several task groups) and strong performance on D4RL AntMaze and Adroit.
Significance. If the dynamic reassignment mechanism can be shown to produce stable, productive ownership shifts without drift or oscillation, the approach would offer a principled way to relax the strict correspondence constraint in one-step offline RL while preserving support. The reported benchmark competitiveness indicates practical utility for latent-conditioned actors. The absence of a derivation or controlled ablation isolating the benefit of recomputed routing over static assignment limits the strength of the central claim.
major comments (2)
- [§3] Dynamic Routing: the central claim that recomputing nearest-candidate assignments enables local improvements that pointwise extraction cannot capture rests on the unproven assumption that the combined BC + critic loss on the current winner keeps that winner inside the dataset support. No analysis or bound is given showing that small critic-driven moves cannot flip assignments repeatedly or pull a winner away from its assigned dataset actions when two candidates are close or the critic is imperfect.
- [§4] Experiments: the paper reports competitiveness with FQL but provides no ablation that isolates the contribution of dynamic reassignment versus a static nearest-candidate baseline, nor any analysis of assignment stability (e.g., frequency of ownership flips or distance of winners to their assigned data points) across training; a sketch of both diagnostics follows this list. Without these, it is difficult to attribute observed gains specifically to the dynamic-routing mechanism.
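A minimal sketch of the two requested diagnostics, assuming candidates and dataset actions for a fixed probe batch (with fixed probe latents, so winner indices are comparable across steps); the function name and bookkeeping are illustrative:

```python
# Sketch of the assignment-stability diagnostics asked for above (illustrative;
# how assignments are tracked across training is an assumption, not the paper's).
import torch

@torch.no_grad()
def routing_diagnostics(candidates, data_actions, prev_winners=None):
    """candidates: (B, K, act_dim); data_actions: (B, act_dim), both for a
    fixed probe batch with fixed probe latents across measurements."""
    dists = (candidates - data_actions.unsqueeze(1)).pow(2).sum(-1)   # (B, K)
    winners = dists.argmin(dim=1)                                     # (B,)
    # Winner-to-assigned-data distance: how far the updated candidate sits
    # from the dataset action it currently owns.
    winner_dist = dists.gather(1, winners.unsqueeze(1)).sqrt().mean().item()
    # Ownership flip rate: fraction of probe actions whose nearest candidate
    # changed since the previous measurement.
    flip_rate = (None if prev_winners is None
                 else (winners != prev_winners).float().mean().item())
    return winners, winner_dist, flip_rate
```

Logged every few thousand steps, a sustained high flip rate or a growing winner distance would be direct evidence of the oscillation and drift this comment worries about.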
minor comments (2)
- The abstract and method section would benefit from an explicit statement of the value of K used in all reported experiments and a brief discussion of its sensitivity.
- Figure captions and experimental tables should include error bars or standard deviations over seeds to allow assessment of statistical reliability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on DROL. The comments correctly identify gaps in the theoretical grounding and empirical isolation of the dynamic routing mechanism. We respond to each major comment below and outline the revisions we will make.
Point-by-point responses
- Referee: [§3] Dynamic Routing: the central claim that recomputing nearest-candidate assignments enables local improvements that pointwise extraction cannot capture rests on the unproven assumption that the combined BC + critic loss on the current winner keeps that winner inside the dataset support. No analysis or bound is given showing that small critic-driven moves cannot flip assignments repeatedly or pull a winner away from its assigned dataset actions when two candidates are close or the critic is imperfect.
Authors: We agree that the manuscript lacks a formal bound or stability analysis for the assignment process under the joint BC + critic objective. The bounded latent prior is intended to constrain candidate movement, but we do not derive guarantees against repeated flips or drift when candidates are proximate. In the revision we will add a dedicated discussion subsection on this issue, including a qualitative argument based on the top-1 selection and bounded support, together with new empirical measurements of assignment stability (ownership flip rates and winner-to-data distances) across training on representative tasks. While a complete theoretical guarantee remains outside the current scope, these additions will directly address the concern. Revision: partial.
- Referee: [§4] Experiments: the paper reports competitiveness with FQL but provides no ablation that isolates the contribution of dynamic reassignment versus a static nearest-candidate baseline, nor any analysis of assignment stability (e.g., frequency of ownership flips or distance of winners to their assigned data points) across training. Without these, it is difficult to attribute observed gains specifically to the dynamic-routing mechanism.
Authors: The referee is correct that the current experiments do not isolate the dynamic component via a static-assignment control or report stability diagnostics. We will add both in the revision: (1) a static-routing baseline in which nearest-candidate assignments are computed once at initialization and held fixed, and (2) training curves and summary statistics for ownership flip frequency and average winner-to-assigned-data distance on the OGBench and D4RL suites. These results will allow direct attribution of any performance difference to the recomputation of routing. Revision: yes.
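A sketch of the static-routing control in (1), under the assumption that it is implemented by freezing each dataset action's winner index the first time that action is seen (with its probe latents held fixed); dynamic routing would simply return the freshly computed winners instead of the cache:

```python
# Sketch of the static-assignment baseline (illustrative): nearest-candidate
# routing is computed once per dataset action and frozen thereafter, so
# ownership of supported regions can never migrate between candidates.
import torch

class StaticRouter:
    def __init__(self):
        self.frozen = {}                        # dataset index -> winner slot

    def winners(self, idx, candidates, data_actions):
        dists = (candidates - data_actions.unsqueeze(1)).pow(2).sum(-1)
        current = dists.argmin(dim=1)           # what dynamic routing would use
        return torch.tensor([
            self.frozen.setdefault(int(i), int(w))
            for i, w in zip(idx, current)       # freeze on first encounter
        ], device=candidates.device)
```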
Circularity Check
No circularity: new dynamic routing procedure is self-contained and empirically evaluated
full rationale
The paper introduces DROL as a novel one-step actor training procedure that samples K latent candidates, performs nearest-candidate assignment of dataset actions, and applies BC + critic updates only to the current winner, with routing recomputed each step. No derivation chain, equation, or claim reduces by construction to its own inputs, relabels fitted parameters as predictions, or rests on load-bearing self-citations. The central benefit (allowing ownership shifts for local improvements while preserving single-pass inference) is argued directly from the recomputation mechanism and supported by benchmark results on OGBench and D4RL, rather than any self-referential or ansatz-smuggled step. This is the common case of an independent algorithmic proposal.
Axiom & Free-Parameter Ledger
free parameters (1)
- K, the number of latent candidates sampled per state.
axioms (1)
- Domain assumption: a bounded latent prior yields candidate actions whose geometry permits meaningful nearest-neighbor assignments that can shift over training.
Reference graph
Works this paper leans on
- [1] Franz Aurenhammer. Voronoi diagrams---a survey of a fundamental geometric data structure. ACM Computing Surveys, 23(3): 345--405, 1991. doi:10.1145/116873.116880
- [2] Jongseong Chae, Jongeui Park, Yongjae Shin, Gyeongmin Kim, Seungyul Han, and Youngchul Sung. Flow actor-critic for offline reinforcement learning. In International Conference on Learning Representations (ICLR), 2026.
- [3] Nicolas Espinosa-Dice, Yiyi Zhang, Yiding Chen, Bradley Guo, Owen Oertell, Gokul Swamy, Kiante Brantley, and Wen Sun. Scaling offline RL via efficient and expressive shortcut models. arXiv preprint arXiv:2505.22866, 2025.
- [4] Pete Florence, Corey Lynch, Andy Zeng, Oscar Ramirez, Ayzaan Wahid, Laura Downs, Adrian Wong, Johnny Lee, Igor Mordatch, and Jonathan Tompson. Implicit behavioral cloning. In Conference on Robot Learning (CoRL), pages 158--168, 2022.
- [5] Justin Fu, Aviral Kumar, Ofir Nachum, G. Tucker, and Sergey Levine. D4RL: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020.
- [6] Scott Fujimoto and Shixiang Shane Gu. A minimalist approach to offline reinforcement learning. In Neural Information Processing Systems (NeurIPS), 2021.
- [7] Philippe Hansen-Estruch, Ilya Kostrikov, Michael Janner, Jakub Grudzien Kuba, and Sergey Levine. IDQL: Implicit Q-learning as an actor-critic method with diffusion policies. arXiv preprint arXiv:2304.10573, 2023.
- [8] Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline reinforcement learning with implicit Q-learning. In International Conference on Learning Representations (ICLR), 2022.
- [9] Aviral Kumar, Aurick Zhou, G. Tucker, and Sergey Levine. Conservative Q-learning for offline reinforcement learning. In Neural Information Processing Systems (NeurIPS), 2020.
- [10] Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint, 2020.
- [11] Ke Li and Jitendra Malik. Implicit maximum likelihood estimation. In International Conference on Learning Representations (ICLR), 2019.
- [12] Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. In International Conference on Learning Representations (ICLR), 2023.
- [13] Zhancun Mu et al. DeFlow: Decoupling manifold modeling and value maximization for offline policy extraction. arXiv preprint arXiv:2601.10471, 2026.
- [14] Atsuyuki Okabe, Barry Boots, Kokichi Sugihara, and Sung Nok Chiu. Spatial Tessellations: Concepts and Applications of Voronoi Diagrams. Wiley, 2nd edition, 2000. doi:10.1002/9780470317013
- [15] Seohong Park, Kevin Frans, Benjamin Eysenbach, and Sergey Levine. OGBench: Benchmarking offline goal-conditioned RL. In International Conference on Learning Representations (ICLR), 2025.
- [16] Seohong Park, Qiyang Li, and Sergey Levine. Flow Q-learning. arXiv preprint arXiv:2502.02538, 2025.
- [17] Krishan Rana, Robert Lee, David Pershouse, and Niko Suenderhauf. IMLE Policy: Fast and sample efficient visuomotor policy learning via implicit maximum likelihood estimation. In Robotics: Science and Systems (RSS), 2025.
- [18] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning (ICML), 2015.
- [19] Zeyuan Wang, Da Li, Yulin Chen, Ye Shi, Liang Bai, Tianyuan Yu, and Yanwei Fu. One-step generative policies with Q-learning: A reformulation of MeanFlow. arXiv preprint arXiv:2511.13035, 2025.
- [20] Zhendong Wang, Jonathan J. Hunt, and Mingyuan Zhou. Diffusion policies as an expressive policy class for offline reinforcement learning. In International Conference on Learning Representations (ICLR), 2023.
- [21] Shiji Zhang et al. ReFORM: Reflected flows for on-support offline RL via noise reflection. In International Conference on Learning Representations (ICLR), 2026.