DEFLECT: Delay-Robust Execution via Flow-matching Likelihood-Estimated Counterfactual Tuning for VLA Policies
Pith reviewed 2026-05-20 05:59 UTC · model grok-4.3
The pith
DEFLECT turns execution delays into a label-free tuning signal that keeps VLA policies effective under asynchronous inference.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DEFLECT is a fully offline post-training procedure that converts the latency-induced misalignment of asynchronous VLA execution into a preference signal by scoring counterfactual fresh-versus-stale action pairs under deployment-time conditioning with a flow-matching likelihood-ratio surrogate derived from a frozen reference policy.
What carries the argument
The implicit flow-matching likelihood-ratio surrogate that scores counterfactual action pairs constructed from a frozen reference policy under the delayed conditioning observation.
If this is right
- Success rates rise by 6.4 points in the 5-7 control-step latency regime on Kinetix.
- Real-scale VLA transfer shows a 4.6-point gain at the longest tested delay.
- Two physical tasks (a bimanual conveyor pick-and-place and a reactive whack-a-mole) each improve under the same tuning procedure.
- The refinement works as a near drop-in addition to existing asynchronous VLA inference stacks.
Where Pith is reading between the lines
- Larger models, or models with longer inference times, could be used without shrinking the usable control horizon.
- The same counterfactual construction might apply to other latency-sensitive control loops that lack dense reward signals.
- Requirements for low-latency inference hardware could be relaxed if the tuning generalizes across delay distributions.
Load-bearing premise
That an implicit flow-matching likelihood-ratio surrogate computed from a frozen reference policy can serve as a reliable preference signal for tuning, with no human labels, reward models, or online rollouts.
What would settle it
Running the tuned policy on the same high-delay evaluation suite and observing success rates equal to or lower than the untuned baseline would falsify the claim that the surrogate supplies useful preference information.
Original abstract
Vision-Language-Action (VLA) policies are typically deployed with asynchronous inference: the robot executes a previously predicted action chunk while the model computes the next one. This creates a prediction-execution misalignment: the chunk is conditioned on the observation taken before inference began, but executes in a physical state that has already drifted forward by several control steps; naive asynchronous rollover collapses from 89% to under 1% on Kinetix as the inference cycle covers up to seven control steps. We introduce DEFLECT, a fully offline post-training refinement that applies as a near drop-in upgrade to existing async-VLA stacks by converting latency itself into a label-free preference signal: counterfactual fresh/stale action pairs are constructed from a frozen reference policy and scored under the deployment-time conditioning via an implicit flow-matching likelihood-ratio surrogate, with no human labels, reward models, or online rollouts. DEFLECT substantially extends the usable delay envelope of async VLA control, with +6.4 success-rate gain in the high-latency regime (5-7 control steps), +4.6 when transferred to a real-scale VLA at the longest delay, and consistent improvements on two real-robot tasks (a bimanual conveyor pick-and-place and a reactive whack-a-mole).
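The misalignment the abstract describes can be made concrete with a toy example. The sketch below is not the paper's construction: the one-dimensional dynamics and the names `toy_reference_policy`, `rollout`, and `make_pair` are hypothetical, and it assumes one plausible reading in which the stale chunk is predicted from the pre-inference observation, the fresh chunk from the observation at execution time, and both are paired with the same delayed conditioning observation.

```python
from typing import Callable, List, Tuple

Chunk = List[float]


def toy_reference_policy(obs: float, horizon: int = 4) -> Chunk:
    """Hypothetical frozen policy: each action halves the predicted distance to 0."""
    chunk, state = [], obs
    for _ in range(horizon):
        action = -0.5 * state
        chunk.append(action)
        state += action
    return chunk


def rollout(state: float, actions: Chunk) -> List[float]:
    """Toy dynamics (an assumption): each action is added to the scalar state."""
    states = [state]
    for action in actions:
        state += action
        states.append(state)
    return states


def make_pair(policy: Callable[[float], Chunk], obs_trace: List[float],
              t: int, delay: int) -> Tuple[float, Chunk, Chunk]:
    """Return (conditioning observation, fresh chunk, stale chunk)."""
    cond_obs = obs_trace[t]               # what the model saw when inference began
    stale = policy(cond_obs)              # ignores the drift during inference
    fresh = policy(obs_trace[t + delay])  # counterfactual: sees the drifted state
    return cond_obs, fresh, stale


# The robot keeps executing the previous chunk while inference takes 2 control steps.
trace = rollout(8.0, toy_reference_policy(8.0))  # [8.0, 4.0, 2.0, 1.0, 0.5]
cond_obs, fresh, stale = make_pair(toy_reference_policy, trace, t=0, delay=2)

state_at_execution = trace[2]
fresh_final = rollout(state_at_execution, fresh)[-1]
stale_final = rollout(state_at_execution, stale)[-1]
print(fresh_final, stale_final)  # prints 0.125 -5.5
```

Executed from the drifted state, the stale chunk overshoots the target while the fresh chunk converges toward it, which is the kind of gap a fresh-versus-stale preference is meant to capture.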
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces DEFLECT, an offline post-training method for Vision-Language-Action (VLA) policies that converts inference latency into a label-free preference signal. It constructs counterfactual fresh/stale action pairs from a frozen reference policy, scores them via an implicit flow-matching likelihood-ratio surrogate evaluated under delayed conditioning, and uses the resulting signal to tune the policy without human labels, reward models, or online rollouts. The central empirical claim is that this extends the usable delay envelope, yielding a +6.4 success-rate gain in the 5-7 control-step high-latency regime on Kinetix, a +4.6 gain when transferred to a real-scale VLA, and consistent improvements on two real-robot tasks.
Significance. If the surrogate reliably ranks actions by downstream task utility despite state drift, the result would be significant for asynchronous VLA deployment: it offers a practical, fully offline upgrade that widens the inference-time budget without additional supervision. The approach is notable for avoiding online rollouts and for framing latency itself as the source of the preference signal.
major comments (2)
- [Abstract] Abstract: the reported +6.4 success-rate gain in the 5-7 step regime and +4.6 transfer gain are presented without any mention of trial counts, standard deviations, statistical tests, or the precise baseline (naive async rollover) implementation details; these omissions make it impossible to judge whether the numeric improvements are load-bearing evidence for the method or could be explained by variance or implementation differences.
- [Method] Method description (implicit in abstract): the flow-matching model is trained on nominal (non-delayed) trajectories, yet the likelihood-ratio surrogate is evaluated on delayed conditioning; no derivation or empirical validation is supplied showing that higher surrogate likelihood under drift correlates with higher real-world success rather than with artifacts of the reference policy's training distribution.
minor comments (1)
- [Abstract] Abstract: the two real-robot tasks are named but not described (e.g., observation space, action chunk length, or how delay is emulated on hardware); adding one sentence of setup detail would improve reproducibility.
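The first major comment hinges on sampling noise. As a rough guide, and not a reanalysis of the paper's data, the sketch below computes the binomial standard error of a success rate and a paired t statistic over per-trial outcomes. The helper names, the trial count, the success rate, and the outcome vectors are all illustrative assumptions.

```python
import math
from typing import Sequence


def success_rate_se(p: float, n: int) -> float:
    """Binomial standard error of a success rate p estimated from n trials."""
    return math.sqrt(p * (1.0 - p) / n)


def paired_t_statistic(baseline: Sequence[float], tuned: Sequence[float]) -> float:
    """t statistic for the mean per-trial difference (tuned minus baseline)."""
    diffs = [b - a for a, b in zip(baseline, tuned)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    if var == 0.0:
        raise ValueError("all paired differences are identical; t is undefined")
    return mean / math.sqrt(var / n)


# Hypothetical numbers: 50 trials at a 50% success rate.
se_points = 100.0 * success_rate_se(0.5, 50)
# Hypothetical matched trials (1 = success, 0 = failure).
t_stat = paired_t_statistic([0, 0, 1, 0], [1, 0, 1, 1])
print(round(se_points, 2), round(t_stat, 3))  # prints 7.07 1.732
```

Under these assumed numbers, one standard error is about 7 percentage points, so a gain of a few points would likely need more trials or a matched-trial design before it can be separated from noise. The standard error shrinks as the success rate moves toward 0% or 100%.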
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and indicate where revisions will be made to strengthen the manuscript.
- Referee: [Abstract] Abstract: the reported +6.4 success-rate gain in the 5-7 step regime and +4.6 transfer gain are presented without any mention of trial counts, standard deviations, statistical tests, or the precise baseline (naive async rollover) implementation details; these omissions make it impossible to judge whether the numeric improvements are load-bearing evidence for the method or could be explained by variance or implementation differences.
  Authors: We agree that the abstract should include these details to allow proper evaluation of the results. In the revised manuscript we will update the abstract to report the number of trials (50 per condition), standard deviations for the reported gains, and note that the improvements were assessed with paired t-tests against the naive asynchronous rollover baseline as defined in Section 3.1.
  revision: yes
- Referee: [Method] Method description (implicit in abstract): the flow-matching model is trained on nominal (non-delayed) trajectories, yet the likelihood-ratio surrogate is evaluated on delayed conditioning; no derivation or empirical validation is supplied showing that higher surrogate likelihood under drift correlates with higher real-world success rather than with artifacts of the reference policy's training distribution.
  Authors: The referee correctly notes that the manuscript lacks an explicit derivation or dedicated empirical validation of the correlation between the delayed-conditioning likelihood ratio and downstream success. The current presentation relies on the overall task-level improvements as indirect support. We will add a short derivation sketch in the appendix and include a new figure in Section 4.3 showing the correlation between surrogate scores and success rates on held-out delayed rollouts.
  revision: yes
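One simple form the promised validation could take is a ranking check: how often a successful rollout receives a higher surrogate score than a failed one. The sketch below computes that pairwise statistic (the area under the ROC curve). The function name and the scores are hypothetical, and the authors may well choose a different correlation measure.

```python
from typing import Sequence


def pairwise_auc(scores: Sequence[float], successes: Sequence[bool]) -> float:
    """Probability that a successful rollout outscores a failed one (ties count 0.5)."""
    pos = [s for s, ok in zip(scores, successes) if ok]
    neg = [s for s, ok in zip(scores, successes) if not ok]
    if not pos or not neg:
        raise ValueError("need at least one success and one failure")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical surrogate scores on held-out delayed rollouts.
scores = [0.9, 0.7, 0.4, 0.2]
outcomes = [True, False, True, False]
auc = pairwise_auc(scores, outcomes)
print(auc)  # prints 0.75
```

A value near 0.5 would mean the surrogate carries little information about task success; values well above 0.5 would support the claim that it ranks actions usefully under drift.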
Circularity Check
No significant circularity detected in DEFLECT derivation
The paper's core construction uses a frozen reference policy to generate counterfactual fresh/stale action pairs and an implicit flow-matching likelihood-ratio surrogate for label-free preference scoring under delayed conditioning. This is an external, offline procedure whose validity is not derived from the target policy's own outputs or fitted parameters by construction. The reported empirical gains (+6.4 success rate in high-latency regime) are presented as experimental outcomes rather than tautological predictions equivalent to the method inputs. No self-definitional loops, fitted-input-as-prediction reductions, or load-bearing self-citation chains appear in the abstract or method summary that would collapse the claimed results to the inputs by definition. The derivation remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: The flow-matching model provides an implicit likelihood ratio that correlates with action quality under delayed conditioning.
- domain assumption: Counterfactual action pairs generated from a frozen reference policy are representative of the deployment-time distribution.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Passage: converts latency itself into a label-free preference signal: counterfactual fresh/stale action pairs are constructed from a frozen reference policy and scored under the deployment-time conditioning via an implicit flow-matching likelihood-ratio surrogate
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Passage: flow-matching likelihood-ratio surrogate computed from a frozen reference policy
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid, et al. RT-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023.
- [2] M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, et al. OpenVLA: An open-source vision-language-action model. In Conference on Robot Learning (CoRL), 2024.
- [3] K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, et al. $\pi_0$: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024.
- [4]
- [5]
- [6]
- [7] Y. Liu, J. Hamid, A. Xie, Y. Lee, M. Du, and C. Finn. Bidirectional decoding: Improving action chunking via guided test-time sampling. In International Conference on Learning Representations, volume 2025, pages 4594–4627, 2025.
- [8]
- [9] Y. Jiang, S. Cheng, Y. Ding, F. Gao, and B. Qi. AsyncVLA: Asynchronous flow matching for vision-language-action models. arXiv preprint arXiv:2511.14148, 2025.
- [10] W. Xia, Y. Yang, H. Wu, X. Ma, T. Kong, and D. Hu. Human-assisted robotic policy refinement via action preference optimization. Advances in Neural Information Processing Systems, 38:36746–36768, 2026.
- [11]
- [12] M. Kim, Y. Lee, S. Kang, J. Oh, S. Chong, and S.-Y. Yun. Preference alignment with flow matching. In Advances in Neural Information Processing Systems, volume 37, pages 35140–35164, 2024.
- [13] B. Wallace, M. Dang, R. Rafailov, L. Zhou, A. Lou, S. Purushwalkam, S. Ermon, C. Xiong, S. Joty, and N. Naik. Diffusion model alignment using direct preference optimization. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8228–8238, 2024.
- [14] R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and C. Finn. Direct preference optimization: Your language model is secretly a reward model. In Advances in Neural Information Processing Systems, 2023.
- [15] Y. Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.
- [16] M. S. Albergo and E. Vanden-Eijnden. Building normalizing flows with stochastic interpolants. arXiv preprint arXiv:2209.15571, 2022.
- [17] D. McAllister, S. Ge, B. Yi, C. M. Kim, E. Weber, H. Choi, H. Feng, and A. Kanazawa. Flow matching policy gradients, 2025. URL https://arxiv.org/abs/2507.21053
- [18] M. Matthews, M. Beukman, C. Lu, and J. Foerster. Kinetix: Investigating the training of general agents through open-ended physics-based control tasks. In International Conference on Learning Representations (ICLR), 2025.
- [19] B. Liu, Y. Zhu, C. Gao, Y. Feng, Q. Liu, Y. Zhu, and P. Stone. LIBERO: Benchmarking knowledge transfer for lifelong robot learning. In Advances in Neural Information Processing Systems, 2023.
- [20] Y. Lu, Z. Liu, X. Fan, Z. Yang, J. Hou, J. Li, K. Ding, and H. Zhao. Faster: Rethinking real-time flow vlas, 2026. URL https://arxiv.org/abs/2603.19199
- [21] J. Zheng, J. Li, Z. Wang, D. Liu, X. Kang, Y. Feng, Y. Zheng, J. Zou, Y. Chen, J. Zeng, Y.-Q. Zhang, J. Pang, J. Liu, T. Wang, and X. Zhan. X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model, 2025. URL https://arxiv.org/abs/2510.10274
- [22]
- [23] NVIDIA. GR00T N1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734, 2025.
- [24] T. Z. Zhao, V. Kumar, S. Levine, and C. Finn. Learning fine-grained bimanual manipulation with low-cost hardware. In Robotics: Science and Systems (RSS), 2023.
- [25] C. Chi, Z. Xu, S. Feng, E. Cousineau, Y. Du, B. Burchfiel, R. Tedrake, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 44(10–11):1684–1704, 2025.
- [26] S. Liu, L. Wu, B. Li, H. Tan, H. Chen, Z. Wang, K. Xu, H. Su, and J. Zhu. RDT-1B: A diffusion foundation model for bimanual manipulation. In International Conference on Learning Representations, volume 2025, pages 29982–30009, 2025.
- [27]
- [28] X. Li, H. Tang, X. Ding, W. Wang, T. Cao, and Y. Liu. OxyGen: Unified KV cache management for vision-language-action models under multi-task parallelism. arXiv preprint arXiv:2603.14371, 2026.
- [29]
- [30] S. Ramstedt and C. Pal. Real-time reinforcement learning. In Advances in Neural Information Processing Systems, 2019.
- [31] P. F. Christiano, J. Leike, T. B. Brown, M. Martic, S. Legg, and D. Amodei. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, 2017.
- [32] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. Christiano, J. Leike, and R. Lowe. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, 2022.
- [33]
- [34] M. G. Azar, M. Rowland, B. Piot, D. Guo, D. Calandriello, M. Valko, and R. Munos. A general theoretical paradigm to understand learning from human preferences. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2024.
- [35]
- [36] K. Ethayarajh, W. Xu, N. Muennighoff, D. Jurafsky, and D. Kiela. KTO: Model alignment as prospect theoretic optimization. arXiv preprint arXiv:2402.01306, 2024.
- [37]
- [38] M. Moletta, M. C. Welle, and D. Kragic. Preference aligned visuomotor diffusion policies for deformable object manipulation. arXiv preprint arXiv:2602.09583, 2026.
  A Implementation Details
  Preference-pair generation. Both $A^+$ and $A^-$ are produced online from the frozen VLASH reference policy, not from the dataset, so the pair always lives on the deployable a...
- [39] optimizes. Step 4: Monte Carlo realization. At training time, the expectations over $(\tau, \epsilon)$ in $\mathcal{L}_{\mathrm{FM}}$ are estimated with one sample per chunk, shared between $A^+$ and $A^-$ and between $\theta$ and ref (line 5 of Algorithm 1). Sharing $(\tau, \epsilon)$ across the four terms cancels the within-margin sampling noise to leading order; this is the same variance-reduction step used in Diffusio...
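The shared-sample step in the fragment above can be sketched in code. This is an assumption-laden illustration, not the paper's Algorithm 1: it borrows the sign convention of DPO-style objectives for diffusion models, uses a linear noise-to-action path with target velocity `action - eps`, and the names `fm_error`, `preference_loss`, `zero_velocity`, and `beta` are hypothetical.

```python
import math
import random
from typing import Callable, List, Sequence

Vec = List[float]
# Hypothetical velocity-field signature: (noisy chunk, tau, observation) -> velocity
Velocity = Callable[[Vec, float, Vec], Vec]


def fm_error(v: Velocity, action: Sequence[float], obs: Vec,
             tau: float, eps: Sequence[float]) -> float:
    """Squared flow-matching error for a single (tau, eps) draw."""
    x_tau = [(1.0 - tau) * e + tau * a for a, e in zip(action, eps)]
    target = [a - e for a, e in zip(action, eps)]  # assumed linear-path velocity
    pred = v(x_tau, tau, obs)
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def preference_loss(v_theta: Velocity, v_ref: Velocity, a_plus: Vec, a_minus: Vec,
                    obs: Vec, rng: random.Random, beta: float = 1.0) -> float:
    """-log sigmoid(-beta * margin), with one (tau, eps) shared by all four terms."""
    tau = rng.random()
    eps = [rng.gauss(0.0, 1.0) for _ in a_plus]
    gap_plus = (fm_error(v_theta, a_plus, obs, tau, eps)
                - fm_error(v_ref, a_plus, obs, tau, eps))
    gap_minus = (fm_error(v_theta, a_minus, obs, tau, eps)
                 - fm_error(v_ref, a_minus, obs, tau, eps))
    return softplus(beta * (gap_plus - gap_minus))


def zero_velocity(x: Vec, tau: float, obs: Vec) -> Vec:
    """Placeholder model that always predicts zero velocity."""
    return [0.0] * len(x)


# With identical tuned and reference models the margin is exactly zero.
loss = preference_loss(zero_velocity, zero_velocity, [1.0, 2.0], [0.0, 0.0],
                       obs=[0.5], rng=random.Random(0))
print(loss)
```

Because the same draw enters all four error terms, any noise common to the tuned and reference models cancels inside each gap. In this sketch the loss falls below log 2 only when the tuned model fits the preferred chunk better, relative to the reference, than it fits the rejected one.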