arxiv: 2603.17834 · v2 · submitted 2026-03-18 · 💻 cs.RO · cs.AI

Recognition: 2 theorem links

· Lean Theorem

Generative Control as Optimization: Time Unconditional Flow Matching for Adaptive and Robust Robotic Control

Zunzhe Zhang , Runhan Huang , Yicheng Liu , Shaoting Zhu , Linzhan Mou , Hang Zhao

Authors on Pith no claims yet

Pith reviewed 2026-05-15 08:35 UTC · model grok-4.3

classification 💻 cs.RO cs.AI

keywords flow matchingrobotic imitation learningadaptive controlvelocity fieldout-of-distribution detectiongenerative policiesaction optimization

0 comments

The pith

A stationary velocity field turns flow matching into adaptive optimization for robotic control.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Generative Control as Optimization (GeCO), a method that learns a stationary velocity field over action sequences instead of relying on fixed-time integration in flow matching. Expert behaviors act as attractors in this field, so inference can run until convergence rather than a preset number of steps. This setup lets the policy spend less computation on straightforward states and more on complex ones. The same field norm also acts as a built-in detector for states outside the training distribution. The approach is tested as a drop-in replacement for standard flow-matching components in robot policies and vision-language-action models.

Core claim

GeCO learns a stationary velocity field in the action-sequence space where expert behaviors form stable attractors. Consequently, test-time inference becomes an adaptive process that allocates computation based on convergence—exiting early for simple states while refining longer for difficult ones. Furthermore, this stationary geometry yields an intrinsic, training-free safety signal, as the field norm at the optimized action serves as a robust out-of-distribution (OOD) detector, remaining low for in-distribution states while significantly increasing for anomalies.

What carries the argument

The stationary velocity field over action sequences that has expert behaviors as its stable attractors.

If this is right

Test-time computation varies with task difficulty instead of staying fixed.
Success rates improve on standard benchmarks when used in place of conventional flow matching.
The field norm provides an immediate, training-free signal for detecting out-of-distribution inputs.
Scaling to large vision-language-action models occurs without changing the training procedure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The adaptive stopping rule may reduce energy use on embedded robot hardware for routine motions.
Combining the norm signal with other uncertainty measures could strengthen safety layers in deployed systems.
The attractor geometry might extend to multi-task settings where different expert behaviors compete as separate basins.

Load-bearing premise

A learned stationary velocity field will reliably form stable attractors for expert behaviors across the state distribution and the field norm will serve as a robust OOD detector without excessive false positives or negatives in real robotic deployments.

What would settle it

Compare the velocity field norm on in-distribution versus out-of-distribution states during policy execution and check whether early stopping based on convergence preserves task success rates across varying state difficulties.

Figures

Figures reproduced from arXiv: 2603.17834 by Hang Zhao, Linzhan Mou, Runhan Huang, Shaoting Zhu, Yicheng Liu, Zunzhe Zhang.

**Figure 1.** Figure 1: Generative Control as Optimization (GeCO). (A) The Paradigm Shift: Unlike standard flow matching which relies on rigid, time-dependent integration schedules (top), GeCO learns a stationary velocity field where inference becomes an iterative optimization process toward stable attractors (bottom). (B) Adaptive Computation: This formulation enables the policy to dynamically allocate computational budget based… view at source ↗

**Figure 2.** Figure 2: Computation Follows Task Complexity. We visualize the spatial distribution of inference effort along a single rollout. The first three panels (a–c) are sampled from LIBERO-Spatial, and the last three panels (d–f) are from LIBERO-Object. The color of each line encodes the number of function evaluations (NFE) required for convergence at that state, ranging from blue (NFE = 1) to red (NFE = 20). This visualiz… view at source ↗

**Figure 3.** Figure 3: GeCO policy execution for the Nut Assembly task. The robot performs high-precision alignment and rotational insertion. 1 2 3 [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗

**Figure 4.** Figure 4: GeCO policy execution for the Chemistry Tube Arrangement task. The policy adaptively handles the tight-tolerance insertion of fragile tubes [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

**Figure 5.** Figure 5: Task setups for the real-world robotic deployment, showing the configurations for both the Nut Assembly and the Chemistry Tube Arrangement tasks. • Nut Assembly: The robot shall grasp a plastic nut, accurately align it with a threaded bolt, and perform the insertion. A trial is considered successful only when both two nuts are fully threaded onto their respective bolts. • Chemistry Tube Arrangement: The ro… view at source ↗

**Figure 6.** Figure 6: Adaptive computation during real-world execution. We visualize the per-step Number of Function Evaluations (NFE) and total inference time over time for representative physical rollouts of Nut Assembly and Tube Arrangement. For Nut Assembly, GeCO uses an average of 3.50 NFEs and 166.7 ms total inference time per decision. For Tube Arrangement, GeCO uses an average of 3.73 NFEs and 147.0 ms per decision. Rat… view at source ↗

read the original abstract

Diffusion models and flow matching have become a cornerstone of robotic imitation learning, yet they suffer from a structural inefficiency where inference is often bound to a fixed integration schedule that is agnostic to state complexity. This paradigm forces the policy to expend the same computational budget on trivial motions as it does on complex tasks. We introduce Generative Control as Optimization (GeCO), a time-unconditional framework that transforms action synthesis from trajectory integration into iterative optimization. GeCO learns a stationary velocity field in the action-sequence space where expert behaviors form stable attractors. Consequently, test-time inference becomes an adaptive process that allocates computation based on convergence--exiting early for simple states while refining longer for difficult ones. Furthermore, this stationary geometry yields an intrinsic, training-free safety signal, as the field norm at the optimized action serves as a robust out-of-distribution (OOD) detector, remaining low for in-distribution states while significantly increasing for anomalies. We validate GeCO on standard simulation benchmarks and demonstrate seamless scaling to pi0-series Vision-Language-Action (VLA) models. As a plug-and-play replacement for standard flow-matching heads, GeCO improves success rates and efficiency with an optimization-native mechanism for safe deployment. Video and code can be found at https://hrh6666.github.io/GeCO/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

GeCO reframes flow matching as optimization over a stationary velocity field for adaptive compute and a free OOD signal in robotic policies, but the attractor stability needs direct verification.

read the letter

GeCO's main move is to drop time conditioning from flow matching so that inference becomes iterative optimization over a fixed velocity field in action-sequence space. Expert actions are meant to act as attractors, letting the process stop early on easy states and run longer on hard ones while the field norm flags anomalies without extra training. This directly tackles the fixed-budget waste in standard diffusion or flow policies for robotics. The framing is clean and the potential payoff is practical: variable compute plus an intrinsic safety check that could plug into existing heads like those in pi0 VLAs. The abstract reports gains on simulation benchmarks and seamless scaling, which suggests the idea is at least implementable. The soft spot is whether the learned stationary field actually produces stable fixed points at the expert actions. Flow matching trains conditional velocities along paths, but removing the time input does not automatically enforce v=0 on the data or guarantee attractive dynamics. Without explicit regularization or post-training checks, the optimizer could settle at wrong points or fail to converge reliably, which would weaken both the early-exit benefit and the OOD claim. The provided text gives no experimental controls, error bars, or ablation on convergence behavior, so those results stay plausible rather than confirmed. This paper is for people already running generative imitation policies who care about deployment efficiency and safety monitoring. It deserves a serious referee because the reformulation is coherent and targets a concrete inefficiency, even if the empirical section will need tighter validation on attractor properties and statistical robustness.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Generative Control as Optimization (GeCO), a time-unconditional flow matching framework for robotic imitation learning. It learns a stationary velocity field in action-sequence space such that expert behaviors become stable attractors. This allows test-time inference via iterative optimization that adapts computation to task difficulty, with early exit for simple states, and uses the velocity field norm as a training-free OOD detector. The approach is presented as a plug-and-play replacement for standard flow-matching heads, validated on simulation benchmarks and scaled to pi0 VLA models, claiming improvements in success rates and efficiency.

Significance. Should the stationary velocity field reliably produce stable attractors for expert actions, this work would offer a meaningful advance in efficient and robust robotic control policies. The adaptive inference and training-free safety signal address important practical limitations of current generative models in robotics, potentially reducing computational overhead and enhancing deployment safety.

major comments (2)

[Abstract] Abstract and method description: The central claim that the learned time-unconditional velocity field v(a,s) forms stable attractors requires v(a*,s)=0 at expert actions a* with attractive dynamics under da/dt = v(a,s). Standard flow-matching training matches velocities along paths but does not impose v=0 on the data manifold or Lyapunov stability when time conditioning is removed, so iterative optimization may converge to incorrect points. This directly undermines the adaptive early-exit mechanism and the claim that ||v(a_opt,s)|| is a reliable OOD detector.
[Experiments] Experimental section: The abstract asserts validation on simulation benchmarks and scaling to pi0 VLA models with improved success rates, but provides no details on experimental controls, baselines, error bars, statistical significance, or ablation studies on the optimization procedure. This leaves the quantitative claims unverified and the soundness of the adaptive and safety benefits difficult to assess.

minor comments (2)

[Method] The manuscript should clarify the exact form of the time-unconditional loss and any regularization terms used to encourage fixed points at data.
[Figures] Figure captions and pseudocode for the test-time optimization loop would improve clarity on how convergence-based early exit is implemented.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to provide additional theoretical clarification and experimental details.

read point-by-point responses

Referee: [Abstract] Abstract and method description: The central claim that the learned time-unconditional velocity field v(a,s) forms stable attractors requires v(a*,s)=0 at expert actions a* with attractive dynamics under da/dt = v(a,s). Standard flow-matching training matches velocities along paths but does not impose v=0 on the data manifold or Lyapunov stability when time conditioning is removed, so iterative optimization may converge to incorrect points. This directly undermines the adaptive early-exit mechanism and the claim that ||v(a_opt,s)|| is a reliable OOD detector.

Authors: We appreciate this important theoretical point. In the GeCO formulation, the stationary velocity field is obtained by regressing to the time-averaged target velocity from the flow-matching objective; this construction yields v(a*,s) = 0 at expert actions a* by design, as the data distribution is the equilibrium of the learned dynamics. While global Lyapunov stability is not guaranteed by the basic regression loss alone, local attractivity around expert actions is supported by our convergence analysis and empirical results on both simulation and VLA tasks. To address the concern directly, we will add a new subsection (Section 3.3) that formally states the fixed-point property, provides a local Lyapunov argument for stability near the data manifold, and discusses the implications for early-exit and the ||v||-based OOD signal. These additions will strengthen the justification without altering the core claims. revision: yes
Referee: [Experiments] Experimental section: The abstract asserts validation on simulation benchmarks and scaling to pi0 VLA models with improved success rates, but provides no details on experimental controls, baselines, error bars, statistical significance, or ablation studies on the optimization procedure. This leaves the quantitative claims unverified and the soundness of the adaptive and safety benefits difficult to assess.

Authors: We agree that the original experimental section was insufficiently detailed. In the revised manuscript we will expand Section 4 to include: complete descriptions of all baselines (standard flow matching, diffusion policies, and ablated variants), implementation hyperparameters, results reported with mean and standard deviation over at least five random seeds, statistical significance tests (paired t-tests with p-values), and dedicated ablations on the optimization procedure (number of steps, convergence threshold, and their effect on success rate and compute). These revisions will make the quantitative improvements in success rate, efficiency, and the reliability of the adaptive and safety mechanisms fully verifiable. revision: yes

Circularity Check

0 steps flagged

No significant circularity; stationary velocity field presented as emergent learned object

full rationale

The paper derives GeCO by training a time-unconditional velocity field via standard flow matching on action sequences, then interprets the resulting geometry as forming stable attractors for expert behaviors. This interpretation is an empirical consequence of the training objective rather than an algebraic identity or fitted parameter renamed as a prediction. No load-bearing step reduces to self-citation for uniqueness, ansatz smuggling, or self-definition; the OOD norm signal and adaptive early-exit follow directly from the learned field's properties without presupposing the target claims. The derivation chain remains independent of its inputs and self-contained against external flow-matching results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Central claim rests on the existence of a stationary velocity field whose attractors correspond to expert behaviors; no explicit free parameters or invented physical entities are named in the abstract.

axioms (1)

domain assumption Expert behaviors can be represented as stable attractors in a stationary velocity field over action sequences
Invoked to justify both adaptive convergence and the OOD norm signal.

invented entities (1)

stationary velocity field no independent evidence
purpose: To enable time-unconditional iterative optimization for action synthesis
Core new object introduced by GeCO; no independent evidence provided in abstract.

pith-pipeline@v0.9.0 · 5544 in / 1302 out tokens · 42423 ms · 2026-05-15T08:35:11.112074+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel; Jcost_unit0; Jcost_pos_of_ne_one echoes

?

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

GeCO learns a stationary velocity field in the action-sequence space where expert behaviors form stable attractors... the field norm at the optimized action serves as a robust out-of-distribution (OOD) detector
IndisputableMonolith/Foundation/BranchSelection.lean branch_selection; RCLCombiner_isCoupling_iff echoes

?

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

g⋆(a,ε,γ)=(ε−a)c(γ) with c(1)=0... transforming ground-truth action sequences into natural stationary equilibrium points

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · 21 internal anchors

[1]

Is conditional generative model- ing all you need for decision-making?arXiv preprint arXiv:2211.15657,

Ajay, A., Du, Y ., Gupta, A., Tenenbaum, J., Jaakkola, T., and Agrawal, P. Is conditional generative model- ing all you need for decision-making?arXiv preprint arXiv:2211.15657,

work page arXiv
[2]

Qwen Technical Report

Bai, J., Bai, S., Chu, Y ., Cui, Z., Dang, K., Deng, X., Fan, Y ., Ge, W., Han, Y ., Huang, F., et al. Qwen technical report.arXiv preprint arXiv:2309.16609,

work page internal anchor Pith review Pith/arXiv arXiv
[3]

PaliGemma: A versatile 3B VLM for transfer

Beyer, L., Steiner, A., Pinto, A. S., Kolesnikov, A., Wang, X., Salz, D., Neumann, M., Alabdulmohsin, I., Tschan- nen, M., Bugliarello, E., et al. Paligemma: A versatile 3b vlm for transfer.arXiv preprint arXiv:2407.07726,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

Black, K., Brown, N., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Groom, L., Hausman, K., Ichter, B., et al. π0: A vision-language-action flow model for 9 Generative Control as Optimization general robot control.arXiv preprint arXiv:2410.24164,

work page internal anchor Pith review Pith/arXiv arXiv
[5]

Y ., and Levine, S

Black, K., Galliker, M. Y ., and Levine, S. Real-time exe- cution of action chunking flow policies.arXiv preprint arXiv:2506.07339,

work page arXiv
[6]

UniVLA: Learning to Act Anywhere with Task-centric Latent Actions

Bu, Q., Yang, Y ., Cai, J., Gao, S., Ren, G., Yao, M., Luo, P., and Li, H. Univla: Learning to act anywhere with task- centric latent actions.arXiv preprint arXiv:2505.06111,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

Large Video Planner Enables Generalizable Robot Control

Chen, B., Zhang, T., Geng, H., Song, K., Zhang, C., Li, P., Freeman, W. T., Malik, J., Abbeel, P., Tedrake, R., et al. Large video planner enables generalizable robot control. arXiv preprint arXiv:2512.15840, 2025a. Chen, T., Chen, Z., Chen, B., Cai, Z., Liu, Y ., Li, Z., Liang, Q., Lin, X., Ge, Y ., Gu, Z., et al. Robotwin 2.0: A scal- able data generato...

work page internal anchor Pith review Pith/arXiv arXiv
[8]

Conditioning matters: Training diffusion policies is faster than you think.arXiv preprint arXiv:2505.11123,

Dong, Z., Liu, Y ., Li, Y ., Zhao, H., and Hao, J. Conditioning matters: Training diffusion policies is faster than you think.arXiv preprint arXiv:2505.11123,

work page arXiv
[9]

J., Paterson, C., and Habli, I

Hodge, V . J., Paterson, C., and Habli, I. Out-of-distribution detection for safety assurance of ai and autonomous sys- tems.arXiv preprint arXiv:2510.21254,

work page arXiv
[10]

Flexible loco- motion learning with diffusion model predictive control

Huang, R., Balim, H., Yang, H., and Du, Y . Flexible loco- motion learning with diffusion model predictive control. arXiv preprint arXiv:2510.04234, 2025a. Huang, X., Truong, T., Zhang, Y ., Yu, F., Sleiman, J. P., Hodgins, J., Sreenath, K., and Farshidian, F. Diffuse-cloc: Guided diffusion for physics-based character look-ahead control.ACM Transactions o...

work page arXiv
[11]

Planning with Diffusion for Flexible Behavior Synthesis

Janner, M., Du, Y ., Tenenbaum, J. B., and Levine, S. Plan- ning with diffusion for flexible behavior synthesis.arXiv preprint arXiv:2205.09991,

work page internal anchor Pith review Pith/arXiv arXiv
[12]

Galaxea open-world dataset and g0 dual-system vla model.arXiv preprint arXiv:2509.00576,

Jiang, T., Yuan, T., Liu, Y ., Lu, C., Cui, J., Liu, X., Cheng, S., Gao, J., Xu, H., and Zhao, H. Galaxea open-world dataset and g0 dual-system vla model.arXiv preprint arXiv:2509.00576,

work page arXiv
[13]

OpenVLA: An Open-Source Vision-Language-Action Model

Kim, M. J., Pertsch, K., Karamcheti, S., Xiao, T., Balakr- ishna, A., Nair, S., Rafailov, R., Foster, E., Lam, G., San- keti, P., et al. Openvla: An open-source vision-language- action model.arXiv preprint arXiv:2406.09246,

work page internal anchor Pith review Pith/arXiv arXiv
[14]

Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

Kim, M. J., Finn, C., and Liang, P. Fine-tuning vision- language-action models: Optimizing speed and success. arXiv preprint arXiv:2502.19645,

work page internal anchor Pith review Pith/arXiv arXiv
[15]

Gr-rl: Going dexterous and precise for long-horizon robotic manipulation.arXiv preprint arXiv:2512.01801, 2025

Li, Y ., Ma, X., Xu, J., Cui, Y ., Cui, Z., Han, Z., Huang, L., Kong, T., Liu, Y ., Niu, H., et al. Gr-rl: Going dexterous and precise for long-horizon robotic manipulation.arXiv preprint arXiv:2512.01801,

work page arXiv
[16]

Code as Policies: Language Model Programs for Embodied Control

Liang, J., Huang, W., Xia, F., Xu, P., Hausman, K., Ichter, B., Florence, P., and Zeng, A. Code as policies: Language model programs for embodied control.arXiv preprint arXiv:2209.07753,

work page internal anchor Pith review Pith/arXiv arXiv
[17]

Flow Matching for Generative Modeling

10 Generative Control as Optimization Lipman, Y ., Chen, R. T., Ben-Hamu, H., Nickel, M., and Le, M. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747,

work page internal anchor Pith review Pith/arXiv arXiv
[18]

Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

Liu, B., Zhu, Y ., Gao, C., Feng, Y ., Liu, Q., Zhu, Y ., and Stone, P. Libero: Benchmarking knowledge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36:44776–44791, 2023a. Liu, H., Li, C., Wu, Q., and Lee, Y . J. Visual instruction tun- ing.Advances in neural information processing systems, 36:34892–34916, 2023b. ...

work page internal anchor Pith review Pith/arXiv arXiv
[19]

Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

Liu, Y ., Zhang, K., Li, Y ., Yan, Z., Gao, C., Chen, R., Yuan, Z., Huang, Y ., Sun, H., Gao, J., et al. Sora: A review on background, technology, limitations, and opportunities of large vision models.arXiv preprint arXiv:2402.17177,

work page internal anchor Pith review Pith/arXiv arXiv
[20]

DINOv2: Learning Robust Visual Features without Supervision

Oquab, M., Darcet, T., Moutakanni, T., V o, H., Szafraniec, M., Khalidov, V ., Fernandez, P., Haziza, D., Massa, F., El- Nouby, A., et al. Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193,

work page internal anchor Pith review Pith/arXiv arXiv
[21]

FAST: Efficient Action Tokenization for Vision-Language-Action Models

Pertsch, K., Stachowicz, K., Ichter, B., Driess, D., Nair, S., Vuong, Q., Mees, O., Finn, C., and Levine, S. Fast: Efficient action tokenization for vision-language-action models.arXiv preprint arXiv:2501.09747,

work page internal anchor Pith review Pith/arXiv arXiv
[22]

Videovla: Video generators can be generalizable robot manipulators.arXiv preprint arXiv:2512.06963,

Shen, Y ., Wei, F., Du, Z., Liang, Y ., Lu, Y ., Yang, J., Zheng, N., and Guo, B. Videovla: Video generators can be generalizable robot manipulators.arXiv preprint arXiv:2512.06963,

work page arXiv
[23]

SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

Shukor, M., Aubakirova, D., Capuano, F., Kooijmans, P., Palma, S., Zouitine, A., Aractingi, M., Pascal, C., Russi, M., Marafioti, A., et al. Smolvla: A vision-language- action model for affordable and efficient robotics.arXiv preprint arXiv:2506.01844,

work page internal anchor Pith review Pith/arXiv arXiv
[24]

Denoising Diffusion Implicit Models

Song, J., Meng, C., and Ermon, S. Denoising diffusion implicit models.arXiv preprint arXiv:2010.02502,

work page internal anchor Pith review Pith/arXiv arXiv 2010
[25]

Is noise condition- ing necessary for denoising generative models?arXiv preprint arXiv:2502.13129,

Sun, Q., Jiang, Z., Zhao, H., and He, K. Is noise condition- ing necessary for denoising generative models?arXiv preprint arXiv:2502.13129,

work page arXiv
[26]

DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

Tian, X., Gu, J., Li, B., Liu, Y ., Wang, Y ., Zhao, Z., Zhan, K., Jia, P., Lang, X., and Zhao, H. Drivevlm: The conver- gence of autonomous driving and large vision-language models.arXiv preprint arXiv:2402.12289,

work page internal anchor Pith review Pith/arXiv arXiv
[27]

and Du, Y

Wang, R. and Du, Y . Equilibrium matching: Genera- tive modeling with implicit energy-based models.arXiv preprint arXiv:2510.02300,

work page arXiv
[28]

Video models are zero-shot learners and reasoners

Wiedemer, T., Li, Y ., Vicol, P., Gu, S. S., Matarese, N., Swer- sky, K., Kim, B., Jaini, P., and Geirhos, R. Video mod- els are zero-shot learners and reasoners.arXiv preprint arXiv:2509.20328,

work page internal anchor Pith review Pith/arXiv arXiv
[29]

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Yang, Z., Teng, J., Zheng, W., Ding, M., Huang, S., Xu, J., Yang, Y ., Hong, W., Zhang, X., Feng, G., et al. Cogvideox: Text-to-video diffusion models with an ex- pert transformer.arXiv preprint arXiv:2408.06072,

work page internal anchor Pith review Pith/arXiv arXiv
[30]

3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations

Ze, Y ., Zhang, G., Zhang, K., Hu, C., Wang, M., and Xu, H. 3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations.arXiv preprint arXiv:2403.03954,

work page internal anchor Pith review Pith/arXiv arXiv
[31]

Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

Zhao, T. Z., Kumar, V ., Levine, S., and Finn, C. Learn- ing fine-grained bimanual manipulation with low-cost hardware.arXiv preprint arXiv:2304.13705,

work page internal anchor Pith review Pith/arXiv arXiv
[32]

V ., Guntupalli, J

Zhou, G., Swaminathan, S., Raju, R. V ., Guntupalli, J. S., Lehrach, W., Ortiz, J., Dedieu, A., L´azaro-Gredilla, M., and Murphy, K. Diffusion model predictive control.arXiv preprint arXiv:2410.05364,

work page arXiv
[33]

We use the same network backbone as a Continuous Rectified Flow policy (Liu et al., 2022), but differ in the learning objective and the test-time inference algorithm

codebase. We use the same network backbone as a Continuous Rectified Flow policy (Liu et al., 2022), but differ in the learning objective and the test-time inference algorithm. A.2. Model Architecture We follow the Rectified Flow architecture with a 1D diffusion-style transformer and a frozen vision-language conditioner. Vision-language condition.We use a...

work page 2022
[34]

The T5 hidden dimension is set by the pretrained configuration: t5 hidden dim←T5Config.from pretrained(t5 model).d model

text encoder. The T5 hidden dimension is set by the pretrained configuration: t5 hidden dim←T5Config.from pretrained(t5 model).d model. The conditioner is instantiated as: •ViTAndT5VisionLanguageConditionwithemb dim=768,freeze=True,To=1, andn views=2. Policy backbone.The action generator uses a DiT-style 1D transformer with cross-attention conditioning: •...

work page 2025
[35]

ForNut Assembly, GeCO uses an average of 3.50 NFEs and 166.7 ms total inference time per decision

Figure 6.Adaptive computation during real-world execution.We visualize the per-step Number of Function Evaluations (NFE) and total inference time over time for representative physical rollouts ofNut AssemblyandTube Arrangement. ForNut Assembly, GeCO uses an average of 3.50 NFEs and 166.7 ms total inference time per decision. ForTube Arrangement, GeCO uses...

work page 2097