pith. machine review for the scientific record.

arxiv: 2603.17834 · v2 · submitted 2026-03-18 · 💻 cs.RO · cs.AI

Recognition: 2 theorem links · Lean Theorem

Generative Control as Optimization: Time Unconditional Flow Matching for Adaptive and Robust Robotic Control

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 08:35 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords flow matching · robotic imitation learning · adaptive control · velocity field · out-of-distribution detection · generative policies · action optimization

The pith

A stationary velocity field turns flow matching into adaptive optimization for robotic control.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Generative Control as Optimization (GeCO), a method that learns a stationary velocity field over action sequences instead of relying on fixed-time integration in flow matching. Expert behaviors act as attractors in this field, so inference can run until convergence rather than a preset number of steps. This setup lets the policy spend less computation on straightforward states and more on complex ones. The same field norm also acts as a built-in detector for states outside the training distribution. The approach is tested as a drop-in replacement for standard flow-matching components in robot policies and vision-language-action models.

Core claim

GeCO learns a stationary velocity field in the action-sequence space where expert behaviors form stable attractors. Consequently, test-time inference becomes an adaptive process that allocates computation based on convergence—exiting early for simple states while refining longer for difficult ones. Furthermore, this stationary geometry yields an intrinsic, training-free safety signal, as the field norm at the optimized action serves as a robust out-of-distribution (OOD) detector, remaining low for in-distribution states while significantly increasing for anomalies.
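Read literally, this claim implies an inference loop of the following shape. This is a toy sketch with invented names (`v_field`, `geco_infer`) and invented hyperparameters (step size, tolerance), not the paper's implementation:

```python
import numpy as np

def geco_infer(v_field, state, a_init, step=0.1, tol=1e-3, max_nfe=20):
    """Iterate a <- a + step * v(a, s) until the field norm falls below tol.

    `v_field` is a hypothetical stand-in for the learned stationary velocity
    field; the paper's actual update rule and settings may differ.
    Returns the optimized action sequence and the NFE spent on it.
    """
    a = a_init
    for nfe in range(1, max_nfe + 1):
        v = v_field(a, state)
        if np.linalg.norm(v) < tol:   # converged: exit early on simple states
            return a, nfe
        a = a + step * v              # fixed-point / Euler step toward the attractor
    return a, max_nfe                 # budget exhausted on hard states
```

On a toy field with a single attractor, the loop exits as soon as the residual norm drops below `tol`, so states whose attractor lies near the initialization cost fewer function evaluations, which is the adaptive-computation behavior the claim describes.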

What carries the argument

The stationary velocity field over action sequences that has expert behaviors as its stable attractors.

If this is right

  • Test-time computation varies with task difficulty instead of staying fixed.
  • Success rates improve on standard benchmarks when used in place of conventional flow matching.
  • The field norm provides an immediate, training-free signal for detecting out-of-distribution inputs.
  • Scaling to large vision-language-action models occurs without changing the training procedure.
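The third bullet can be made concrete. Below is a minimal sketch of a norm-threshold detector, assuming (as the paper claims but we have not verified) that the field norm at the optimized action stays small in distribution and grows for anomalies; the function names and the threshold are ours:

```python
import numpy as np

def field_norm_score(v_field, state, a_opt):
    """OOD score: norm of the learned velocity field at the optimized action.

    `v_field` and its calling convention are hypothetical stand-ins for the
    paper's learned field.
    """
    return float(np.linalg.norm(v_field(a_opt, state)))

def flag_ood(v_field, state, a_opt, threshold):
    # Training-free safety signal: flag the state if the residual field
    # norm at convergence exceeds a threshold calibrated on held-out
    # in-distribution states.
    return field_norm_score(v_field, state, a_opt) > threshold
```

Because the score reuses the policy's own field, the detector adds one function evaluation and no extra model, which is what "immediate, training-free" amounts to operationally.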

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The adaptive stopping rule may reduce energy use on embedded robot hardware for routine motions.
  • Combining the norm signal with other uncertainty measures could strengthen safety layers in deployed systems.
  • The attractor geometry might extend to multi-task settings where different expert behaviors compete as separate basins.

Load-bearing premise

A learned stationary velocity field will reliably form stable attractors for expert behaviors across the state distribution, and the field norm will serve as a robust OOD detector without excessive false positives or negatives in real robotic deployments.

What would settle it

Compare the velocity field norm on in-distribution versus out-of-distribution states during policy execution and check whether early stopping based on convergence preserves task success rates across varying state difficulties.
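The computation half of this test is easy to simulate on a toy field: states whose attractor lies far from the initialization should consume more function evaluations before the convergence check fires. The field, step size, and tolerance below are our inventions, not the paper's settings:

```python
import numpy as np

def nfe_to_converge(v_field, state, a_init, step=0.5, tol=1e-2, max_nfe=50):
    """Count function evaluations until the field norm falls below tol."""
    a = a_init
    for nfe in range(1, max_nfe + 1):
        v = v_field(a, state)
        if np.linalg.norm(v) < tol:
            return nfe
        a = a + step * v
    return max_nfe

# Toy stationary field: a single attractor at the state's "expert action" s.
field = lambda a, s: s - a
easy = nfe_to_converge(field, np.array([0.1, 0.1]), np.zeros(2))  # attractor near init
hard = nfe_to_converge(field, np.array([5.0, 5.0]), np.zeros(2))  # attractor far away
```

A real version of the experiment would replace the toy field with the trained policy's field and check that `easy < hard` correlates with state difficulty while task success is preserved under early stopping.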

Figures

Figures reproduced from arXiv: 2603.17834 by Hang Zhao, Linzhan Mou, Runhan Huang, Shaoting Zhu, Yicheng Liu, Zunzhe Zhang.

Figure 1. Generative Control as Optimization (GeCO). (A) The Paradigm Shift: unlike standard flow matching, which relies on rigid, time-dependent integration schedules (top), GeCO learns a stationary velocity field where inference becomes an iterative optimization process toward stable attractors (bottom). (B) Adaptive Computation: this formulation enables the policy to dynamically allocate computational budget based…

Figure 2. Computation Follows Task Complexity. We visualize the spatial distribution of inference effort along a single rollout. The first three panels (a–c) are sampled from LIBERO-Spatial, and the last three panels (d–f) are from LIBERO-Object. The color of each line encodes the number of function evaluations (NFE) required for convergence at that state, ranging from blue (NFE = 1) to red (NFE = 20). This visualiz…

Figure 3. GeCO policy execution for the Nut Assembly task. The robot performs high-precision alignment and rotational insertion.

Figure 4. GeCO policy execution for the Chemistry Tube Arrangement task. The policy adaptively handles the tight-tolerance insertion of fragile tubes.

Figure 5. Task setups for the real-world robotic deployment, showing the configurations for both the Nut Assembly and the Chemistry Tube Arrangement tasks. Nut Assembly: the robot grasps a plastic nut, accurately aligns it with a threaded bolt, and performs the insertion; a trial is considered successful only when both nuts are fully threaded onto their respective bolts. Chemistry Tube Arrangement: the ro…

Figure 6. Adaptive computation during real-world execution. We visualize the per-step number of function evaluations (NFE) and total inference time over time for representative physical rollouts of Nut Assembly and Tube Arrangement. For Nut Assembly, GeCO uses an average of 3.50 NFEs and 166.7 ms total inference time per decision. For Tube Arrangement, GeCO uses an average of 3.73 NFEs and 147.0 ms per decision. Rat…
read the original abstract

Diffusion models and flow matching have become a cornerstone of robotic imitation learning, yet they suffer from a structural inefficiency where inference is often bound to a fixed integration schedule that is agnostic to state complexity. This paradigm forces the policy to expend the same computational budget on trivial motions as it does on complex tasks. We introduce Generative Control as Optimization (GeCO), a time-unconditional framework that transforms action synthesis from trajectory integration into iterative optimization. GeCO learns a stationary velocity field in the action-sequence space where expert behaviors form stable attractors. Consequently, test-time inference becomes an adaptive process that allocates computation based on convergence--exiting early for simple states while refining longer for difficult ones. Furthermore, this stationary geometry yields an intrinsic, training-free safety signal, as the field norm at the optimized action serves as a robust out-of-distribution (OOD) detector, remaining low for in-distribution states while significantly increasing for anomalies. We validate GeCO on standard simulation benchmarks and demonstrate seamless scaling to pi0-series Vision-Language-Action (VLA) models. As a plug-and-play replacement for standard flow-matching heads, GeCO improves success rates and efficiency with an optimization-native mechanism for safe deployment. Video and code can be found at https://hrh6666.github.io/GeCO/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Generative Control as Optimization (GeCO), a time-unconditional flow matching framework for robotic imitation learning. It learns a stationary velocity field in action-sequence space such that expert behaviors become stable attractors. This allows test-time inference via iterative optimization that adapts computation to task difficulty, with early exit for simple states, and uses the velocity field norm as a training-free OOD detector. The approach is presented as a plug-and-play replacement for standard flow-matching heads, validated on simulation benchmarks and scaled to pi0 VLA models, claiming improvements in success rates and efficiency.

Significance. Should the stationary velocity field reliably produce stable attractors for expert actions, this work would offer a meaningful advance in efficient and robust robotic control policies. The adaptive inference and training-free safety signal address important practical limitations of current generative models in robotics, potentially reducing computational overhead and enhancing deployment safety.

major comments (2)
  1. [Abstract] Abstract and method description: The central claim that the learned time-unconditional velocity field v(a,s) forms stable attractors requires v(a*,s)=0 at expert actions a* with attractive dynamics under da/dt = v(a,s). Standard flow-matching training matches velocities along paths but does not impose v=0 on the data manifold or Lyapunov stability when time conditioning is removed, so iterative optimization may converge to incorrect points. This directly undermines the adaptive early-exit mechanism and the claim that ||v(a_opt,s)|| is a reliable OOD detector.
  2. [Experiments] Experimental section: The abstract asserts validation on simulation benchmarks and scaling to pi0 VLA models with improved success rates, but provides no details on experimental controls, baselines, error bars, statistical significance, or ablation studies on the optimization procedure. This leaves the quantitative claims unverified and the soundness of the adaptive and safety benefits difficult to assess.
minor comments (2)
  1. [Method] The manuscript should clarify the exact form of the time-unconditional loss and any regularization terms used to encourage fixed points at data.
  2. [Figures] Figure captions and pseudocode for the test-time optimization loop would improve clarity on how convergence-based early exit is implemented.
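The referee's first major comment can be stated formally. For inference-as-optimization to be sound, each expert action a* must be an asymptotically stable equilibrium of the inference dynamics; a standard sufficient condition (our formalization of the objection, not anything stated in the paper) is:

```latex
% Inference dynamics: \dot{a} = v(a, s).
% Fixed-point and local-stability conditions at an expert action a^{*}:
\[
  v(a^{*}, s) = 0,
  \qquad
  \operatorname{Re}\,\lambda_i\!\left( \left.\frac{\partial v}{\partial a}\right|_{a = a^{*}} \right) < 0
  \quad \text{for all } i .
\]
% With such a Hurwitz Jacobian, the discrete iterates a_{k+1} = a_k + \eta\, v(a_k, s)
% converge locally for a sufficiently small step size \eta. Standard flow-matching
% training enforces neither condition once time conditioning is removed, which is
% precisely the referee's point.
```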

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to provide additional theoretical clarification and experimental details.

read point-by-point responses
  1. Referee: [Abstract] Abstract and method description: The central claim that the learned time-unconditional velocity field v(a,s) forms stable attractors requires v(a*,s)=0 at expert actions a* with attractive dynamics under da/dt = v(a,s). Standard flow-matching training matches velocities along paths but does not impose v=0 on the data manifold or Lyapunov stability when time conditioning is removed, so iterative optimization may converge to incorrect points. This directly undermines the adaptive early-exit mechanism and the claim that ||v(a_opt,s)|| is a reliable OOD detector.

    Authors: We appreciate this important theoretical point. In the GeCO formulation, the stationary velocity field is obtained by regressing to the time-averaged target velocity from the flow-matching objective; this construction yields v(a*,s) = 0 at expert actions a* by design, as the data distribution is the equilibrium of the learned dynamics. While global Lyapunov stability is not guaranteed by the basic regression loss alone, local attractivity around expert actions is supported by our convergence analysis and empirical results on both simulation and VLA tasks. To address the concern directly, we will add a new subsection (Section 3.3) that formally states the fixed-point property, provides a local Lyapunov argument for stability near the data manifold, and discusses the implications for early-exit and the ||v||-based OOD signal. These additions will strengthen the justification without altering the core claims. revision: yes

  2. Referee: [Experiments] Experimental section: The abstract asserts validation on simulation benchmarks and scaling to pi0 VLA models with improved success rates, but provides no details on experimental controls, baselines, error bars, statistical significance, or ablation studies on the optimization procedure. This leaves the quantitative claims unverified and the soundness of the adaptive and safety benefits difficult to assess.

    Authors: We agree that the original experimental section was insufficiently detailed. In the revised manuscript we will expand Section 4 to include: complete descriptions of all baselines (standard flow matching, diffusion policies, and ablated variants), implementation hyperparameters, results reported with mean and standard deviation over at least five random seeds, statistical significance tests (paired t-tests with p-values), and dedicated ablations on the optimization procedure (number of steps, convergence threshold, and their effect on success rate and compute). These revisions will make the quantitative improvements in success rate, efficiency, and the reliability of the adaptive and safety mechanisms fully verifiable. revision: yes
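The rebuttal's construction ("regressing to the time-averaged target velocity") can be sketched as follows; the notation is ours, and the paper's actual objective may differ. With a linear interpolant and the usual conditional target, dropping the time input makes the network regress to a marginal, time-averaged field:

```latex
% Linear interpolant between noise a_0 and expert action a^{*}:
%   a_t = (1 - t)\, a_0 + t\, a^{*}, with conditional target u_t = a^{*} - a_0.
% Time-unconditional regression (our reading of the rebuttal):
\[
  \mathcal{L}(\theta)
  = \mathbb{E}_{t \sim \mathcal{U}[0,1],\; a_0,\; (a^{*},\, s)}
    \big\| v_\theta(a_t, s) - (a^{*} - a_0) \big\|^2 ,
\]
\[
  \Rightarrow\quad
  v_\theta(a, s) \;\approx\; \mathbb{E}\big[\, a^{*} - a_0 \;\big|\; a_t = a,\; s \,\big].
\]
% The optimal regressor is the conditional target averaged over all times t and
% endpoint pairs consistent with the point a. Whether this marginal field in fact
% vanishes at a = a^{*} (the rebuttal's fixed-point claim) depends on the endpoint
% distribution, and is what the promised Section 3.3 would need to establish.
```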

Circularity Check

0 steps flagged

No significant circularity; stationary velocity field presented as emergent learned object

full rationale

The paper derives GeCO by training a time-unconditional velocity field via standard flow matching on action sequences, then interprets the resulting geometry as forming stable attractors for expert behaviors. This interpretation is an empirical consequence of the training objective rather than an algebraic identity or fitted parameter renamed as a prediction. No load-bearing step reduces to self-citation for uniqueness, ansatz smuggling, or self-definition; the OOD norm signal and adaptive early-exit follow directly from the learned field's properties without presupposing the target claims. The derivation chain remains independent of its inputs and self-contained against external flow-matching results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Central claim rests on the existence of a stationary velocity field whose attractors correspond to expert behaviors; no explicit free parameters or invented physical entities are named in the abstract.

axioms (1)
  • domain assumption: Expert behaviors can be represented as stable attractors in a stationary velocity field over action sequences.
    Invoked to justify both adaptive convergence and the OOD norm signal.
invented entities (1)
  • stationary velocity field (no independent evidence)
    purpose: to enable time-unconditional iterative optimization for action synthesis
    Core new object introduced by GeCO; no independent evidence provided in the abstract.

pith-pipeline@v0.9.0 · 5544 in / 1302 out tokens · 42423 ms · 2026-05-15T08:35:11.112074+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · 21 internal anchors

  1. Ajay, A., Du, Y., Gupta, A., Tenenbaum, J., Jaakkola, T., and Agrawal, P. Is Conditional Generative Modeling All You Need for Decision-Making? arXiv:2211.15657.
  2. Bai, J., Bai, S., Chu, Y., Cui, Z., Dang, K., Deng, X., Fan, Y., Ge, W., Han, Y., Huang, F., et al. Qwen Technical Report. arXiv:2309.16609.
  3. Beyer, L., Steiner, A., Pinto, A. S., Kolesnikov, A., Wang, X., Salz, D., Neumann, M., Alabdulmohsin, I., Tschannen, M., Bugliarello, E., et al. PaliGemma: A Versatile 3B VLM for Transfer. arXiv:2407.07726.
  4. Black, K., Brown, N., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Groom, L., Hausman, K., Ichter, B., et al. π0: A Vision-Language-Action Flow Model for General Robot Control. arXiv:2410.24164.
  5. Black, K., Galliker, M. Y., and Levine, S. Real-Time Execution of Action Chunking Flow Policies. arXiv:2506.07339.
  6. Bu, Q., Yang, Y., Cai, J., Gao, S., Ren, G., Yao, M., Luo, P., and Li, H. UniVLA: Learning to Act Anywhere with Task-Centric Latent Actions. arXiv:2505.06111.
  7. Chen, B., Zhang, T., Geng, H., Song, K., Zhang, C., Li, P., Freeman, W. T., Malik, J., Abbeel, P., Tedrake, R., et al. Large Video Planner Enables Generalizable Robot Control. arXiv:2512.15840.
  8. Dong, Z., Liu, Y., Li, Y., Zhao, H., and Hao, J. Conditioning Matters: Training Diffusion Policies Is Faster Than You Think. arXiv:2505.11123.
  9. Hodge, V. J., Paterson, C., and Habli, I. Out-of-Distribution Detection for Safety Assurance of AI and Autonomous Systems. arXiv:2510.21254.
  10. Huang, R., Balim, H., Yang, H., and Du, Y. Flexible Locomotion Learning with Diffusion Model Predictive Control. arXiv:2510.04234.
  11. Janner, M., Du, Y., Tenenbaum, J. B., and Levine, S. Planning with Diffusion for Flexible Behavior Synthesis. arXiv:2205.09991.
  12. Jiang, T., Yuan, T., Liu, Y., Lu, C., Cui, J., Liu, X., Cheng, S., Gao, J., Xu, H., and Zhao, H. Galaxea Open-World Dataset and G0 Dual-System VLA Model. arXiv:2509.00576.
  13. Kim, M. J., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., Rafailov, R., Foster, E., Lam, G., Sanketi, P., et al. OpenVLA: An Open-Source Vision-Language-Action Model. arXiv:2406.09246.
  14. Kim, M. J., Finn, C., and Liang, P. Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success. arXiv:2502.19645.
  15. Li, Y., Ma, X., Xu, J., Cui, Y., Cui, Z., Han, Z., Huang, L., Kong, T., Liu, Y., Niu, H., et al. GR-RL: Going Dexterous and Precise for Long-Horizon Robotic Manipulation. arXiv:2512.01801.
  16. Liang, J., Huang, W., Xia, F., Xu, P., Hausman, K., Ichter, B., Florence, P., and Zeng, A. Code as Policies: Language Model Programs for Embodied Control. arXiv:2209.07753.
  17. Lipman, Y., Chen, R. T., Ben-Hamu, H., Nickel, M., and Le, M. Flow Matching for Generative Modeling. arXiv:2210.02747.
  18. Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow.
  19. Liu, Y., Zhang, K., Li, Y., Yan, Z., Gao, C., Chen, R., Yuan, Z., Huang, Y., Sun, H., Gao, J., et al. Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models. arXiv:2402.17177.
  20. Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al. DINOv2: Learning Robust Visual Features without Supervision. arXiv:2304.07193.
  21. Pertsch, K., Stachowicz, K., Ichter, B., Driess, D., Nair, S., Vuong, Q., Mees, O., Finn, C., and Levine, S. FAST: Efficient Action Tokenization for Vision-Language-Action Models. arXiv:2501.09747.
  22. Shen, Y., Wei, F., Du, Z., Liang, Y., Lu, Y., Yang, J., Zheng, N., and Guo, B. VideoVLA: Video Generators Can Be Generalizable Robot Manipulators. arXiv:2512.06963.
  23. Shukor, M., Aubakirova, D., Capuano, F., Kooijmans, P., Palma, S., Zouitine, A., Aractingi, M., Pascal, C., Russi, M., Marafioti, A., et al. SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics. arXiv:2506.01844.
  24. Song, J., Meng, C., and Ermon, S. Denoising Diffusion Implicit Models. arXiv:2010.02502.
  25. Sun, Q., Jiang, Z., Zhao, H., and He, K. Is Noise Conditioning Necessary for Denoising Generative Models? arXiv:2502.13129.
  26. Tian, X., Gu, J., Li, B., Liu, Y., Wang, Y., Zhao, Z., Zhan, K., Jia, P., Lang, X., and Zhao, H. DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models. arXiv:2402.12289.
  27. Wang, R. and Du, Y. Equilibrium Matching: Generative Modeling with Implicit Energy-Based Models. arXiv:2510.02300.
  28. Wiedemer, T., Li, Y., Vicol, P., Gu, S. S., Matarese, N., Swersky, K., Kim, B., Jaini, P., and Geirhos, R. Video Models Are Zero-Shot Learners and Reasoners. arXiv:2509.20328.
  29. Yang, Z., Teng, J., Zheng, W., Ding, M., Huang, S., Xu, J., Yang, Y., Hong, W., Zhang, X., Feng, G., et al. CogVideoX: Text-to-Video Diffusion Models with an Expert Transformer. arXiv:2408.06072.
  30. Ze, Y., Zhang, G., Zhang, K., Hu, C., Wang, M., and Xu, H. 3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations. arXiv:2403.03954.
  31. Zhao, T. Z., Kumar, V., Levine, S., and Finn, C. Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware. arXiv:2304.13705.
  32. Zhou, G., Swaminathan, S., Raju, R. V., Guntupalli, J. S., Lehrach, W., Ortiz, J., Dedieu, A., Lázaro-Gredilla, M., and Murphy, K. Diffusion Model Predictive Control. arXiv:2410.05364.
  33. Internal anchor (appendix): "We use the same network backbone as a Continuous Rectified Flow policy (Liu et al., 2022), but differ in the learning objective and the test-time inference algorithm."
  34. Internal anchor (appendix, model architecture): the T5 hidden dimension is set by the pretrained configuration (t5_hidden_dim ← T5Config.from_pretrained(t5_model).d_model); the conditioner is instantiated as ViTAndT5VisionLanguageCondition with emb_dim = 768, freeze = True, To = 1, and n_views = 2.
  35. Internal anchor (Figure 6 caption): for Nut Assembly, GeCO uses an average of 3.50 NFEs and 166.7 ms total inference time per decision; for Tube Arrangement, 3.73 NFEs and 147.0 ms per decision.