Invertible Neural Network Adapter for One-Step Flow Matching in Robot Manipulation

Feng Zheng; Kangyi Ji; Long Cheng; Rongtao Xu; Yongxiang Zou; Yu Zhang

arxiv: 2606.19194 · v1 · pith:IPC4N2BPnew · submitted 2026-06-17 · 💻 cs.RO

Invertible Neural Network Adapter for One-Step Flow Matching in Robot Manipulation

Yu Zhang , Kangyi Ji , Yongxiang Zou , Rongtao Xu , Feng Zheng , Long Cheng This is my paper

Pith reviewed 2026-06-26 20:28 UTC · model grok-4.3

classification 💻 cs.RO

keywords invertible neural networkflow matchingrobot manipulationone-step denoisingvision-language-actiondexterous actionsinference latency

0 comments

The pith

An invertible neural network adapter constrains flow-matching trajectories to enable precise one-step robot action generation from multimodal inputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops an adapter that embeds a flow-matching policy inside an invertible latent space so that high-dimensional dexterous actions can be produced from visual, language, and proprioceptive observations in a single denoising step. Conventional iterative flow-matching policies require repeated refinement to reach usable accuracy; the adapter removes that requirement by construction. Experiments across simulation suites and real robot platforms show that task performance stays comparable to or better than multi-step baselines while average inference latency drops from 110 ms to 61 ms. The central claim is therefore that invertibility supplies a sufficient constraint to replace iteration with a single forward pass without loss of precision or stability.

Core claim

Built upon a flow-matching formulation, the proposed adapter effectively constrains the action generation trajectory within an invertible latent space, thereby enabling efficient and high-quality dexterous action synthesis with only a single inference step.

What carries the argument

Invertible neural network adapter that maps multimodal observations into a reversible latent representation for single-pass flow matching.

If this is right

Inference complexity is substantially reduced relative to conventional iterative flow-matching policies.
Action prediction accuracy and stability remain strong across diverse manipulation tasks.
Simulation benchmarks show consistent superior or near state-of-the-art performance.
Real-world VLA models obtain a measured reduction in average inference latency from 110 ms to 61 ms.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same invertibility constraint could be applied to other high-dimensional conditional generation problems that currently rely on multi-step diffusion or flow models.
Lower per-step latency may allow closed-loop control rates high enough for contact-rich or fast-moving manipulation without specialized hardware.
Because the adapter is described as general, it could be attached to existing pretrained VLA backbones rather than requiring full retraining.

Load-bearing premise

Constraining trajectories inside an invertible latent space is enough to let flow matching recover precise high-dimensional actions without any iterative refinement.

What would settle it

On the same manipulation benchmarks, measure whether single-step outputs from the adapter exhibit materially higher action error or task failure rates than the iterative flow-matching baseline run to convergence.

Figures

Figures reproduced from arXiv: 2606.19194 by Feng Zheng, Kangyi Ji, Long Cheng, Rongtao Xu, Yongxiang Zou, Yu Zhang.

**Figure 2.** Figure 2: Chosen tasks on the RoboTwin benchmark rather than sampled from Gaussian noise; second, timestep embeddings are omitted during training and inference. These modifications follow the empirical design choices for improving stability and performance in the 3D setting. The compared algorithms are set consistent with those in the paper ManiFlow [16]. All data are trained using one 4090 GPU, and the quantitative… view at source ↗

**Figure 3.** Figure 3: Real-world robotic manipulation tasks Overall, the method achieves a favorable trade-off between computational efficiency and policy quality, reducing both training and inference overhead while delivering superior or comparable results. 3.2 Real World Experiments To further validate the effectiveness of the proposed invertible adapter in real-world robotic manipulation, three representative tasks are eva… view at source ↗

read the original abstract

This paper presents an invertible neural network adapter for general robotic manipulation, designed to generate precise high-dimensional actions conditioned on multimodal observations, including visual, linguistic, and proprioceptive inputs, through a one-step denoising process. Built upon a flow-matching formulation, the proposed adapter effectively constrains the action generation trajectory within an invertible latent space, thereby enabling efficient and high-quality dexterous action synthesis with only a single inference step. Compared with conventional iterative flow-matching policies, the proposed framework substantially reduces inference complexity while maintaining strong action prediction accuracy and stability. Extensive experiments are conducted across a diverse set of simulation benchmarks and real-world robotic platforms to evaluate the effectiveness of the proposed method. Across simulation benchmarks, the proposed adapter consistently demonstrates superior or near state-of-the-art performance on a wide range of manipulation tasks. Furthermore, real-world experiments reveal a significant improvement in inference efficiency for vision-language-action (VLA) models, reducing the average inference latency from 110 ms to 61 ms while maintaining strong task performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper adds a practical invertible adapter that cuts flow-matching inference to one step for robot actions, with reported latency drop from 110 ms to 61 ms, but the reason one step suffices in high dimensions rests on an unproven assumption about the velocity field.

read the letter

The core idea is an invertible neural adapter placed around a flow-matching policy so that action generation for manipulation tasks happens in a single step. The authors test it on simulation benchmarks and real robots, showing it matches or beats prior iterative versions on task success while halving average latency for vision-language-action models.

The experiments appear to cover a useful range of tasks and platforms, and the latency numbers are concrete. That part is the most immediately usable result for anyone deploying these policies.

The weak point is the central claim that mapping into an invertible latent space automatically lets the flow ODE integrate accurately in one step. Invertibility gives a bijection but says nothing about the Lipschitz constant or curvature of the learned velocity field; high-dimensional action spaces (20–100 dims) usually need several Euler or Heun steps to keep error low. The abstract gives no explicit form for the velocity network, no integration-error bound, and no ablation that isolates why one step works here. If the full paper supplies those details or shows the adapter actually flattens the field, the claim strengthens; otherwise it remains an empirical observation without clear mechanism.

This is for robotics groups already running flow-matching or diffusion policies who need faster inference. It is worth sending to review because the speed gain is measurable and the adapter is a concrete engineering move, even though the theoretical justification for single-step accuracy needs closer inspection.

Referee Report

2 major / 2 minor

Summary. The paper proposes an invertible neural network adapter for flow-matching policies in robotic manipulation. Conditioned on multimodal inputs, the adapter maps actions into an invertible latent space so that the flow-matching ODE can be integrated accurately in a single step rather than iteratively, yielding high-dimensional dexterous actions with substantially lower inference latency (110 ms to 61 ms) while preserving task performance; the method is evaluated on simulation benchmarks and real-world platforms.

Significance. If the one-step claim is rigorously supported, the adapter would offer a practical route to real-time flow-matching policies for high-DoF manipulation and VLA models, where iterative integration has been a bottleneck. The reported latency halving without apparent loss of accuracy would be a notable engineering contribution if backed by velocity-field analysis and controlled experiments.

major comments (2)

[Abstract / §3] Abstract / §3 (method): the central assertion that constraining the trajectory to an invertible latent space 'enables efficient and high-quality dexterous action synthesis with only a single inference step' is not accompanied by the explicit form of the velocity network, any bound on its Lipschitz constant or curvature, or an integration-error analysis showing why one Euler/Heun step suffices in 20–100-dimensional action spaces. Invertibility alone supplies bijectivity but no guarantee on the numerical properties required for single-step accuracy.
[Experiments] Experiments section: the abstract states 'superior or near state-of-the-art performance' and 'strong task performance' across simulation and real-world tasks, yet supplies no baseline descriptions, metric definitions, trial counts, variance estimates, or ablation isolating the effect of the one-step adapter versus the invertible mapping itself. Without these, the performance claims cannot be assessed.

minor comments (2)

[§3] Notation for the adapter and latent-space mapping should be introduced with explicit equations early in the method section to allow readers to verify the claimed invertibility.
[Real-world experiments] The real-world latency numbers (110 ms to 61 ms) should specify the hardware, batch size, and whether the measurement includes observation encoding or only the flow step.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and commit to revisions that strengthen the theoretical justification and experimental reporting without altering the core claims.

read point-by-point responses

Referee: [Abstract / §3] Abstract / §3 (method): the central assertion that constraining the trajectory to an invertible latent space 'enables efficient and high-quality dexterous action synthesis with only a single inference step' is not accompanied by the explicit form of the velocity network, any bound on its Lipschitz constant or curvature, or an integration-error analysis showing why one Euler/Heun step suffices in 20–100-dimensional action spaces. Invertibility alone supplies bijectivity but no guarantee on the numerical properties required for single-step accuracy.

Authors: We agree that the manuscript would benefit from a more explicit theoretical treatment. In the revision we will add the explicit form of the velocity network (currently implicit in §3) together with a short integration-error analysis. The analysis will show that the invertible adapter maps the action trajectory into a latent space whose velocity field has reduced curvature relative to the original action space, thereby keeping the local truncation error of a single Euler/Heun step below the tolerance required for the reported task performance. We will also state the Lipschitz bound that follows from the architecture of the adapter. revision: yes
Referee: [Experiments] Experiments section: the abstract states 'superior or near state-of-the-art performance' and 'strong task performance' across simulation and real-world tasks, yet supplies no baseline descriptions, metric definitions, trial counts, variance estimates, or ablation isolating the effect of the one-step adapter versus the invertible mapping itself. Without these, the performance claims cannot be assessed.

Authors: The referee correctly identifies gaps in experimental documentation. The revised manuscript will expand the Experiments section to include: (i) explicit descriptions and citations for all baselines, (ii) precise definitions of every reported metric, (iii) the number of independent trials and random seeds used, (iv) standard-deviation or confidence-interval estimates, and (v) an ablation that decouples the contribution of the one-step inference schedule from the invertible mapping itself. These additions will be placed in the main text and supplementary material as appropriate. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on experimental evaluation

full rationale

The provided abstract and description contain no derivation chain, equations, or self-referential definitions that reduce a claimed result to its own inputs by construction. The central claim (one-step flow matching via invertible adapter) is presented as an empirical outcome validated across simulation and real-world benchmarks, with latency and accuracy numbers reported as measured results rather than tautological predictions. No self-citation load-bearing steps, fitted inputs renamed as predictions, or ansatz smuggling are identifiable in the given text. This is the expected non-finding for a methods paper whose primary support is external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

With only the abstract available, no specific free parameters, axioms, or invented entities can be identified from the text.

pith-pipeline@v0.9.1-grok · 5710 in / 1209 out tokens · 33051 ms · 2026-06-26T20:28:51.579629+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

24 extracted references · 12 linked inside Pith

[1]

H. Wang, W . Zhao, X. Wang, S. Huang, H. Lin, B. Zheng, R. Xu, G. Wang, Y . Mu, H. Wang, et al. Dexjoco: A benchmark and toolkit for task-oriented de xterous manipulation on mujoco. arXiv preprint arXiv:2605.16257 , 2026

Pith/arXiv arXiv 2026
[2]

Y . Sun, M. Cao, P . Y ang, R. Xu, Y . Y an, R. Xu, L. Ma, R. Gan, A. Zhai, Q. Chen, et al. Maniparena: Comprehensive real-world evaluation of reaso ning-oriented generalist robot ma- nipulation. arXiv preprint arXiv:2603.28545 , 2026. 8

arXiv 2026
[3]

Zhang, K

J. Zhang, K. Wang, R. Xu, G. Zhou, Y . Hong, X. Fang, Q. Wu, Z. Zhang, and W . He. Navid: Video-based vlm plans the next step for vision-and-language navigation. arXiv preprint arXiv:2402.15852, 2024

Pith/arXiv arXiv 2024
[4]

X. Han, S. Chen, Z. Fu, Z. Feng, L. Fan, D. An, C. Wang, L. Guo , W . Meng, X. Zhang, et al. Multimodal fusion and vision-language models: A surv ey for robot vision. Information Fusion, page 103652, 2025

2025
[5]

C. Chi, Z. Xu, S. Feng, E. Cousineau, Y . Du, B. Burchﬁel, R. Tedrake, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

2025
[6]

Z. Hu, S. Zhou, Q. Zhang, R. Xu, Q. Su, and C.-J. Liang. Anys lot: Goal-conditioned vision- language-action policies for zero-shot slot-level placem ent. arXiv preprint arXiv:2604.10432 , 2026

Pith/arXiv arXiv 2026
[7]

Zhang, J

K. Zhang, J. Zhang, R. Xu, Y . Sun, S. Xue, Y . Wen, X. Guo, M. Guo, W . Liufu, L. Zihou, et al. A1: A fully transparent open-source, adaptive and efﬁcient truncated vision-language-action model. arXiv preprint arXiv:2604.05672 , 2026

Pith/arXiv arXiv 2026
[8]

R. Xu, J. Zhang, M. Guo, Y . Wen, H. Y ang, M. Lin, J. Huang, Z. Li, K. Zhang, L. Wang, et al. A0: An affordance-aware hierarchical model for general rob otic manipulation. In Proceedings of the IEEE/CVF International Conference on Computer Visio n, pages 13491–13501, 2025

2025
[9]

L. Ma, J. Wen, M. Lin, R. Xu, X. Liang, B. Lin, J. Ma, Y . Wang, Z. Wei, H. Lin, et al. Phyblock: A progressive benchmark for physical understand ing and planning via 3d block assembly. arXiv preprint arXiv:2506.08708 , 2025

arXiv 2025
[10]

Zhang, R

K. Zhang, R. Xu, P . Ren, J. Lin, H. Wu, L. Lin, and X. Liang. Robridge: A hierarchical architecture bridging cognition and execution for general robotic manipulation. arXiv preprint arXiv:2505.01709, 2025

arXiv 2025
[11]

Lipman, R

Y . Lipman, R. T. Chen, H. Ben-Hamu, M. Nickel, and M. Le. F low matching for generative modeling. arXiv preprint arXiv:2210.02747 , 2022

Pith/arXiv arXiv 2022
[12]

Chisari, N

E. Chisari, N. Heppert, M. Argus, T. Welschehold, T. Bro x, and A. V alada. Learning robotic manipulation policies from point clouds with conditional ﬂ ow matching. arXiv preprint arXiv:2409.07343, 2024

arXiv 2024
[13]

Black, N

K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn , N. Fusai, L. Groom, K. Hausman, B. Ichter, et al. pi0: A vision-language-action ﬂow model fo r general robot control. arXiv preprint arXiv:2410.24164, 2024

Pith/arXiv arXiv 2024
[14]

Braun, N

M. Braun, N. Jaquier, L. Rozo, and T. Asfour. Riemannian ﬂow matching policy for robot mo- tion learning. In 2024 IEEE/RSJ International Conference on Intelligent Rob ots and Systems (IROS), pages 5144–5151. IEEE, 2024

2024
[15]

Zhang and M

F. Zhang and M. Gienger. Affordance-based robot manipu lation with ﬂow matching. arXiv preprint arXiv:2409.01083, 2024

arXiv 2024
[16]

G. Y an, J. Zhu, Y . Deng, S. Y ang, R.-Z. Qiu, X. Cheng, M. Me mmel, R. Krishna, A. Goyal, X. Wang, and D. Fox. ManiFlow: A general robot manipulation p olicy via consistency ﬂow training. In Conference on Robot Learning (CoRL) , 2025

2025
[17]

Z. Geng, M. Deng, X. Bai, J. Z. Kolter, and K. He. Mean ﬂows for one-step generative modeling. arXiv preprint arXiv:2505.13447 , 2025

Pith/arXiv arXiv 2025
[18]

X. Liu, C. Gong, and Q. Liu. Flow straight and fast: Learn ing to generate and transfer data with rectiﬁed ﬂow. 2023. URL https://openreview.net/forum?id=XVjTT1nw5z. 9

2023
[19]

L. Dinh, D. Krueger, and Y . Bengio. Nice: Non-linear ind ependent components estimation. arXiv preprint arXiv:1410.8516 , 2014

Pith/arXiv arXiv 2014
[20]

T. Y u, D. Quillen, Z. He, R. C. Julian, K. Hausman, C. Finn , and S. Levine. Meta-world: A benchmark and evaluation for multi -task and meta reinforcement learning. In Conference on Robot Learning , 2019. URL https://api.semanticscholar.org/CorpusID:204852201

2019
[21]

B. Liu, Y . Zhu, C. Gao, Y . Feng, Q. Liu, Y . Zhu, and P . Stone. Libero: Benchmarking knowl- edge transfer for lifelong robot learning. arXiv preprint arXiv:2306.03310 , 2023

Pith/arXiv arXiv 2023
[22]

S. Bai, Y . Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng , W . Ding, C. Gao, C. Ge, W . Ge, Z. Guo, Q. Huang, J. Huang, F. Huang, B. Hui, S. Jiang, Z. Li, M. Li, M. Li, K. Li, Z. Lin, J. Lin, X. Liu, J. Liu, C. Liu, Y . Liu, D. Liu, S. Liu, D. Lu, R. Lu o, C. Lv, R. Men, L. Meng, X. Ren, X. Ren, S. Song, Y . Sun, J. Tang, J. Tu, J. Wan, P . Wang, P. ...

Pith/arXiv arXiv 2025
[23]

Community

S. Community. Starvla: A lego-like codebase for vision -language-action model developing. arXiv preprint arXiv:2604.05014 , 2026

Pith/arXiv arXiv 2026
[24]

Physical Intelligence, Black, N

K. Physical Intelligence, Black, N. Brown, D. James, D. Karan, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, et al. pi0. 5: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054 , 2025. 10

Pith/arXiv arXiv 2025

[1] [1]

H. Wang, W . Zhao, X. Wang, S. Huang, H. Lin, B. Zheng, R. Xu, G. Wang, Y . Mu, H. Wang, et al. Dexjoco: A benchmark and toolkit for task-oriented de xterous manipulation on mujoco. arXiv preprint arXiv:2605.16257 , 2026

Pith/arXiv arXiv 2026

[2] [2]

Y . Sun, M. Cao, P . Y ang, R. Xu, Y . Y an, R. Xu, L. Ma, R. Gan, A. Zhai, Q. Chen, et al. Maniparena: Comprehensive real-world evaluation of reaso ning-oriented generalist robot ma- nipulation. arXiv preprint arXiv:2603.28545 , 2026. 8

arXiv 2026

[3] [3]

Zhang, K

J. Zhang, K. Wang, R. Xu, G. Zhou, Y . Hong, X. Fang, Q. Wu, Z. Zhang, and W . He. Navid: Video-based vlm plans the next step for vision-and-language navigation. arXiv preprint arXiv:2402.15852, 2024

Pith/arXiv arXiv 2024

[4] [4]

X. Han, S. Chen, Z. Fu, Z. Feng, L. Fan, D. An, C. Wang, L. Guo , W . Meng, X. Zhang, et al. Multimodal fusion and vision-language models: A surv ey for robot vision. Information Fusion, page 103652, 2025

2025

[5] [5]

C. Chi, Z. Xu, S. Feng, E. Cousineau, Y . Du, B. Burchﬁel, R. Tedrake, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

2025

[6] [6]

Z. Hu, S. Zhou, Q. Zhang, R. Xu, Q. Su, and C.-J. Liang. Anys lot: Goal-conditioned vision- language-action policies for zero-shot slot-level placem ent. arXiv preprint arXiv:2604.10432 , 2026

Pith/arXiv arXiv 2026

[7] [7]

Zhang, J

K. Zhang, J. Zhang, R. Xu, Y . Sun, S. Xue, Y . Wen, X. Guo, M. Guo, W . Liufu, L. Zihou, et al. A1: A fully transparent open-source, adaptive and efﬁcient truncated vision-language-action model. arXiv preprint arXiv:2604.05672 , 2026

Pith/arXiv arXiv 2026

[8] [8]

R. Xu, J. Zhang, M. Guo, Y . Wen, H. Y ang, M. Lin, J. Huang, Z. Li, K. Zhang, L. Wang, et al. A0: An affordance-aware hierarchical model for general rob otic manipulation. In Proceedings of the IEEE/CVF International Conference on Computer Visio n, pages 13491–13501, 2025

2025

[9] [9]

L. Ma, J. Wen, M. Lin, R. Xu, X. Liang, B. Lin, J. Ma, Y . Wang, Z. Wei, H. Lin, et al. Phyblock: A progressive benchmark for physical understand ing and planning via 3d block assembly. arXiv preprint arXiv:2506.08708 , 2025

arXiv 2025

[10] [10]

Zhang, R

K. Zhang, R. Xu, P . Ren, J. Lin, H. Wu, L. Lin, and X. Liang. Robridge: A hierarchical architecture bridging cognition and execution for general robotic manipulation. arXiv preprint arXiv:2505.01709, 2025

arXiv 2025

[11] [11]

Lipman, R

Y . Lipman, R. T. Chen, H. Ben-Hamu, M. Nickel, and M. Le. F low matching for generative modeling. arXiv preprint arXiv:2210.02747 , 2022

Pith/arXiv arXiv 2022

[12] [12]

Chisari, N

E. Chisari, N. Heppert, M. Argus, T. Welschehold, T. Bro x, and A. V alada. Learning robotic manipulation policies from point clouds with conditional ﬂ ow matching. arXiv preprint arXiv:2409.07343, 2024

arXiv 2024

[13] [13]

Black, N

K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn , N. Fusai, L. Groom, K. Hausman, B. Ichter, et al. pi0: A vision-language-action ﬂow model fo r general robot control. arXiv preprint arXiv:2410.24164, 2024

Pith/arXiv arXiv 2024

[14] [14]

Braun, N

M. Braun, N. Jaquier, L. Rozo, and T. Asfour. Riemannian ﬂow matching policy for robot mo- tion learning. In 2024 IEEE/RSJ International Conference on Intelligent Rob ots and Systems (IROS), pages 5144–5151. IEEE, 2024

2024

[15] [15]

Zhang and M

F. Zhang and M. Gienger. Affordance-based robot manipu lation with ﬂow matching. arXiv preprint arXiv:2409.01083, 2024

arXiv 2024

[16] [16]

G. Y an, J. Zhu, Y . Deng, S. Y ang, R.-Z. Qiu, X. Cheng, M. Me mmel, R. Krishna, A. Goyal, X. Wang, and D. Fox. ManiFlow: A general robot manipulation p olicy via consistency ﬂow training. In Conference on Robot Learning (CoRL) , 2025

2025

[17] [17]

Z. Geng, M. Deng, X. Bai, J. Z. Kolter, and K. He. Mean ﬂows for one-step generative modeling. arXiv preprint arXiv:2505.13447 , 2025

Pith/arXiv arXiv 2025

[18] [18]

X. Liu, C. Gong, and Q. Liu. Flow straight and fast: Learn ing to generate and transfer data with rectiﬁed ﬂow. 2023. URL https://openreview.net/forum?id=XVjTT1nw5z. 9

2023

[19] [19]

L. Dinh, D. Krueger, and Y . Bengio. Nice: Non-linear ind ependent components estimation. arXiv preprint arXiv:1410.8516 , 2014

Pith/arXiv arXiv 2014

[20] [20]

T. Y u, D. Quillen, Z. He, R. C. Julian, K. Hausman, C. Finn , and S. Levine. Meta-world: A benchmark and evaluation for multi -task and meta reinforcement learning. In Conference on Robot Learning , 2019. URL https://api.semanticscholar.org/CorpusID:204852201

2019

[21] [21]

B. Liu, Y . Zhu, C. Gao, Y . Feng, Q. Liu, Y . Zhu, and P . Stone. Libero: Benchmarking knowl- edge transfer for lifelong robot learning. arXiv preprint arXiv:2306.03310 , 2023

Pith/arXiv arXiv 2023

[22] [22]

S. Bai, Y . Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng , W . Ding, C. Gao, C. Ge, W . Ge, Z. Guo, Q. Huang, J. Huang, F. Huang, B. Hui, S. Jiang, Z. Li, M. Li, M. Li, K. Li, Z. Lin, J. Lin, X. Liu, J. Liu, C. Liu, Y . Liu, D. Liu, S. Liu, D. Lu, R. Lu o, C. Lv, R. Men, L. Meng, X. Ren, X. Ren, S. Song, Y . Sun, J. Tang, J. Tu, J. Wan, P . Wang, P. ...

Pith/arXiv arXiv 2025

[23] [23]

Community

S. Community. Starvla: A lego-like codebase for vision -language-action model developing. arXiv preprint arXiv:2604.05014 , 2026

Pith/arXiv arXiv 2026

[24] [24]

Physical Intelligence, Black, N

K. Physical Intelligence, Black, N. Brown, D. James, D. Karan, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, et al. pi0. 5: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054 , 2025. 10

Pith/arXiv arXiv 2025