Two Bridges, One Pathway: From VLMs to Generalizable VLAs with Embodied Trajectory-Coupled Data
Pith reviewed 2026-06-27 18:29 UTC · model grok-4.3
The pith
Embodied trajectory-coupled data bridges VLMs to generalizable VLAs through gradual three-stage adaptation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Vision-language models can be turned into generalizable vision-language-action policies by using embodied trajectory-coupled (ETC) data as a stepping stone that shares robot visual context while retaining language-understanding objectives. This enables a three-stage process of distribution bridging, objective bridging, and retentive adaptation. Mixing task-relevant out-of-distribution ETC data with limited action data then transfers VLM generalization into robust policies that succeed under novel conditions without additional demonstrations.
What carries the argument
Embodied trajectory-coupled (ETC) data: vision-language supervision derived from the same robot scenes and trajectories used for action learning.
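A rough illustration of what "derived from the same robot scenes and trajectories" could look like as data. The source does not specify the ETC supervision format, so every field name, the question/answer text, and the action layout below are hypothetical. The sketch only shows the one property the definition states: a vision-language sample and an action sample share a robot trajectory.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Frame:
    trajectory_id: str  # hypothetical identifier for one robot episode
    step: int           # index of the frame within the trajectory
    image_path: str     # hypothetical path to the camera image


@dataclass(frozen=True)
class ETCSample:
    """Vision-language supervision tied to a robot frame (format assumed)."""
    frame: Frame
    question: str
    answer: str


@dataclass(frozen=True)
class ActionSample:
    """Action supervision for a robot frame (action layout assumed)."""
    frame: Frame
    instruction: str
    action: Tuple[float, ...]


def coupled(etc: ETCSample, act: ActionSample) -> bool:
    """Return True when both samples come from the same trajectory."""
    return etc.frame.trajectory_id == act.frame.trajectory_id


# Made-up example: one frame yields both kinds of supervision.
frame = Frame("episode_0001", 12, "episode_0001/frame_012.png")
etc = ETCSample(frame, "Where is the red block relative to the gripper?",
                "To the left of the gripper.")
act = ActionSample(frame, "Pick up the red block.",
                   (0.01, -0.02, 0.0, 0.0, 0.0, 0.0, 1.0))
print(coupled(etc, act))
```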
If this is right
- The model generalizes to novel visual-language conditions using only small amounts of action data.
- Gradual bridging across distribution and objective gaps prevents degradation of VLM representations.
- Three distinct stages are needed: first adapting to embodied visuals, then shifting objectives, then specializing to deployment.
- Task-relevant out-of-distribution ETC data is effective for enabling generalization without new robot demonstrations.
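The staged recipe described above can be sketched as a small schedule. The stage names and order come from the summary; co-training on ETC and action data in the last two stages is stated in the paper's appendix excerpt. That the first stage trains on ETC data only, the linear ramp used to illustrate the "gradual" shift, and the way the two losses are combined are all assumptions made here for illustration, not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    uses_etc: bool     # vision-language supervision from robot trajectories
    uses_action: bool  # action-prediction supervision


# Stage order follows the text; the per-stage data flags are a reading of it.
RECIPE = (
    Stage("distribution_bridging", uses_etc=True, uses_action=False),
    Stage("objective_bridging", uses_etc=True, uses_action=True),
    Stage("retentive_adaptation", uses_etc=True, uses_action=True),
)


def action_weight(stage: Stage, progress: float) -> float:
    """Hypothetical weight on the action loss at `progress` in [0, 1]."""
    if not stage.uses_action:
        return 0.0
    if stage.name == "objective_bridging":
        # Assumed linear ramp, clamped to [0, 1], to illustrate a gradual shift.
        return min(max(progress, 0.0), 1.0)
    return 1.0


def total_loss(stage: Stage, progress: float,
               etc_loss: float, action_loss: float) -> float:
    """Combine the losses; the ETC term is kept in every stage (assumed)."""
    return etc_loss + action_weight(stage, progress) * action_loss


for stage in RECIPE:
    print(stage.name, action_weight(stage, 0.5))
```

Keeping the ETC term active in every stage mirrors the stated goal of preserving the acquired representations while the action objective is introduced.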
Where Pith is reading between the lines
- The same gradual bridging approach could lower the total volume of robot demonstrations required to deploy capable policies across varied tasks.
- Similar intermediate data forms might help adapt other pre-trained models when their output modality changes.
- Experiments that vary how closely the ETC data matches the deployment visuals would show how domain-specific the data must be.
Load-bearing premise
That ETC data shares visual context while retaining familiar language objectives and thus acts as a natural stepping stone that preserves rather than degrades VLM representations during the transition to action prediction.
What would settle it
Training a model with the three-stage process but without mixing task-relevant out-of-distribution ETC data and checking whether it still generalizes to novel visual-language conditions at the same rate as the mixed version.
Original abstract
Vision-language models (VLMs) are powerful general-purpose reasoners, yet converting them into robot control policies (VLAs) is surprisingly difficult. The root cause is a two-fold gap: VLMs are trained on internet-scale images with language-understanding objectives, while VLAs must perceive robot scenes and predict motor actions. Fine-tuning a VLM directly on robot action data forces the model to cross both gaps at once: the learning curve is steep and the rich generalizations learned during pretraining tend to degrade rather than transfer. We argue that this gap can be bridged gradually with the right intermediate data. We introduce embodied trajectory-coupled (ETC) data: vision-language supervision derived from the same robot scenes and trajectories used for action learning. Because ETC data shares the visual context of robot operation while retaining familiar language-understanding objectives, it provides a natural stepping stone between VLM pretraining and VLA fine-tuning. Building on this, we design a three-stage training recipe. Distribution Bridging first adapts the VLM to embodied visual-language semantics. Objective Bridging then gradually shifts the model toward action prediction while preserving the acquired representations. Retentive Adaptation finally specializes the policy to the target deployment domain. We further show that mixing task-relevant out-of-distribution ETC data with a small amount of action data enables the model to generalize to novel visual-language conditions without requiring additional robot demonstrations. Simulation and real-robot experiments confirm that this gradual bridging strategy is the key to transferring VLM generalization into robust, deployable robot policies.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that direct fine-tuning of VLMs on robot action data is hindered by simultaneous gaps in visual distribution and training objective, leading to degraded generalization. It introduces embodied trajectory-coupled (ETC) data—vision-language supervision from the same robot scenes and trajectories—as an intermediate bridge. A three-stage recipe is proposed: Distribution Bridging adapts the VLM to embodied visual-language semantics; Objective Bridging gradually shifts toward action prediction while preserving representations; Retentive Adaptation specializes to the target domain. Experiments show that mixing task-relevant out-of-distribution ETC data with limited action data enables generalization to novel visual-language conditions without extra robot demonstrations, confirmed in simulation and real-robot settings.
Significance. If the empirical results hold, the work provides a concrete, data-efficient recipe for transferring VLM generalization to deployable VLAs. The emphasis on gradual bridging via ETC data and the mixing strategy for OOD generalization addresses a practical bottleneck in embodied AI, potentially lowering the data requirements for robust robot policies. The staged approach is a strength if the preservation of pretrained representations is demonstrated.
major comments (2)
- [§3] §3 (three-stage recipe): the claim that Objective Bridging 'gradually shifts' the model while preserving representations requires explicit ablation showing that skipping this stage degrades performance relative to the full recipe; without such controls the necessity of the intermediate objective remains unproven.
- [Experiments] Experiments section (mixing results): the reported generalization to novel conditions relies on 'task-relevant out-of-distribution ETC data'; the definition and selection criteria for what counts as 'task-relevant' must be stated precisely, as overly broad selection could inflate the apparent benefit of mixing.
minor comments (2)
- [Abstract, §2] Abstract and §2: the term 'embodied trajectory-coupled (ETC) data' is introduced without a formal definition or example of the exact supervision format (e.g., caption style, trajectory encoding); a short illustrative example would improve clarity.
- [Figures] Figure captions (throughout): several figures comparing VLA variants lack error bars or statistical significance markers despite the text claiming 'robust' improvements; adding these would strengthen the presentation.
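One way to produce the error bars requested in the last comment, sketched under assumptions: the source does not say how success rates are measured, so the trial counts below are made up, and the Wilson score interval is a choice made here rather than something the paper or the referee names. It is a standard interval for a binomial proportion and behaves sensibly at the small trial counts typical of robot evaluations.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int,
                    z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a success rate (z = 1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie in [0, trials]")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2.0 * trials)) / denom
    half = z * math.sqrt(p * (1.0 - p) / trials
                         + z * z / (4.0 * trials * trials)) / denom
    return center - half, center + half


# Hypothetical evaluation: 18 successes out of 20 rollouts.
low, high = wilson_interval(18, 20)
print(f"success rate 0.90, interval [{low:.3f}, {high:.3f}]")
```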
Simulated Author's Rebuttal
We thank the referee for the positive assessment, recognition of the practical value of our approach, and the recommendation for minor revision. We address each major comment below and will update the manuscript accordingly.
- Referee: [§3] §3 (three-stage recipe): the claim that Objective Bridging 'gradually shifts' the model while preserving representations requires explicit ablation showing that skipping this stage degrades performance relative to the full recipe; without such controls the necessity of the intermediate objective remains unproven.
  Authors: We agree that an explicit ablation is required to substantiate the necessity of the Objective Bridging stage. The revised manuscript will include a new ablation comparing the full three-stage recipe against a two-stage variant that omits Objective Bridging, reporting the resulting degradation in both in-distribution performance and out-of-distribution generalization to confirm that gradual objective shifting is essential for representation preservation.
  Revision: yes
- Referee: [Experiments] Experiments section (mixing results): the reported generalization to novel conditions relies on 'task-relevant out-of-distribution ETC data'; the definition and selection criteria for what counts as 'task-relevant' must be stated precisely, as overly broad selection could inflate the apparent benefit of mixing.
  Authors: We acknowledge the need for a precise definition. In the revised Experiments section we will add an explicit subsection defining task-relevance via two quantitative criteria: (1) semantic overlap measured by cosine similarity of language embeddings between ETC trajectories and target tasks, and (2) visual distribution overlap computed via feature-space distance to the in-distribution robot scenes. This will make the selection process transparent and rule out overly broad inclusion.
  Revision: yes
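The two criteria named in the response (cosine similarity of language embeddings, and feature-space distance to in-distribution scenes) can be sketched as a simple filter. This is a minimal illustration, not the authors' procedure: the thresholds, the use of Euclidean distance to a single centroid, and the rule that both criteria must pass are all assumptions, and the embeddings are plain lists standing in for whatever encoder is used.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_task_relevant(etc_lang_emb: Sequence[float],
                     task_lang_emb: Sequence[float],
                     etc_visual_feat: Sequence[float],
                     in_dist_centroid: Sequence[float],
                     min_similarity: float = 0.7,   # hypothetical threshold
                     max_distance: float = 1.0      # hypothetical threshold
                     ) -> bool:
    """Keep an ETC trajectory only if it passes both assumed criteria."""
    semantic_ok = cosine_similarity(etc_lang_emb, task_lang_emb) >= min_similarity
    # math.dist computes the Euclidean distance between two points.
    visual_ok = math.dist(etc_visual_feat, in_dist_centroid) <= max_distance
    return semantic_ok and visual_ok


# Made-up vectors: similar instruction, visually close scene.
print(is_task_relevant([1.0, 0.0], [0.9, 0.1], [0.0, 0.0], [0.5, 0.0]))
```

Fixing both thresholds before looking at downstream results is what would address the referee's concern about overly broad selection.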
Circularity Check
No significant circularity; empirical method with no derivations
The paper describes an empirical three-stage training recipe (Distribution Bridging, Objective Bridging, Retentive Adaptation) using embodied trajectory-coupled (ETC) data to transition from VLMs to VLAs. No equations, first-principles derivations, fitted parameters, or mathematical reductions are present or claimed. The central claim rests on experimental results from simulation and real-robot tests showing generalization via mixing ETC and action data, without any self-definitional loops, fitted inputs called predictions, or load-bearing self-citations. The approach is a practical training strategy whose support comes from external experiments rather than from any internal tautology.
Appendix excerpts (recovered from the page extraction; truncated where marked)
A Implementation Details
A.1 Co-training Strategy
During both Objective Bridging and Retentive Adaptation, we co-train on ETC and action data. Each optimization ...
[...] The continuous score is exp(−RMSE/τ) with τ = 20 px; the example is correct if RMSE ≤ 20 px.
D.1.6 COCO Joint Detection F1
For COCO, we use a dataset-specific metric that jointly requires category-label agreement and BBox overlap. Class names are normalized to the 80 COCO categories (e.g., “bike” → “bicycle”). Predictions and ground-truth objects are m...
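The pixel-error score quoted above, exp(−RMSE/τ) with τ = 20 px and a correctness threshold of RMSE ≤ 20 px, can be sketched as follows. Only the formula, τ, and the threshold come from the excerpt. How RMSE is computed is not shown there, so the per-point Euclidean error over 2D pixel coordinates used here is an assumption, as are all function names.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]

TAU_PX = 20.0        # tau from the excerpt
THRESHOLD_PX = 20.0  # correctness threshold from the excerpt


def rmse_px(pred: Sequence[Point], target: Sequence[Point]) -> float:
    """Root-mean-square Euclidean error over matched 2D points (assumed form)."""
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and equal length")
    squared = [(px - tx) ** 2 + (py - ty) ** 2
               for (px, py), (tx, ty) in zip(pred, target)]
    return math.sqrt(sum(squared) / len(squared))


def continuous_score(rmse: float, tau: float = TAU_PX) -> float:
    """exp(-RMSE / tau): 1.0 for a perfect prediction, decaying with error."""
    return math.exp(-rmse / tau)


def is_correct(rmse: float, threshold: float = THRESHOLD_PX) -> bool:
    """An example counts as correct when RMSE is at most the threshold."""
    return rmse <= threshold


# Made-up prediction: one point exact, one point off by (3, 4) pixels.
error = rmse_px([(0.0, 0.0), (3.0, 4.0)], [(0.0, 0.0), (0.0, 0.0)])
print(error, continuous_score(error), is_correct(error))
```

Because τ equals the threshold, an example sitting exactly on the correctness boundary has a continuous score of exp(−1).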