
arxiv: 2605.30011 · v1 · pith:AVE6AXUSnew · submitted 2026-05-28 · 💻 cs.CV · cs.AI

VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies

Pith reviewed 2026-06-29 08:10 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords visual intermediate reasoning · vision-language-action · low-latency policies · visual evidence tokens · selective routing · embodied control · VisualEvidence-Set

The pith

VisualThink-VLA replaces textual chain-of-thought with compact visual evidence tokens and selective routing to reach top VLA success rates at sub-second latency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that textual intermediate reasoning harms vision-language-action policies because irrelevant text interferes with action prediction and autoregressive decoding creates multi-second delays unsuitable for real-time robot control. VISUALTHINK-VLA instead guides actions via a compact visual-evidence interface that keeps spatial precision without decoding overhead, plus a selective routing mechanism that learns which visual tokens to use. The approach also supplies the VisualEvidence-Kit and a 754.7k-instruction VisualEvidence-Set to train and audit the routing. On benchmarks including BridgeData V2 and real-robot tests, the method matches or exceeds prior success rates while cutting step latency from 8.377 seconds to 0.367 seconds.

Core claim

VISUALTHINK-VLA bootstraps action prediction through a compact visual-evidence interface that preserves spatial precision while avoiding decoding overhead; it adopts a tailored selective routing mechanism to learn the visual evidence tokens, enabling low-latency inference while preserving high-capacity specialization; and it supplies the VisualEvidence-Kit, centered on a VisualEvidence-Agent that builds the 754.7k-instruction VisualEvidence-Set for route supervision and counterfactual tests.

What carries the argument

Compact visual-evidence interface plus selective routing mechanism, supervised by the VisualEvidence-Set.
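The summary above does not spell out how the selective routing is built, so the following is only a minimal sketch of generic top-k gating in the spirit of sparsely gated mixture-of-experts layers, not the authors' mechanism. The names `route_top_k` and `mix`, the scalar "token", and the toy experts are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of gate scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(gate_scores, k):
    """Return {expert_index: weight} for the k highest-scoring experts.

    Weights are renormalized so the selected experts sum to 1.
    Hypothetical helper: the paper's router may differ.
    """
    probs = softmax(gate_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}


def mix(token, experts, gate_scores, k=1):
    """Run only the selected experts and combine their outputs."""
    weights = route_top_k(gate_scores, k)
    return sum(w * experts[i](token) for i, w in weights.items())


# Toy experts standing in for specialized sub-networks (illustrative only).
experts = [lambda x: x, lambda x: 2 * x, lambda x: 3 * x]
print(mix(3.0, experts, gate_scores=[0.0, 2.0, 1.0], k=1))  # only expert 1 runs
```

Only the selected experts are evaluated per token, which is how sparse routing of this kind keeps total capacity high while bounding per-step compute.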

If this is right

  • Highest success rates on most evaluated benchmarks and real-robot settings.
  • Step latency reduced to the sub-second regime, for example 22.8 times faster on BridgeData V2.
  • Enables real-time closed-loop execution that textual chain-of-thought cannot support.
  • VisualEvidence-Kit provides reusable supervision and audit data for similar routing methods.
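The latency figures quoted on this page can be checked with a few lines of arithmetic. The sketch below recomputes the speedup from the two reported step latencies and converts each to an approximate control rate; the rate conversion assumes one action per policy call with no other overhead, which is a simplification introduced here and not a claim from the paper.

```python
# Step latencies reported for BridgeData V2 (seconds per policy step).
ECOT_STEP_LATENCY_S = 8.377
VISUALTHINK_STEP_LATENCY_S = 0.367


def speedup(baseline_s, new_s):
    """Ratio of baseline latency to new latency."""
    return baseline_s / new_s


def control_rate_hz(step_latency_s):
    """Steps per second if each step costs exactly one policy call (assumption)."""
    return 1.0 / step_latency_s


print(f"speedup: {speedup(ECOT_STEP_LATENCY_S, VISUALTHINK_STEP_LATENCY_S):.1f}x")
# speedup: 22.8x
print(f"ECoT rate: {control_rate_hz(ECOT_STEP_LATENCY_S):.2f} Hz")
# ECoT rate: 0.12 Hz
print(f"VisualThink-VLA rate: {control_rate_hz(VISUALTHINK_STEP_LATENCY_S):.2f} Hz")
# VisualThink-VLA rate: 2.72 Hz
```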

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same visual-token routing could reduce latency in other spatial control tasks where text adds noise.
  • If the VisualEvidence-Set construction generalizes, similar agent-based supervision might speed up training for non-VLA multimodal policies.
  • Selective routing over visual evidence may offer a template for balancing model capacity and speed in any setting where full decoding is costly.

Load-bearing premise

A compact visual-evidence interface plus selective routing can keep high-capacity specialization without the interference or latency of textual reasoning, and the VisualEvidence-Set supplies faithful supervision for route learning.

What would settle it

On a held-out VLA benchmark or robot task, VISUALTHINK-VLA shows lower success rates than strong textual-reasoning baselines, or its step latency stays above one second at matched accuracy.

Original abstract

Recent work has begun to equip vision-language-action (VLA) policies with explicit intermediate reasoning. In embodied control, however, textual chain-of-thought is a poor fit: irrelevant or weakly textual information can interfere with action prediction, while autoregressive text decoding adds too much latency for real-time closed-loop execution. We present VISUALTHINK-VLA, a visual intermediate-reasoning framework for accurate, low-latency VLA policies. Our bootstrapping philosophy is to guide action with effective visual thinking: VISUALTHINK-VLA bootstraps action prediction through a compact visual-evidence interface that preserves spatial precision while avoiding decoding overhead. Besides, to further improve performance and efficiency, VISUALTHINK-VLA adopts a tailored selective routing mechanism to learn the visual evidence tokens, enabling low-latency inference while preserving high-capacity specialization. We also introduce VisualEvidence-Kit, a supervision-and-audit resource centered on a VisualEvidence-Agent that constructs a 754.7k VLA instructions VisualEvidence-Set for route supervision and counterfactual faithfulness tests. Across multiple benchmarks and real-robot evaluation, VISUALTHINK-VLA achieves the highest success rate on most benchmarks while reducing the multi-second latency of reasoning-augmented baselines to the sub-second regime. For example, on BridgeData V2, it reduces step latency from 8.377 s with ECoT to 0.367 s, achieving a 22.8× speedup.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces VISUALTHINK-VLA, a visual intermediate-reasoning framework for vision-language-action (VLA) policies. It replaces textual chain-of-thought with a compact visual-evidence interface that preserves spatial precision and avoids autoregressive decoding latency, combined with a selective routing mechanism to learn visual evidence tokens while preserving specialization. The authors also present VisualEvidence-Kit, built around a VisualEvidence-Agent that produces a 754.7k-instruction VisualEvidence-Set used for route supervision and counterfactual faithfulness tests. Across benchmarks and real-robot evaluations, the method is reported to achieve the highest success rates on most tasks while reducing step latency from multi-second (e.g., 8.377 s with ECoT on BridgeData V2) to sub-second regimes (0.367 s, 22.8× speedup).

Significance. If the central claims are substantiated, the work would represent a meaningful advance for real-time embodied control. Replacing textual reasoning with a visual-evidence interface directly targets the latency and interference problems that currently limit reasoning-augmented VLAs in closed-loop settings. The VisualEvidence-Kit, if released with the claimed scale and audit capabilities, would constitute a reusable resource for the community. The selective-routing design offers a concrete mechanism for trading off capacity and speed without full model duplication.

major comments (2)
  1. [Abstract and experimental results] Abstract and §4 (experimental results): the reported 22.8× latency reduction and top success rates on BridgeData V2 and other benchmarks are presented as aggregate outcomes. No ablation is described that disables the selective routing module while retaining the visual-evidence interface and base VLA backbone; without this isolation, the performance gains cannot be attributed to the routing mechanism rather than the visual tokens or model capacity alone.
  2. [VisualEvidence-Kit description] §3.2 (VisualEvidence-Kit and VisualEvidence-Set): the 754.7k-set is stated to supply both route supervision and counterfactual faithfulness tests, yet no quantitative faithfulness metric (e.g., route-prediction accuracy on held-out counterfactual pairs or correlation between route correctness and downstream action success) is reported. This leaves the key assumption that the set teaches correct routing rather than spurious correlations unverified.
minor comments (2)
  1. [Abstract] Abstract: the sentence beginning 'Besides, to further improve performance...' is grammatically awkward and could be rephrased for clarity.
  2. [Method overview] Notation: the term 'VisualEvidence-Set' is introduced without an explicit definition of its format or schema; a short table or figure illustrating one example entry would aid reproducibility.
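The two faithfulness metrics the referee names in major comment 2 could be computed roughly as follows. This is a sketch under assumptions: route labels are taken to be hashable values, correctness and success are encoded as 0/1, and the function names are hypothetical. With binary inputs the Pearson coefficient below is the phi coefficient.

```python
import math


def route_accuracy(predicted_routes, reference_routes):
    """Fraction of held-out items whose predicted route matches the reference."""
    if len(predicted_routes) != len(reference_routes) or not reference_routes:
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(p == r for p, r in zip(predicted_routes, reference_routes))
    return matches / len(reference_routes)


def pearson(xs, ys):
    """Pearson correlation; for 0/1 inputs this equals the phi coefficient."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / math.sqrt(vx * vy)


# Synthetic placeholder labels, not data from the paper.
route_correct = [1, 1, 0, 0]
action_success = [1, 1, 0, 0]
print(pearson(route_correct, action_success))
```

A high route accuracy paired with a near-zero correlation would suggest the router fits its labels without those routes mattering for the action, which is the spurious-correlation case the referee raises.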

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and positive evaluation of the potential contributions of VISUALTHINK-VLA. We address each major comment below and commit to revisions that strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract and experimental results] Abstract and §4 (experimental results): the reported 22.8× latency reduction and top success rates on BridgeData V2 and other benchmarks are presented as aggregate outcomes. No ablation is described that disables the selective routing module while retaining the visual-evidence interface and base VLA backbone; without this isolation, the performance gains cannot be attributed to the routing mechanism rather than the visual tokens or model capacity alone.

    Authors: We agree that the manuscript would benefit from an explicit ablation that isolates the selective routing module while retaining the visual-evidence interface and base backbone. Current experiments compare against external baselines that lack both components, but this does not fully disentangle the routing contribution. We will add the requested ablation study (with and without routing) to §4 in the revision. revision: yes

  2. Referee: [VisualEvidence-Kit description] §3.2 (VisualEvidence-Kit and VisualEvidence-Set): the 754.7k-set is stated to supply both route supervision and counterfactual faithfulness tests, yet no quantitative faithfulness metric (e.g., route-prediction accuracy on held-out counterfactual pairs or correlation between route correctness and downstream action success) is reported. This leaves the key assumption that the set teaches correct routing rather than spurious correlations unverified.

    Authors: We acknowledge that no quantitative faithfulness metrics (such as route-prediction accuracy on held-out counterfactual pairs or correlation with action success) are reported in the current manuscript, even though the set is used for supervision and tests. We will add these metrics to §3.2 in the revision to substantiate the routing quality. revision: yes
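The ablation requested in the first point can be framed as three nested variants, with each component's contribution read off as a difference between adjacent variants. The sketch below assumes that framing; the variant names and the `attribute_gains` helper are hypothetical, and the numbers in the usage line are placeholders, not results from the paper.

```python
# Nested variants: each adds one component to the previous one (assumed design).
VARIANTS = (
    "backbone",
    "backbone+interface",
    "backbone+interface+routing",
)


def attribute_gains(metric_by_variant):
    """Split the total change in a metric into per-component differences.

    metric_by_variant maps each name in VARIANTS to a scalar such as
    success rate (or step latency, where negative values mean faster).
    """
    missing = [v for v in VARIANTS if v not in metric_by_variant]
    if missing:
        raise KeyError(f"missing variants: {missing}")
    base, iface, full = (metric_by_variant[v] for v in VARIANTS)
    return {
        "interface_gain": iface - base,
        "routing_gain": full - iface,
        "total_gain": full - base,
    }


# Placeholder success rates for illustration only.
placeholder = {
    "backbone": 0.5,
    "backbone+interface": 0.625,
    "backbone+interface+routing": 0.75,
}
print(attribute_gains(placeholder))
```

If `routing_gain` is close to zero while `interface_gain` carries most of the total, the improvement would be attributable to the visual tokens and not to the routing mechanism.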

Circularity Check

0 steps flagged

No circularity detected; empirical claims rest on benchmark results without self-referential reductions or fitted inputs.

Full rationale

The paper introduces VISUALTHINK-VLA with a visual-evidence interface and selective routing, evaluated via success rates and latency metrics (e.g., BridgeData V2: 0.367 s vs. 8.377 s). No equations, parameter fits renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the provided text. The VisualEvidence-Set and VisualEvidence-Kit are presented as new resources for supervision, not as circular inputs. The derivation chain is self-contained through experimental reporting rather than reducing to its own assumptions by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no equations, parameters, or background assumptions; ledger is empty.

pith-pipeline@v0.9.1-grok · 5837 in / 1055 out tokens · 23756 ms · 2026-06-29T08:10:15.437077+00:00 · methodology

