Trust Your Instincts: Confidence-Driven Test-Time RL for Vision-Language-Action Models

Jiakang Yuan; Jiaxin Wang; Siyao Chen; Tao Chen

arxiv: 2606.29892 · v1 · pith:BMF3H23Unew · submitted 2026-06-29 · 💻 cs.RO · cs.AI

Siyao Chen, Jiakang Yuan, Jiaxin Wang, Tao Chen

Pith reviewed 2026-06-30 05:59 UTC · model grok-4.3

classification 💻 cs.RO cs.AI

keywords: vision-language-action models, test-time reinforcement learning, intrinsic reward, confidence-driven learning, robotics, self-bootstrapping, policy improvement

The pith

Vision-language-action models can self-improve at test time by treating high-confidence trajectories as intrinsic rewards.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that discrete-action VLAs generate trajectories whose confidence scores already correlate with actual task success. It introduces T^2VLA, a test-time reinforcement learning procedure that replaces external rewards with a similarity metric to the model's own high-confidence demonstrations. A dual-expert mechanism keeps one local pseudo-expert for exploration while a global pool maintains training stability. Experiments on LIBERO and RoboTwin demonstrate gains over supervised baselines that approach the results of oracle RL supplied with ground-truth rewards. The same procedure works for both OpenVLA-OFT and the pi-series architectures.
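To make the confidence idea concrete, here is a minimal sketch of how a trajectory-level confidence score could be computed and used to pick the most confident rollout. The abstract does not say how confidence is defined, so the mean log-probability of the sampled action tokens is an assumption, and the names `trajectory_confidence` and `elect_local_expert` are hypothetical.

```python
import math

def trajectory_confidence(token_probs):
    """Mean log-probability of the sampled action tokens.

    ASSUMPTION: the source does not define confidence; this is one common
    choice for discrete-token policies. Probabilities must be in (0, 1].
    """
    if not token_probs:
        raise ValueError("trajectory has no action tokens")
    return sum(math.log(p) for p in token_probs) / len(token_probs)

def elect_local_expert(rollouts):
    """Return (index of the most confident rollout, all confidence scores)."""
    scores = [trajectory_confidence(r) for r in rollouts]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

# Illustrative per-token probabilities (made up, not from the paper).
rollouts = [
    [0.90, 0.80, 0.95],  # fairly confident
    [0.50, 0.40, 0.60],  # hesitant
    [0.99, 0.97, 0.98],  # most confident
]
best, scores = elect_local_expert(rollouts)
print(best)  # prints 2
```

Averaging rather than summing the log-probabilities keeps the score from penalizing longer trajectories purely for their length, which may or may not match the paper's choice.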

Core claim

T^2VLA performs test-time policy improvement in VLAs by using trajectory-level similarity to high-confidence expert demonstrations as an intrinsic reward signal, together with a Confidence-Driven Dual Expert Bootstrapping mechanism that dynamically balances a Local Pseudo-Expert for exploration against a Global Expert Pool for stability, thereby achieving effective learning without any external environmental feedback.

What carries the argument

Confidence-Driven Dual Expert Bootstrapping mechanism that generates intrinsic rewards from model confidence and balances local exploration with global stability.
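One way to picture the dual-expert mechanism is a bounded pool of the most confident trajectories plus a weighted blend of two similarity scores. The sketch below assumes a pool that keeps the top K entries by confidence (the Figure 4 caption reports K = 5 as best on LIBERO-Long) and a convex combination of local and global similarity. The source only says the balance is dynamic, so `local_weight` is left as an input, and `GlobalExpertPool` and `hybrid_reward` are hypothetical names.

```python
import heapq

class GlobalExpertPool:
    """Keeps the K highest-confidence trajectories seen so far (assumed policy)."""

    def __init__(self, capacity=5):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._heap = []      # min-heap of (confidence, insertion_id, trajectory)
        self._next_id = 0    # unique tie-breaker so trajectories are never compared

    def add(self, confidence, trajectory):
        item = (confidence, self._next_id, trajectory)
        self._next_id += 1
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, item)
        else:
            # Push the new item, then drop whichever entry is least confident.
            heapq.heappushpop(self._heap, item)

    def trajectories(self):
        """Pool contents, most confident first."""
        return [traj for _, _, traj in sorted(self._heap, reverse=True)]

def hybrid_reward(sim_local, sim_global, local_weight):
    """Convex blend of similarity to the local and global experts.

    ASSUMPTION: the paper's actual weighting rule is not given in the source.
    """
    if not 0.0 <= local_weight <= 1.0:
        raise ValueError("local_weight must lie in [0, 1]")
    return local_weight * sim_local + (1.0 - local_weight) * sim_global

pool = GlobalExpertPool(capacity=2)
pool.add(0.1, "rollout-a")
pool.add(0.9, "rollout-b")
pool.add(0.5, "rollout-c")   # evicts rollout-a, the least confident entry
kept = pool.trajectories()
reward = hybrid_reward(sim_local=1.0, sim_global=0.0, local_weight=0.25)
```

Leaving the weight as a parameter keeps the sketch honest about what is unknown: a higher `local_weight` favors freshly discovered behavior, a lower one leans on the stable pool.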

If this is right

Outperforms supervised imitation-learning baselines on the LIBERO and RoboTwin benchmarks.
Approaches the performance level of oracle RL that receives ground-truth rewards.
Operates without external reward feedback while still producing measurable policy improvement.
Transfers across distinct VLA architectures including OpenVLA-OFT and the pi series.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same confidence-as-reward idea may extend to other autoregressive generation settings where success is hard to measure externally.
Test-time self-bootstrapping could lower the data-collection cost of training embodied agents by reusing the model's own outputs.
If confidence tracks success reliably, future VLAs might incorporate lightweight test-time updates as a standard deployment step.

Load-bearing premise

Similarity to high-confidence trajectories reliably indicates task success when no external reward is available.

What would settle it

A controlled test in which high-confidence trajectories are systematically unsuccessful yet the method still reports policy gains.
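A check of this kind needs a number for how strongly confidence tracks success. A simple candidate is the point-biserial correlation, which is the Pearson correlation between a continuous score and a 0/1 outcome. The sketch below is an illustration under that assumption, with made-up data; `point_biserial` is a hypothetical helper, not something the paper describes.

```python
import math

def point_biserial(confidences, successes):
    """Pearson correlation between confidence scores and binary success labels."""
    n = len(confidences)
    if n != len(successes) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_c = sum(confidences) / n
    mean_s = sum(successes) / n
    cov = sum((c - mean_c) * (s - mean_s) for c, s in zip(confidences, successes))
    var_c = sum((c - mean_c) ** 2 for c in confidences)
    var_s = sum((s - mean_s) ** 2 for s in successes)
    if var_c == 0 or var_s == 0:
        raise ValueError("correlation is undefined when either input is constant")
    return cov / math.sqrt(var_c * var_s)

# Made-up rollouts: confidence per rollout and whether the task succeeded.
confidences = [0.9, 0.8, 0.2, 0.1]
aligned = point_biserial(confidences, [1, 1, 0, 0])    # confident rollouts succeed
inverted = point_biserial(confidences, [0, 0, 1, 1])   # confident rollouts fail
```

In the inverted regime the correlation turns strongly negative, which is the situation in which any reported policy gain from confidence-derived rewards would be suspect.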

Figures

Figures reproduced from arXiv: 2606.29892 by Jiakang Yuan, Jiaxin Wang, Siyao Chen, Tao Chen.

**Figure 2.** Overview of the T^2VLA framework. Our pipeline autonomously mines behavioral anchors by identifying a Local Pseudo-Expert from exploratory rollouts and maintaining a Global Expert Pool. These references are integrated via a DTW-based Hybrid Similarity Reward to compute advantages, enabling continuous policy optimization without external reward signals.

**Figure 3.** Correlation between VLA generation confidence and task success.

**Figure 4.** Impact of Expert Pool Capacity (K). Evaluated on LIBERO-Long, where K = 5 yields the best performance. Overall, these results indicate that the local and global experts are highly complementary: the former steers the policy toward newly discovered successful modes, while the latter serves as a stabilizing anchor to prevent policy degradation when current rollouts are poor. Their synergy yields a balanced…

**Figure 5.** Trajectory alignment comparison. In-situ training trajectories projected onto the principal action axis. (a) Rigid Euclidean matching yields high residual errors under temporal shifts. (b) DTW dynamically warps the time axis to map structurally similar states, recovering a robust spatial similarity measure. … geometric-aware matching prevents high-quality exploratory rollouts from being assigned spuriously l…

**Figure 6.** Generalization of confidence–success relationships.

**Figure 7.** Expert election with the continuous-action GR00T policy on …

**Figure 8.** Confidence–success relationship during policy optimization.

**Figure 9.** Empirical analysis of pure confidence-based rewards on the LIBERO-Long benchmark. The curves demonstrate that confidence-driven optimization initially boosts the evaluation performance (blue) to around 90%. However, it eventually plateaus and degrades below the 86.5% baseline, remaining consistently below the training success rate (orange). This dynamic suggests that a single scalar metric lacks sufficie…

**Figure 10.** Learning dynamics on the LIBERO-Goal (traj1) benchmark. The plot tracks the success rates of exploratory training rollouts (orange) and periodic evaluations (blue). T^2VLA steadily improves the policy from an initial 59.6% to a peak of 83.0% using exclusively intrinsic rewards, with environment signals reserved strictly for monitoring.

**Figure 11.** Ablation of the Dual Expert mechanism on the LIBERO-Goal (traj1) benchmark. The plot contrasts the evaluation success rates of the Local Expert Only (orange) and Global Expert Only (green) configurations against our synergistic Dual Expert approach (blue). The initial SFT baseline is denoted by the dashed gray line. The dynamic weighting effectively combines the responsiveness of the local expert with t…

**Figure 12.** Exploratory rollouts in a high-competence regime (base success rate 86.5%). The initial policy ensures all 8 rollouts cover the full task horizon without early termination. The framework selects a structurally complete expert (red line), providing valid anchors for stable policy refinement.

**Figure 13.** A failure case visualizing the bootstrapping threshold on the LIBERO-Long suite (1-shot setting). The framework evaluates 8 exploratory rollouts generated by a weak initial model and assigns the highest confidence (Reward = 1.000) to a spatially truncated trajectory (Expert, red line). Trajectories attempting to explore further (e.g., blue, purple) receive lower scores. Bootstrapping from such incomplete …
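The Figure 5 caption contrasts rigid Euclidean matching with dynamic time warping (DTW). A small sketch of textbook DTW shows why a temporal shift hurts the rigid measure but not the warped one. This is the classic dynamic-programming formulation, not the paper's Hybrid Similarity Reward, whose exact form is not given here. The function names and the `1 / (1 + distance)` similarity mapping are assumptions for illustration.

```python
import math

def euclid(a, b):
    """Euclidean distance between two equal-length action vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rigid_distance(traj_a, traj_b):
    """Step-by-step matching: step i of one trajectory against step i of the other."""
    return sum(euclid(a, b) for a, b in zip(traj_a, traj_b))

def dtw_distance(traj_a, traj_b):
    """Classic O(n*m) dynamic time warping distance."""
    n, m = len(traj_a), len(traj_b)
    if n == 0 or m == 0:
        raise ValueError("trajectories must be non-empty")
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = euclid(traj_a[i - 1], traj_b[j - 1])
            # Extend the cheapest of: advance a only, advance b only, advance both.
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def dtw_similarity(traj_a, traj_b):
    """ASSUMPTION: map distance to (0, 1] so identical shapes score 1.0."""
    return 1.0 / (1.0 + dtw_distance(traj_a, traj_b))

# Same motion, but the second rollout pauses for one step at the start.
reference = [(0.0,), (1.0,), (2.0,), (3.0,)]
delayed = [(0.0,), (0.0,), (1.0,), (2.0,), (3.0,)]
rigid = rigid_distance(reference, delayed)
warped = dtw_distance(reference, delayed)
```

Here the rigid distance is 3.0 while the DTW distance is 0.0, since warping absorbs the one-step pause. That is the kind of spuriously low score for a good rollout that the caption says DTW avoids.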

Original abstract

Reinforcement learning (RL) has become indispensable for pushing Vision-Language-Action Models (VLAs) beyond static imitation learning. However, existing RL methods typically require external environmental feedback, relying on predefined success signals to guide policy updates. In this work, we show that VLA models possess useful internal evaluative capabilities: in discrete-action VLAs, trajectories with higher generation confidence are significantly more likely to succeed. Based on this observation, we introduce T^2VLA (Test-time VLA), an architecture-agnostic test-time RL framework that enables VLA models to achieve self-bootstrapping policy improvement. Instead of relying on external rewards, T^2VLA leverages trajectory-level similarity to high-confidence expert demonstrations as an intrinsic reward signal. In addition, we propose a Confidence-Driven Dual Expert Bootstrapping mechanism, which dynamically balances a Local Pseudo-Expert for exploration and a Global Expert Pool for training stability. Extensive experiments on the LIBERO and RoboTwin benchmarks show that T^2VLA consistently outperforms supervised baselines and approaches oracle RL performance with ground-truth rewards, achieving effective improvement without external reward feedback. Furthermore, T^2VLA adapts to distinct VLA paradigms, including both OpenVLA-OFT and the pi series.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

T^2VLA uses trajectory similarity to high-confidence demos as an intrinsic reward for test-time VLA improvement and reports gains on LIBERO and RoboTwin, but the abstract leaves the key correlation unverified.

The paper's main move is to take an observation about higher-confidence trajectories succeeding more often in discrete-action VLAs and turn it into a test-time RL loop. Instead of external rewards, T^2VLA defines an intrinsic signal from how similar a rollout is to high-confidence expert demonstrations, then adds a dual-expert mechanism that mixes a local pseudo-expert for exploration with a global pool for stability. The experiments claim this gets close to oracle RL performance while beating supervised baselines and works on both OpenVLA-OFT and the pi series.
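For readers who want to see how an intrinsic reward could feed a policy update, the sketch below turns per-rollout similarity rewards into advantages by normalizing within the group of rollouts. The source only says the reward is used to compute advantages and does not name the estimator, so group normalization is an assumption, as is the function name `group_normalized_advantages`.

```python
import statistics

def group_normalized_advantages(rewards, eps=1e-8):
    """Center and scale rewards within one group of rollouts.

    ASSUMPTION: the estimator used in the paper is not stated in the source;
    this is a generic group-relative baseline for illustration only.
    """
    if len(rewards) < 2:
        raise ValueError("need at least two rollouts to form a group baseline")
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps avoids division by zero when every rollout has the same reward.
    return [(r - mean) / (std + eps) for r in rewards]

# Made-up similarity rewards for three rollouts of the same task.
advantages = group_normalized_advantages([1.0, 0.5, 0.0])
```

Rollouts more similar to the expert than the group average get a positive advantage and the rest a negative one, so no external success signal enters the update.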

The concrete results on the two benchmarks are the part that stands out. If the numbers hold up in the full text, the method gives a workable route for self-improvement in settings where reward design is expensive.

The soft spot is exactly where the stress-test note points: the reward is built on similarity rather than raw confidence, yet the motivating observation is scoped to discrete actions and no direct check is described that the similarity metric preserves a positive link to actual task success. Without that link, the bootstrapping loop has no guarantee of improvement. The abstract also gives no details on how confidence is computed or what the similarity metric actually is, and mentions no statistical controls, so the central claim cannot be assessed from the provided text. The architecture-agnostic framing stretches beyond the evidence shown.

This is for people working on practical test-time methods for VLAs in robotics. A reader who wants to try reducing external reward dependence would find the benchmark setup useful. It deserves peer review because the idea is straightforward to implement and the experiments use standard suites, even if the reward proxy needs tighter validation.

Referee Report

3 major / 2 minor

Summary. The paper observes that, in discrete-action VLAs, trajectories with higher generation confidence are significantly more likely to succeed. It introduces T^2VLA, an architecture-agnostic test-time RL method that defines an intrinsic reward via trajectory-level similarity to high-confidence expert demonstrations, augmented by a Confidence-Driven Dual Expert Bootstrapping mechanism (local pseudo-expert for exploration, global expert pool for stability). Experiments on the LIBERO and RoboTwin benchmarks claim that T^2VLA outperforms supervised baselines and approaches oracle RL performance with ground-truth rewards, while adapting to OpenVLA-OFT and pi-series VLAs.

Significance. If the central results hold, the work demonstrates a practical route to self-bootstrapping improvement in VLAs at test time without external reward signals or environment feedback. This could reduce dependence on hand-crafted success detectors in robotics deployment. The architecture-agnostic framing and dual-expert design are notable strengths if the similarity proxy is shown to be reliable.

major comments (3)

[§3 (Method)] The motivating observation (higher confidence predicts success) is scoped to discrete-action VLAs, yet the method is presented as architecture-agnostic and applied to multiple VLA paradigms; the manuscript must clarify whether the confidence-success correlation was verified for continuous-action or other paradigms, or whether the similarity metric substitutes without re-validation.
[§3.2 (Reward Definition) and §5 (Experiments)] The intrinsic reward is defined via trajectory-level similarity to high-confidence expert rollouts rather than confidence directly. The central claim that this produces reliable policy improvement requires explicit evidence that the chosen similarity metric (action or latent space) maintains a strong positive correlation with ground-truth task success; without such validation or ablation, the self-bootstrapping loop rests on an unverified proxy.
[§5 (Experiments)] Table or figure reporting benchmark results (LIBERO/RoboTwin) should include statistical controls, data-split details, and confidence intervals; the abstract-level claim of "consistently outperforms" and "approaches oracle RL" cannot be assessed for robustness without these.

minor comments (2)

[§3] Notation for the similarity metric and the dual-expert weighting should be introduced with explicit equations to avoid ambiguity in the bootstrapping mechanism.
[§3.1] Clarify whether the high-confidence expert demonstrations are drawn from the same policy or held-out data, as this affects potential circularity in the reward signal.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below with clarifications and proposed revisions where appropriate. Our responses focus on strengthening the manuscript without misrepresenting the presented results.

Referee: [§3 (Method)] The motivating observation (higher confidence predicts success) is scoped to discrete-action VLAs, yet the method is presented as architecture-agnostic and applied to multiple VLA paradigms; the manuscript must clarify whether the confidence-success correlation was verified for continuous-action or other paradigms, or whether the similarity metric substitutes without re-validation.

Authors: The manuscript explicitly scopes the motivating observation to discrete-action VLAs in both the abstract and §3. The architecture-agnostic claim refers to the overall T^2VLA framework (similarity-based intrinsic reward + dual-expert bootstrapping), which does not require direct access to per-token confidence at inference. For continuous-action paradigms (e.g., certain pi-series variants), the similarity metric is used as a direct substitute without re-validating the confidence-success correlation in those settings. We will revise §3 to make this scoping and substitution explicit, including a short note that the correlation verification remains limited to the discrete case examined in the motivating experiments.

revision: yes

Referee: [§3.2 (Reward Definition) and §5 (Experiments)] The intrinsic reward is defined via trajectory-level similarity to high-confidence expert rollouts rather than confidence directly. The central claim that this produces reliable policy improvement requires explicit evidence that the chosen similarity metric (action or latent space) maintains a strong positive correlation with ground-truth task success; without such validation or ablation, the self-bootstrapping loop rests on an unverified proxy.

Authors: We agree that the reward relies on trajectory similarity rather than raw confidence and that direct validation of the similarity-success correlation would strengthen the central claim. The current manuscript motivates the proxy via the discrete-action observation but does not include an explicit ablation correlating the chosen similarity metric against ground-truth success. We will add this analysis (e.g., a correlation plot or ablation table) to §5 in the revision.

revision: yes

Referee: [§5 (Experiments)] Table or figure reporting benchmark results (LIBERO/RoboTwin) should include statistical controls, data-split details, and confidence intervals; the abstract-level claim of "consistently outperforms" and "approaches oracle RL" cannot be assessed for robustness without these.

Authors: The referee correctly identifies that the reported results lack explicit confidence intervals, multi-seed statistics, and detailed data-split descriptions. We will revise the experimental section and associated tables/figures to include means ± standard deviation over multiple random seeds, clarify the train/test splits used, and add the requested statistical controls.

revision: yes
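The multi-seed reporting promised in the last response is simple to produce. The sketch below summarizes per-seed success rates as mean ± sample standard deviation. The numbers are invented for illustration and are not results from the paper, and the helper names `summarize_seeds` and `format_result` are hypothetical.

```python
import statistics

def summarize_seeds(success_rates):
    """Return (mean, sample standard deviation) over per-seed success rates."""
    if len(success_rates) < 2:
        raise ValueError("need at least two seeds to estimate a standard deviation")
    return statistics.fmean(success_rates), statistics.stdev(success_rates)

def format_result(success_rates):
    """Render a table cell such as '82.0 ± 2.0 (n=3)'."""
    mean, std = summarize_seeds(success_rates)
    return f"{mean:.1f} ± {std:.1f} (n={len(success_rates)})"

# Invented success rates (%) for three random seeds; NOT taken from the paper.
cell = format_result([80.0, 82.0, 84.0])
print(cell)  # prints 82.0 ± 2.0 (n=3)
```

Reporting the number of seeds next to the spread lets a reader judge how much weight a gap between two methods can bear.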

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external observation and benchmark validation.

The paper's core claim rests on an empirical observation (higher generation confidence correlates with success in discrete-action VLAs) used to motivate an intrinsic reward defined as trajectory similarity to high-confidence expert rollouts. This is presented as an architecture-agnostic test-time RL method with experimental results on LIBERO and RoboTwin showing outperformance over supervised baselines. No equations or definitions are provided in the available text that reduce the reward signal or performance gains to the inputs by construction (e.g., no self-referential normalization or fitted parameter renamed as prediction). The bootstrapping mechanism is the intended self-improvement loop, not a definitional equivalence. Self-citations, if present, are not load-bearing for the central result. The derivation is therefore self-contained against the stated benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review performed on abstract only; full derivation details, parameter counts, and any additional axioms are unavailable.

axioms (1)

Domain assumption: Trajectories with higher generation confidence are significantly more likely to succeed.
This correlation is stated as the foundational observation that justifies using confidence-derived similarity as reward.

Reference graph

Works this paper leans on

52 extracted references · 37 canonical work pages · 24 internal anchors

[1]

arXiv preprint arXiv:2512.14666 (2025)

Bai, Z., Gao, C., Shou, M.Z.: Evolve-vla: Test-time training from environment feed- back for vision-language-action models. arXiv preprint arXiv:2512.14666 (2025)

work page arXiv 2025
[2]

GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

Bjorck, J., Castañeda, F., Cherniadev, N., Da, X., Ding, R., Fan, L., Fang, Y., Fox, D., Hu, F., Huang, S., et al.: Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

Black, K., Brown, N., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Groom, L., Hausman, K., Ichter, B., et al.:π0: A Vision-Language-Action Flow Model for General Robot Control. arXiv preprint arXiv:2410.24164 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[4]

RT-1: Robotics Transformer for Real-World Control at Scale

Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Dabis, J., Finn, C., Gopalakr- ishnan, K., Hausman, K., Herzog, A., Hsu, J., et al.: Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817 (2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022
[5]

UniVLA: Learning to Act Anywhere with Task-centric Latent Actions

Bu, Q., Yang, Y., Cai, J., Gao, S., Ren, G., Yao, M., Luo, P., Li, H.: Uni- vla: Learning to act anywhere with task-centric latent actions. arXiv preprint arXiv:2505.06111 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

In: Conference on Robot Learning

Chebotar, Y., Vuong, Q., Hausman, K., Xia, F., Lu, Y., Irpan, A., Kumar, A., Yu, T., Herzog, A., Pertsch, K., et al.: Q-transformer: Scalable offline reinforcement learning via autoregressive q-functions. In: Conference on Robot Learning. pp. 3909–3928. PMLR (2023)

2023
[7]

Chen,D.,Wang,D.,Darrell,T.,Ebrahimi,S.:Contrastivetest-timeadaptation.In: Proceedings of the IEEE/CVF conference on computer vision and pattern recog- nition. pp. 295–305 (2022)

2022
[8]

arXiv preprint arXiv:2510.25889 (2025)

Chen, K., Liu, Z., Zhang, T., Guo, Z., Xu, S., Lin, H., Zang, H., Li, X., Zhang, Q., Yu, Z., et al.:πRL: Online RL Fine-tuning for Flow-based Vision-Language-Action Models. arXiv preprint arXiv:2510.25889 (2025)

work page arXiv 2025
[9]

RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation

Chen, T., Chen, Z., Chen, B., Cai, Z., Liu, Y., Li, Z., Liang, Q., Lin, X., Ge, Y., Gu, Z., et al.: Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation. arXiv preprint arXiv:2506.18088 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[10]

StarVLA: A Lego-like Codebase for Vision-Language-Action Model Developing

Community, S.: Starvla: A lego-like codebase for vision-language-action model de- veloping. arXiv preprint arXiv:2604.05014 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026
[11]

PaLM-E: An Embodied Multimodal Language Model

Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[12]

In: Proceedings of the 2024 Conference of the North American Chapter of the Association for Com- putational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Geng, J., Cai, F., Wang, Y., Koeppl, H., Nakov, P., Gurevych, I.: A survey of confidence estimation and calibration in large language models. In: Proceedings of the 2024 Conference of the North American Chapter of the Association for Com- putational Linguistics: Human Language Technologies (Volume 1: Long Papers). pp. 6577–6595 (2024)

2024
[13]

arXiv preprint arXiv:2509.22643 (2025)

Guo, W., Lu, G., Deng, H., Wu, Z., Tang, Y., Wang, Z.: Vla-reasoner: Empowering vision-language-action models with reasoning via online monte carlo tree search. arXiv preprint arXiv:2509.22643 (2025)

work page arXiv 2025
[14]

In: 2025 IEEE International Conference on Robotics and Automation (ICRA)

Guo, Y., Zhang, J., Chen, X., Ji, X., Wang, Y.J., Hu, Y., Chen, J.: Improving vision-language-action model with online reinforcement learning. In: 2025 IEEE International Conference on Robotics and Automation (ICRA). pp. 15665–15672. IEEE (2025)

2025
[15]

ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning

Huang, C.P., Wu, Y.H., Chen, M.H., Wang, Y.C.F., Yang, F.E.: Thinkact: Vision- language-action reasoning via reinforced visual latent planning. arXiv preprint arXiv:2507.16815 (2025) Trust Your Instincts: Confidence-Driven Test-Time RL for VLA Models 17

work page internal anchor Pith review Pith/arXiv arXiv 2025
[16]

arXiv preprint arXiv:2508.02219 (2025)

Huang, D., Fang, Z., Zhang, T., Li, Y., Zhao, L., Xia, C.: Co-rft: Efficient fine- tuning of vision-language-action models through chunked offline reinforcement learning. arXiv preprint arXiv:2508.02219 (2025)

work page arXiv 2025
[17]

$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

Intelligence, P., Black, K., Brown, N., Darpinian, J., Dhabalia, K., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., et al.:π0.5: a Vision-Language-Action Model with Open-World Generalization. arXiv preprint arXiv:2504.16054 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[18]

In: Forty-first International Conference on Machine Learning (2024)

Karamcheti, S., Nair, S., Balakrishna, A., Liang, P., Kollar, T., Sadigh, D.: Pris- matic vlms: Investigating the design space of visually-conditioned language models. In: Forty-first International Conference on Machine Learning (2024)

2024
[19]

Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

Kim, M.J., Finn, C., Liang, P.: Fine-tuning vision-language-action models: Opti- mizing speed and success. arXiv preprint arXiv:2502.19645 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[20]

OpenVLA: An Open-Source Vision-Language-Action Model

Kim, M.J., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., Rafailov, R., Foster, E., Lam, G., Sanketi, P., et al.: Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[21]

SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning

Li, H., Zuo, Y., Yu, J., Zhang, Y., Yang, Z., Zhang, K., Zhu, X., Zhang, Y., Chen, T., Cui, G., et al.: Simplevla-rl: Scaling vla training via reinforcement learning. arXiv preprint arXiv:2509.09674 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[22]

arXiv preprint arXiv:2508.06266 (2025)

Li, Z., Yang, R., Chen, R., Luo, Z., Chen, L.: Adpro: a test-time adaptive diffusion policy via manifold-constrained denoising and task-aware initialization for robotic manipulation. arXiv preprint arXiv:2508.06266 (2025)

work page arXiv 2025
[23]

Towards Deploying VLA without Fine-Tuning: Plug-and-Play Inference-Time VLA Policy Steering via Embodied Evolutionary Diffusion

Li, Z., Liu, J., Dong, Z., Teng, T., Rouxel, Q., Caldwell, D., Chen, F.: Towards deploying vla without fine-tuning: Plug-and-play inference-time vla policy steering via embodied evolutionary diffusion. arXiv preprint arXiv:2511.14178 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[24]

Advances in Neural Information Processing Systems36, 44776–44791 (2023)

Liu, B., Zhu, Y., Gao, C., Feng, Y., Liu, Q., Zhu, Y., Stone, P.: Libero: Benchmark- ing knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems36, 44776–44791 (2023)

2023
[25]

Advances in neural information processing systems36, 34892–34916 (2023)

Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in neural information processing systems36, 34892–34916 (2023)

2023
[26]

arXiv preprint arXiv:2602.03973 (2026)

Liu, S., Singh, I.S., Xu, Y., Duan, J., Krishna, R.: Vls: Steering pretrained robot policies via vision-language models. arXiv preprint arXiv:2602.03973 (2026)

work page arXiv 2026
[27]

RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

Liu, S., Wu, L., Li, B., Tan, H., Chen, H., Wang, Z., Xu, K., Su, H., Zhu, J.: Rdt-1b: a diffusion foundation model for bimanual manipulation. arXiv preprint arXiv:2410.07864 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[28]

VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning

Lu, G., Guo, W., Zhang, C., Zhou, Y., Jiang, H., Gao, Z., Tang, Y., Wang, Z.: Vla- rl: Towards masterful and general robotic manipulation with scalable reinforcement learning. arXiv preprint arXiv:2505.18719 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[29]

What Matters in Learning from Offline Human Demonstrations for Robot Manipulation

Mandlekar, A., Xu, D., Wong, J., Nasiriany, S., Wang, C., Kulkarni, R., Fei-Fei, L., Savarese, S., Zhu, Y., Martín-Martín, R.: What matters in learning from offline human demonstrations for robot manipulation. arXiv preprint arXiv:2108.03298 (2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021
[30]

arXiv preprint arXiv:2410.13816 (2024)

Nakamoto, M., Mees, O., Kumar, A., Levine, S.: Steering your generalists: Improv- ing robotic foundation models via value guidance. arXiv preprint arXiv:2410.13816 (2024)

work page arXiv 2024
[31]

RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots

Nasiriany, S., Maddukuri, A., Zhang, L., Parikh, A., Lo, A., Joshi, A., Mandlekar, A., Zhu, Y.: Robocasa: Large-scale simulation of everyday tasks for generalist robots. arXiv preprint arXiv:2406.02523 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[32]

arXiv preprint arXiv:2508.12211 (2025) 18 Chen et al

Neary, C., Younis, O.G., Kuramshin, A., Aslan, O., Berseth, G.: Improving pre- trained vision-language-action policies with model-based search. arXiv preprint arXiv:2508.12211 (2025) 18 Chen et al

work page arXiv 2025
[33]

In: Findings of the Association for Computational Linguistics: ACL 2025

Nguyen, D., Payani, A., Mirzasoleiman, B.: Beyond semantic entropy: Boosting llm uncertainty quantification with pairwise semantic similarity. In: Findings of the Association for Computational Linguistics: ACL 2025. pp. 4530–4540 (2025)

2025
[34]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Peng, T., Li, M., Yuan, J., Zhou, H., Xia, R., Zhang, R., Bai, L., Mao, S., Wang, B., Zhou, A., et al.: Chimera: Improving generalist model with domain-specific experts. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 3011–3022 (2025)

2025
[35]

In: Proceedings of the 18th ACM SIGKDD interna- tional conference on Knowledge discovery and data mining

Rakthanmanon, T., Campana, B., Mueen, A., Batista, G., Westover, B., Zhu, Q., Zakaria, J., Keogh, E.: Searching and mining trillions of time series subsequences under dynamic time warping. In: Proceedings of the 18th ACM SIGKDD interna- tional conference on Knowledge discovery and data mining. pp. 262–270 (2012)

2012
[36]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al.: Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[37]

Google AI1, 11 (2025)

Silver, D., Sutton, R.S.: Welcome to the era of experience. Google AI1, 11 (2025)

2025
[38]

arXiv preprint arXiv:2506.09684 (2025)

Song, H., Ji, R., Shi, N., Lai, F., Kontar, R.A.: Inv-entropy: A fully probabilis- tic framework for uncertainty quantification in language models. arXiv preprint arXiv:2506.09684 (2025)

work page arXiv 2025
[39]

In: 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

Song, W., Zhao, H., Ding, P., Cui, C., Lyu, S., Fan, Y., Wang, D.: Germ: A generalist robotic model with mixture-of-experts for quadruped robot. In: 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 11879–11886. IEEE (2024)

2024
[40]

Interactive Post-Training for Vision-Language-Action Models

Tan, S., Dou, K., Zhao, Y., Krähenbühl, P.: Interactive post-training for vision- language-action models. arXiv preprint arXiv:2505.17016 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[41]

Team, O.M., Ghosh, D., Walke, H., Pertsch, K., Black, K., Mees, O., Dasari, S., Hejna, J., Kreiman, T., Xu, C., et al.: Octo: An open-source generalist robot policy. arXiv preprint arXiv:2405.12213 (2024)
[42]

Wang, D., Shelhamer, E., Liu, S., Olshausen, B., Darrell, T.: Tent: Fully test-time adaptation by entropy minimization. arXiv preprint arXiv:2006.10726 (2020)
[43]

Yang, S., Zhang, Y., He, H., Pan, L., Li, X., Bai, C., Li, X.: Steering vision-language-action models as anti-exploration: A test-time scaling approach. arXiv preprint arXiv:2512.02834 (2025)
[44]

Yuan, J., Zhang, B., Gong, K., Yue, X., Shi, B., Qiao, Y., Chen, T.: Reg-tta3d: Better regression makes better test-time adaptive 3d object detection. In: European conference on computer vision. pp. 197–213. Springer (2024)
[45]

Zang, H., Wei, M., Xu, S., Wu, Y., Guo, Z., Wang, Y., Lin, H., Shi, L., Xie, Y., Xu, Z., et al.: Rlinf-vla: A unified and efficient framework for vla+rl training. arXiv preprint arXiv:2510.06710 (2025)
[46]

Zhang, H., Zhuang, Z., Zhao, H., Ding, P., Lu, H., Wang, D.: Reinbot: Amplifying robot visual-language manipulation with reinforcement learning. arXiv preprint arXiv:2505.07395 (2025)
[47]

Zhang, Z., Zheng, K., Chen, Z., Jang, J., Li, Y., Han, S., Wang, C., Ding, M., Fox, D., Yao, H.: Grape: Generalizing robot policy via preference alignment. arXiv preprint arXiv:2411.19309 (2024)
[48]

Zhao, T.Z., Kumar, V., Levine, S., Finn, C.: Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705 (2023)
[49]

Zheng, Z., Peng, X., Yang, T., Shen, C., Li, S., Liu, H., Zhou, Y., Li, T., You, Y.: Open-sora: Democratizing efficient video production for all. arXiv preprint arXiv:2412.20404 (2024)
[50]

Zitkovich, B., Yu, T., Xu, S., Xu, P., Xiao, T., Xia, F., Wu, J., Wohlhart, P., Welker, S., Wahid, A., et al.: Rt-2: Vision-language-action models transfer web knowledge to robotic control. In: Conference on Robot Learning. pp. 2165–2183. PMLR (2023)
[51]

Zuo, Y., Zhang, K., Sheng, L., Qu, S., Cui, G., Zhu, X., Li, H., Zhang, Y., Long, X., Hua, E., et al.: Ttrl: Test-time reinforcement learning. arXiv preprint arXiv:2504.16084 (2025)

Appendix A Overview

This appendix provides additional technical details and experimental results for T^2VLA. The content is organized as follows:

– Section B: Det...
[52]

In particular, the relationship remains visible on LIBERO-10 even when the initial success rate is only approximately 17%, indicating that the confidence ordering is not limited to already strong policies. We additionally evaluate OpenVLA-OFT using action-conditioned observations synthesized by an OpenSora world model [49]. As shown in Figure 6(b), highe...
