pith. machine review for the scientific record.

arxiv: 2601.06748 · v3 · submitted 2026-01-11 · 💻 cs.RO

Recognition: unknown

On-the-Fly VLA Adaptation via Test-Time Reinforcement Learning

Authors on Pith: no claims yet
classification 💻 cs.RO
keywords: learning · reinforcement · tt-vla · adaptation · during · fine-tuning · models · on-the-fly
Original abstract

Vision-Language-Action (VLA) models have recently emerged as a powerful paradigm for general-purpose robot learning, enabling agents to map visual observations and natural-language instructions into executable robotic actions. Though popular, they are primarily trained via supervised fine-tuning or training-time reinforcement learning, which requires explicit fine-tuning phases, human intervention, or controlled data collection. Consequently, existing methods remain unsuitable for challenging simulated- or physical-world deployments, where robots must respond autonomously and flexibly to evolving environments. To address this limitation, we introduce Test-Time Reinforcement Learning for VLAs (TT-VLA), a framework that enables on-the-fly policy adaptation during inference. TT-VLA formulates a dense reward mechanism that leverages step-by-step task-progress signals to refine action policies at test time while preserving the SFT/RL-trained priors, making it an effective supplement to current VLA models. Empirical results show that our approach enhances overall adaptability, stability, and task success in dynamic, previously unseen scenarios in both simulated and real-world settings. We believe TT-VLA offers a principled step toward self-improving, deployment-ready VLAs.
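The abstract's core recipe — a dense task-progress reward driving a test-time policy update, regularized toward the pretrained prior — can be sketched in miniature. Everything below is illustrative, not the paper's method: the actual TT-VLA architecture, reward, and regularizer are not specified in the abstract, so this uses a toy linear-Gaussian policy standing in for the VLA action head, a distance-to-goal progress reward, a single-step REINFORCE update, and an L2 anchor to the pretrained weights as a stand-in for "preserving the SFT/RL-trained priors".

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy task: drive a 2-D "end-effector" toward a goal position.
ACTION_DIM = 2
goal = np.array([1.0, -1.0])

def progress_reward(pos_before, pos_after):
    """Dense step-by-step reward: how much closer this step got to the goal."""
    return np.linalg.norm(pos_before - goal) - np.linalg.norm(pos_after - goal)

# "Pretrained" weights play the role of the SFT/RL-trained prior.
W_prior = rng.normal(scale=0.1, size=(ACTION_DIM, 4))
W = W_prior.copy()
sigma, lr, anchor = 0.1, 0.05, 0.01  # exploration noise, step size, prior pull

pos = np.zeros(2)
for step in range(200):
    state = np.concatenate([pos, goal - pos])        # toy observation
    mean = W @ state
    action = mean + sigma * rng.normal(size=ACTION_DIM)
    new_pos = pos + 0.1 * action
    r = progress_reward(pos, new_pos)
    # REINFORCE-style test-time update from the single observed transition:
    # increase log-probability of actions that made progress.
    grad_logp = np.outer((action - mean) / sigma**2, state)
    W += lr * r * grad_logp
    # L2 anchor keeps the adapted policy near the pretrained prior.
    W -= lr * anchor * (W - W_prior)
    pos = new_pos

final_dist = np.linalg.norm(pos - goal)
```

The key structural points mirrored from the abstract are that adaptation happens online during the rollout (no separate fine-tuning phase), the reward is dense per-step progress rather than sparse task success, and the update is explicitly pulled back toward the frozen prior so test-time learning supplements rather than overwrites the pretrained policy.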

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. DreamAvoid: Critical-Phase Test-Time Dreaming to Avoid Failures in VLA Policies

    cs.RO 2026-05 unverdicted novelty 7.0

    DreamAvoid uses a Dream Trigger, Action Proposer, and Dream Evaluator trained on success/failure/boundary data to let VLA policies avoid critical-phase failures via test-time future dreaming.

  2. Test-Time Training for Visual Foresight Vision-Language-Action Models

    cs.CV 2026-05 unverdicted novelty 5.0

    T³VF applies test-time training with adaptive filtering to reduce OOD failures in VF-VLA models by treating predicted future images and actual next observations as natural training pairs.