hub Canonical reference

$\pi_{\text{RL}}$: Online RL Fine-tuning for Flow-based Vision-Language-Action Models. arXiv preprint arXiv:2510.25889

Kang Chen, Zhihao Liu, Tonghe Zhang, Zhen Guo, Si Xu, Hao Lin · 2025 · arXiv 2510.25889

Canonical reference. 73% of citing Pith papers cite this work as background.

22 Pith papers citing it

Background 73% of classified citations

citation-role summary

background 8 · baseline 1 · method 1 · other 1

citation-polarity summary

background 8 · baseline 1 · unclear 1 · use method 1

representative citing papers

DreamAvoid: Critical-Phase Test-Time Dreaming to Avoid Failures in VLA Policies

cs.RO · 2026-05-12 · unverdicted · novelty 7.0

DreamAvoid uses a Dream Trigger, Action Proposer, and Dream Evaluator trained on success/failure/boundary data to let VLA policies avoid critical-phase failures via test-time future dreaming.

One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

cs.CV · 2026-05-08 · conditional · novelty 7.0 · 3 refs

Reducing visual input to one token per frame in VLA world models maintains or improves long-horizon performance on MetaWorld, LIBERO, and real-robot tasks.

NoiseGate: Learning Per-Latent Timestep Schedules as Information Gating in World Action Models

cs.RO · 2026-05-08 · unverdicted · novelty 7.0

NoiseGate learns per-latent timestep schedules as an information-gating policy in diffusion-based world action models, yielding consistent gains on RoboTwin manipulation tasks.

ScoRe-Flow: Complete Distributional Control via Score-Based Reinforcement Learning for Flow Matching

cs.RO · 2026-04-13 · unverdicted · novelty 7.0

ScoRe-Flow achieves decoupled mean-variance control in stochastic flow matching by deriving a closed-form score for drift modulation plus learned variance, yielding faster RL convergence and higher success rates on locomotion and manipulation benchmarks.

Z-1: Efficient Reinforcement Learning for Vision-Language-Action Models

cs.RO · 2026-06-30 · unverdicted · novelty 6.0

Z-1 uses task-wise GRPO post-training on a flow-based VLA model to reach 80.6% average success across 24 RoboCasa tasks, a 13.2-point gain over its SFT baseline.

Trust Your Instincts: Confidence-Driven Test-Time RL for Vision-Language-Action Models

cs.RO · 2026-06-29 · unverdicted · novelty 6.0

T^2VLA is a test-time reinforcement learning framework for VLAs that uses internal confidence to define intrinsic rewards via similarity to high-confidence expert demonstrations and a dual-expert bootstrapping mechanism.

Agentic-VLA: Efficient Online Adaptation for Vision-Language-Action Models

cs.RO · 2026-05-21 · unverdicted · novelty 6.0

Agentic-VLA enables efficient online adaptation of VLA models, delivering +12.3% on long-horizon tasks, +28.5% in 1-shot learning, and 2.4x faster convergence on LIBERO through three new components.

Beyond Action Residuals: Real-World Robot Policy Steering via Bottleneck Latent Reinforcement Learning

cs.RO · 2026-05-19 · unverdicted · novelty 6.0

ZPRL adapts frozen flow-matching imitation policies via RL perturbations on a task-relevant bottleneck latent, yielding 33.7% higher average success on four real-world manipulation tasks than action-residual baselines.

Reinforcing VLAs in Task-Agnostic World Models

cs.AI · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

RAW-Dream disentangles world-model learning from task data by using a pre-trained task-agnostic world model and VLM rewards, with dual-noise filtering, to enable zero-shot VLA adaptation in simulation and real settings.

RePO-VLA: Recovery-Driven Policy Optimization for Vision-Language-Action Models

cs.RO · 2026-05-10 · unverdicted · novelty 6.0

RePO-VLA raises average adversarial success rates in VLA manipulation from 20% to 75% by using recovery-aware initialization, a progress-aware semantic value function, and value-conditioned refinement on success and corrective trajectories.

Learning While Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies

cs.RO · 2026-05-01 · unverdicted · novelty 6.0 · 2 refs

LWD is a fleet-scale offline-to-online RL framework that continually improves pretrained VLA policies using autonomous rollouts and human interventions, reaching 95% average success on real-world manipulation tasks.

LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning

cs.RO · 2026-04-30 · unverdicted · novelty 6.0 · 2 refs

LaST-R1 introduces an RL post-training method called LAPO that optimizes latent Chain-of-Thought reasoning in vision-language-action models, yielding 99.9% success on LIBERO and up to 22.5% real-world gains.

RL Token: Bootstrapping Online RL with Vision-Language-Action Models

cs.LG · 2026-04-24 · unverdicted · novelty 6.0

RL Token enables sample-efficient online RL fine-tuning of large VLAs, delivering up to 3x speed gains and higher success rates on real-robot manipulation tasks within minutes to hours.

MoRI: Mixture of RL and IL Experts for Long-Horizon Manipulation Tasks

cs.RO · 2026-04-11 · unverdicted · novelty 6.0

MoRI dynamically mixes RL and IL experts with variance-based switching and IL regularization to reach 97.5% success in four real-world robotic tasks while cutting human intervention by 85.8%.

RISE: Self-Improving Robot Policy with Compositional World Model

cs.RO · 2026-02-11 · unverdicted · novelty 6.0

RISE combines a controllable dynamics model and progress value model into a closed-loop self-improving pipeline that updates robot policies entirely in imagination, reporting over 35% absolute gains on three real-world tasks.

Towards Long-Lived Robots: Continual Learning VLA Models via Reinforcement Fine-Tuning

cs.RO · 2026-02-11 · unverdicted · novelty 6.0

LifeLong-RFT applies chunking-level on-policy reinforcement learning with Quantized Action Consistency Reward, Continuous Trajectory Alignment Reward, and Format Compliance Reward to fine-tune VLA models, achieving a 22% average success rate gain over supervised fine-tuning on the LIBERO benchmark's

$\pi^{*}_{0.6}$: a VLA That Learns From Experience

cs.LG · 2025-11-18 · unverdicted · novelty 6.0

RECAP enables a generalist VLA to self-improve via advantage-conditioned RL on mixed real-world data, more than doubling throughput and halving failure rates on hard manipulation tasks.

Physics-informed Goal-Conditioned Reinforcement Learning under Hybrid Contact Dynamics

cs.RO · 2026-05-28 · unverdicted · novelty 5.0

Analysis reveals Pi-GCRL degradation in contact-rich tasks due to hybrid dynamics; contact-aware and hierarchical formulations are proposed to extend it to manipulation.

BORA: Bridging Offline Reinforcement Learning and Online Residual Adaptation for Real-World Dexterous VLA Models

cs.RO · 2026-05-28 · unverdicted · novelty 5.0

BORA combines offline RL critic training with online chunk-wise residual adaptation to raise average success rates of real-world dexterous VLA policies by 33% and up to 43% on unseen objects across five tasks.

Trust Region Q Adjoint Matching

cs.LG · 2026-05-26 · unverdicted · novelty 4.0

TRQAM adds a trust region to QAM by optimizing λ in SOC dynamics to achieve closed-form control of path-space KL, yielding a 68% success rate on 50 OGBench tasks versus 46% for the strongest baseline.

EXPO-FT: Sample-Efficient Reinforcement Learning Finetuning for Vision-Language-Action Models

cs.RO · 2026-05-25 · unverdicted · novelty 4.0

EXPO-FT enables pretrained VLA policies to reach 30/30 success on complex manipulation tasks using an average of 19.1 minutes of online robot data while outperforming prior RL approaches.

OmniVLA-RL: A Vision-Language-Action Model with Spatial Understanding and Online RL

cs.RO · 2026-04-20 · unverdicted · novelty 4.0

OmniVLA-RL uses a mix-of-transformers architecture and flow matching reformulated as an SDE, with group segmented policy optimization, to surpass prior VLA models on LIBERO benchmarks.

citing papers explorer

DreamAvoid: Critical-Phase Test-Time Dreaming to Avoid Failures in VLA Policies

cs.RO · 2026-05-12 · unverdicted · none · ref 25

DreamAvoid uses a Dream Trigger, Action Proposer, and Dream Evaluator trained on success/failure/boundary data to let VLA policies avoid critical-phase failures via test-time future dreaming.
