hub Canonical reference

pi rl: Online rl fine-tuning for flow-based vision-language-action mod- els.arXiv preprint arXiv:2510.25889

Chen, K · 2025 · arXiv 2510.25889

Canonical reference. 73% of citing Pith papers cite this work as background.

28 Pith papers citing it

Background 73% of classified citations

read on arXiv browse 28 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 8 baseline 1 method 1 other 1

citation-polarity summary

background 8 baseline 1 unclear 1 use method 1

representative citing papers

Q-VGM: Q-Guided Value-Gradient Matching for Flow-Matching VLA Policies

cs.RO · 2026-06-06 · unverdicted · novelty 7.0

Q-VGM introduces value-gradient matching via VGG-Flow to improve flow-matching VLA policies with a Cal-QL critic, achieving success rate lifts on LIBERO, RoboTwin, and real-robot tasks.

DreamAvoid: Critical-Phase Test-Time Dreaming to Avoid Failures in VLA Policies

cs.RO · 2026-05-12 · unverdicted · novelty 7.0

DreamAvoid uses a Dream Trigger, Action Proposer, and Dream Evaluator trained on success/failure/boundary data to let VLA policies avoid critical-phase failures via test-time future dreaming.

One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

cs.CV · 2026-05-08 · conditional · novelty 7.0 · 3 refs

Reducing visual input to one token per frame in VLA world models maintains or improves long-horizon performance on MetaWorld, LIBERO, and real-robot tasks.

NoiseGate: Learning Per-Latent Timestep Schedules as Information Gating in World Action Models

cs.RO · 2026-05-08 · unverdicted · novelty 7.0

NoiseGate learns per-latent timestep schedules as an information-gating policy in diffusion-based world action models, yielding consistent gains on RoboTwin manipulation tasks.

ScoRe-Flow: Complete Distributional Control via Score-Based Reinforcement Learning for Flow Matching

cs.RO · 2026-04-13 · unverdicted · novelty 7.0

ScoRe-Flow achieves decoupled mean-variance control in stochastic flow matching by deriving a closed-form score for drift modulation plus learned variance, yielding faster RL convergence and higher success rates on locomotion and manipulation benchmarks.

Z-1: Efficient Reinforcement Learning for Vision-Language-Action Models

cs.RO · 2026-06-30 · unverdicted · novelty 6.0

Z-1 uses task-wise GRPO post-training on a flow-based VLA model to reach 80.6% average success across 24 RoboCasa tasks, a 13.2-point gain over its SFT baseline.

Trust Your Instincts: Confidence-Driven Test-Time RL for Vision-Language-Action Models

cs.RO · 2026-06-29 · unverdicted · novelty 6.0

T^2VLA is a test-time reinforcement learning framework for VLAs that uses internal confidence to define intrinsic rewards via similarity to high-confidence expert demonstrations and a dual-expert bootstrapping mechanism.

PolicyTrim: Boosting Intrinsic Policy Efficiency of Vision-Language-Action Models

cs.CV · 2026-06-21 · unverdicted · novelty 6.0

PolicyTrim is an RL post-training framework that boosts VLA policy efficiency by 3x chunk utilization and 51.4% fewer steps, yielding up to 5.83x speedup.

Agentic-VLA: Efficient Online Adaptation for Vision-Language-Action Models

cs.RO · 2026-05-21 · unverdicted · novelty 6.0

Agentic-VLA enables efficient online adaptation of VLA models, delivering +12.3% on long-horizon tasks, +28.5% in 1-shot learning, and 2.4x faster convergence on LIBERO through three new components.

Beyond Action Residuals: Real-World Robot Policy Steering via Bottleneck Latent Reinforcement Learning

cs.RO · 2026-05-19 · unverdicted · novelty 6.0

ZPRL adapts frozen flow-matching imitation policies via RL perturbations on a task-relevant bottleneck latent, yielding 33.7% higher average success on four real-world manipulation tasks than action-residual baselines.

Reinforcing VLAs in Task-Agnostic World Models

cs.AI · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

RAW-Dream disentangles world-model learning from task data by using a pre-trained task-agnostic world model and VLM rewards, with dual-noise filtering, to enable zero-shot VLA adaptation in simulation and real settings.

RePO-VLA: Recovery-Driven Policy Optimization for Vision-Language-Action Models

cs.RO · 2026-05-10 · unverdicted · novelty 6.0

RePO-VLA raises average adversarial success rates in VLA manipulation from 20% to 75% by using recovery-aware initialization, a progress-aware semantic value function, and value-conditioned refinement on success and corrective trajectories.

Learning While Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies

cs.RO · 2026-05-01 · unverdicted · novelty 6.0 · 2 refs

LWD is a fleet-scale offline-to-online RL framework that continually improves pretrained VLA policies using autonomous rollouts and human interventions, reaching 95% average success on real-world manipulation tasks.

LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning

cs.RO · 2026-04-30 · unverdicted · novelty 6.0 · 2 refs

LaST-R1 introduces a RL post-training method called LAPO that optimizes latent Chain-of-Thought reasoning in vision-language-action models, yielding 99.9% success on LIBERO and up to 22.5% real-world gains.

RL Token: Bootstrapping Online RL with Vision-Language-Action Models

cs.LG · 2026-04-24 · unverdicted · novelty 6.0

RL Token enables sample-efficient online RL fine-tuning of large VLAs, delivering up to 3x speed gains and higher success rates on real-robot manipulation tasks within minutes to hours.

MoRI: Mixture of RL and IL Experts for Long-Horizon Manipulation Tasks

cs.RO · 2026-04-11 · unverdicted · novelty 6.0

MoRI dynamically mixes RL and IL experts with variance-based switching and IL regularization to reach 97.5% success in four real-world robotic tasks while cutting human intervention by 85.8%.

RISE: Self-Improving Robot Policy with Compositional World Model

cs.RO · 2026-02-11 · unverdicted · novelty 6.0

RISE combines a controllable dynamics model and progress value model into a closed-loop self-improving pipeline that updates robot policies entirely in imagination, reporting over 35% absolute gains on three real-world tasks.

Towards Long-Lived Robots: Continual Learning VLA Models via Reinforcement Fine-Tuning

cs.RO · 2026-02-11 · unverdicted · novelty 6.0

LifeLong-RFT applies chunking-level on-policy reinforcement learning with Quantized Action Consistency Reward, Continuous Trajectory Alignment Reward, and Format Compliance Reward to fine-tune VLA models, achieving a 22% average success rate gain over supervised fine-tuning on the LIBERO benchmark's

$\pi^{*}_{0.6}$: a VLA That Learns From Experience

cs.LG · 2025-11-18 · unverdicted · novelty 6.0

RECAP enables a generalist VLA to self-improve via advantage-conditioned RL on mixed real-world data, more than doubling throughput and halving failure rates on hard manipulation tasks.

TacCoRL: Integrating Tactile Feedback into VLA via Simulation

cs.RO · 2026-06-10 · unverdicted · novelty 5.0

TacCoRL integrates tactile feedback into VLA policies via real-aligned simulation co-training and RL, raising average success from 50% to 72.5% on four bimanual contact-rich tasks with direct real-robot transfer.

DexPIE: Stable Dexterous Policy Improvement from Real-World Experience

cs.RO · 2026-06-08 · unverdicted · novelty 5.0

DexPIE improves dexterous manipulation success rates by 37% over demo policies via real-world experience collection with adapted intervention, multi-stage DAgger, asynchronous relative-action inference, and optimality conditioning.

Scaling by Diversified Experience for Vision-Language-Action Models

cs.CV · 2026-06-08 · unverdicted · novelty 5.0

SyVLA uses Intention Decoupling and similar-sample guided RL on diversified experiences to improve VLA model task success and out-of-distribution generalization while keeping vision-language abilities.

Grasp-Then-Plan with Failure Attribution: A Closed Two-Stage Framework for Precise and Generalizable Robotic Manipulation

cs.RO · 2026-06-02 · unverdicted · novelty 5.0

GTP-FA is a grasp-then-plan framework with failure attribution that diagnoses errors to optimize grasping priors and planning data collection, raising success rates across RL, IL, diffusion, and VLA methods in simulation and real robots.

Physics-informed Goal-Conditioned Reinforcement Learning under Hybrid Contact Dynamics

cs.RO · 2026-05-28 · unverdicted · novelty 5.0

Analysis reveals Pi-GCRL degradation in contact-rich tasks due to hybrid dynamics; contact-aware and hierarchical formulations are proposed to extend it to manipulation.

citing papers explorer

Showing 27 of 27 citing papers after filters.

Q-VGM: Q-Guided Value-Gradient Matching for Flow-Matching VLA Policies cs.RO · 2026-06-06 · unverdicted · none · ref 9
Q-VGM introduces value-gradient matching via VGG-Flow to improve flow-matching VLA policies with a Cal-QL critic, achieving success rate lifts on LIBERO, RoboTwin, and real-robot tasks.
DreamAvoid: Critical-Phase Test-Time Dreaming to Avoid Failures in VLA Policies cs.RO · 2026-05-12 · unverdicted · none · ref 25
DreamAvoid uses a Dream Trigger, Action Proposer, and Dream Evaluator trained on success/failure/boundary data to let VLA policies avoid critical-phase failures via test-time future dreaming.
NoiseGate: Learning Per-Latent Timestep Schedules as Information Gating in World Action Models cs.RO · 2026-05-08 · unverdicted · none · ref 39
NoiseGate learns per-latent timestep schedules as an information-gating policy in diffusion-based world action models, yielding consistent gains on RoboTwin manipulation tasks.
ScoRe-Flow: Complete Distributional Control via Score-Based Reinforcement Learning for Flow Matching cs.RO · 2026-04-13 · unverdicted · none · ref 5
ScoRe-Flow achieves decoupled mean-variance control in stochastic flow matching by deriving a closed-form score for drift modulation plus learned variance, yielding faster RL convergence and higher success rates on locomotion and manipulation benchmarks.
Z-1: Efficient Reinforcement Learning for Vision-Language-Action Models cs.RO · 2026-06-30 · unverdicted · none · ref 17
Z-1 uses task-wise GRPO post-training on a flow-based VLA model to reach 80.6% average success across 24 RoboCasa tasks, a 13.2-point gain over its SFT baseline.
Trust Your Instincts: Confidence-Driven Test-Time RL for Vision-Language-Action Models cs.RO · 2026-06-29 · unverdicted · none · ref 8
T^2VLA is a test-time reinforcement learning framework for VLAs that uses internal confidence to define intrinsic rewards via similarity to high-confidence expert demonstrations and a dual-expert bootstrapping mechanism.
PolicyTrim: Boosting Intrinsic Policy Efficiency of Vision-Language-Action Models cs.CV · 2026-06-21 · unverdicted · none · ref 7
PolicyTrim is an RL post-training framework that boosts VLA policy efficiency by 3x chunk utilization and 51.4% fewer steps, yielding up to 5.83x speedup.
Agentic-VLA: Efficient Online Adaptation for Vision-Language-Action Models cs.RO · 2026-05-21 · unverdicted · none · ref 4
Agentic-VLA enables efficient online adaptation of VLA models, delivering +12.3% on long-horizon tasks, +28.5% in 1-shot learning, and 2.4x faster convergence on LIBERO through three new components.
Beyond Action Residuals: Real-World Robot Policy Steering via Bottleneck Latent Reinforcement Learning cs.RO · 2026-05-19 · unverdicted · none · ref 17
ZPRL adapts frozen flow-matching imitation policies via RL perturbations on a task-relevant bottleneck latent, yielding 33.7% higher average success on four real-world manipulation tasks than action-residual baselines.
Reinforcing VLAs in Task-Agnostic World Models cs.AI · 2026-05-12 · unverdicted · none · ref 5 · 2 links
RAW-Dream disentangles world-model learning from task data by using a pre-trained task-agnostic world model and VLM rewards, with dual-noise filtering, to enable zero-shot VLA adaptation in simulation and real settings.
RePO-VLA: Recovery-Driven Policy Optimization for Vision-Language-Action Models cs.RO · 2026-05-10 · unverdicted · none · ref 4
RePO-VLA raises average adversarial success rates in VLA manipulation from 20% to 75% by using recovery-aware initialization, a progress-aware semantic value function, and value-conditioned refinement on success and corrective trajectories.
Learning While Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies cs.RO · 2026-05-01 · unverdicted · none · ref 20 · 2 links
LWD is a fleet-scale offline-to-online RL framework that continually improves pretrained VLA policies using autonomous rollouts and human interventions, reaching 95% average success on real-world manipulation tasks.
LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning cs.RO · 2026-04-30 · unverdicted · none · ref 23 · 2 links
LaST-R1 introduces a RL post-training method called LAPO that optimizes latent Chain-of-Thought reasoning in vision-language-action models, yielding 99.9% success on LIBERO and up to 22.5% real-world gains.
RL Token: Bootstrapping Online RL with Vision-Language-Action Models cs.LG · 2026-04-24 · unverdicted · none · ref 29
RL Token enables sample-efficient online RL fine-tuning of large VLAs, delivering up to 3x speed gains and higher success rates on real-robot manipulation tasks within minutes to hours.
MoRI: Mixture of RL and IL Experts for Long-Horizon Manipulation Tasks cs.RO · 2026-04-11 · unverdicted · none · ref 23
MoRI dynamically mixes RL and IL experts with variance-based switching and IL regularization to reach 97.5% success in four real-world robotic tasks while cutting human intervention by 85.8%.
RISE: Self-Improving Robot Policy with Compositional World Model cs.RO · 2026-02-11 · unverdicted · none · ref 14
RISE combines a controllable dynamics model and progress value model into a closed-loop self-improving pipeline that updates robot policies entirely in imagination, reporting over 35% absolute gains on three real-world tasks.
Towards Long-Lived Robots: Continual Learning VLA Models via Reinforcement Fine-Tuning cs.RO · 2026-02-11 · unverdicted · none · ref 11
LifeLong-RFT applies chunking-level on-policy reinforcement learning with Quantized Action Consistency Reward, Continuous Trajectory Alignment Reward, and Format Compliance Reward to fine-tune VLA models, achieving a 22% average success rate gain over supervised fine-tuning on the LIBERO benchmark's
$\pi^{*}_{0.6}$: a VLA That Learns From Experience cs.LG · 2025-11-18 · unverdicted · none · ref 34
RECAP enables a generalist VLA to self-improve via advantage-conditioned RL on mixed real-world data, more than doubling throughput and halving failure rates on hard manipulation tasks.
TacCoRL: Integrating Tactile Feedback into VLA via Simulation cs.RO · 2026-06-10 · unverdicted · none · ref 63
TacCoRL integrates tactile feedback into VLA policies via real-aligned simulation co-training and RL, raising average success from 50% to 72.5% on four bimanual contact-rich tasks with direct real-robot transfer.
DexPIE: Stable Dexterous Policy Improvement from Real-World Experience cs.RO · 2026-06-08 · unverdicted · none · ref 36
DexPIE improves dexterous manipulation success rates by 37% over demo policies via real-world experience collection with adapted intervention, multi-stage DAgger, asynchronous relative-action inference, and optimality conditioning.
Scaling by Diversified Experience for Vision-Language-Action Models cs.CV · 2026-06-08 · unverdicted · none · ref 6
SyVLA uses Intention Decoupling and similar-sample guided RL on diversified experiences to improve VLA model task success and out-of-distribution generalization while keeping vision-language abilities.
Grasp-Then-Plan with Failure Attribution: A Closed Two-Stage Framework for Precise and Generalizable Robotic Manipulation cs.RO · 2026-06-02 · unverdicted · none · ref 56
GTP-FA is a grasp-then-plan framework with failure attribution that diagnoses errors to optimize grasping priors and planning data collection, raising success rates across RL, IL, diffusion, and VLA methods in simulation and real robots.
Physics-informed Goal-Conditioned Reinforcement Learning under Hybrid Contact Dynamics cs.RO · 2026-05-28 · unverdicted · none · ref 5
Analysis reveals Pi-GCRL degradation in contact-rich tasks due to hybrid dynamics; contact-aware and hierarchical formulations are proposed to extend it to manipulation.
BORA: Bridging Offline Reinforcement Learning and Online Residual Adaptation for Real-World Dexterous VLA Models cs.RO · 2026-05-28 · unverdicted · none · ref 23
BORA combines offline RL critic training with online chunk-wise residual adaptation to raise average success rates of real-world dexterous VLA policies by 33% and up to 43% on unseen objects across five tasks.
Trust Region Q Adjoint Matching cs.LG · 2026-05-26 · unverdicted · none · ref 6
TRQAM adds a trust region to QAM by optimizing λ in SOC dynamics to achieve closed-form control of path-space KL, yielding 68% success rate on 50 OGBench tasks versus 46% for the strongest baseline.
EXPO-FT: Sample-Efficient Reinforcement Learning Finetuning for Vision-Language-Action Models cs.RO · 2026-05-25 · unverdicted · none · ref 11
EXPO-FT enables pretrained VLA policies to reach 30/30 success on complex manipulation tasks using an average of 19.1 minutes of online robot data while outperforming prior RL approaches.
OmniVLA-RL: A Vision-Language-Action Model with Spatial Understanding and Online RL cs.RO · 2026-04-20 · unverdicted · none · ref 11
OmniVLA-RL uses a mix-of-transformers architecture and flow-matching reformulated as SDE with group segmented policy optimization to surpass prior VLA models on LIBERO benchmarks.

pi rl: Online rl fine-tuning for flow-based vision-language-action mod- els.arXiv preprint arXiv:2510.25889

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer