Grape: Generalizing robot policy via preference alignment

Zijian Zhang, Kaiyuan Zheng, Zhaorun Chen, Joel Jang, Yi Li, Siwei Han, Chaoqi Wang, Mingyu Ding, Dieter Fox, Huaxiu Yao · 2024 · arXiv 2411.19309

Canonical reference. 86% of classified citations from Pith papers cite this work as background.

28 Pith papers citing it

Background: 86% of classified citations

citation-role summary

background: 6 · baseline: 1

citation-polarity summary

background: 6 · baseline: 1

representative citing papers

Foresight: Iterative Reasoning About Clues that Matter for Navigation

cs.RO · 2026-06-10 · unverdicted · novelty 7.0

Foresight uses iterative VLM plan proposal and critique with RL from human feedback to raise navigation success by 37% and cut interventions by 52% in real-world tests.

From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation

cs.RO · 2026-05-12 · unverdicted · novelty 7.0

MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.

Freeform Preference Learning for Robotic Manipulation

cs.RO · 2026-06-30 · unverdicted · novelty 6.0

Freeform Preference Learning trains language-conditioned multi-axis reward models from human pairwise preferences to produce steerable and compositional robot policies that outperform sparse and binary-preference baselines by 38 percentage points.

Rethinking Foundation Model Collaboration: Enhancing Specialized Models through Proxy Task Reasoning

cs.CV · 2026-06-30 · unverdicted · novelty 6.0

FAT decomposes structured prediction into specialist hypothesis generation and foundation-model proxy reasoning, yielding consistent gains over baselines on detection, trajectory, and segmentation tasks.

Trust Your Instincts: Confidence-Driven Test-Time RL for Vision-Language-Action Models

cs.RO · 2026-06-29 · unverdicted · novelty 6.0

T^2VLA is a test-time reinforcement learning framework for VLAs that uses internal confidence to define intrinsic rewards via similarity to high-confidence expert demonstrations and a dual-expert bootstrapping mechanism.

Learning Process Rewards via Success Visitation Matching for Efficient RL

cs.LG · 2026-06-22 · unverdicted · novelty 6.0

Success Visitation Matching uses a discriminator to turn sparse outcome rewards into dense process rewards by matching visitations of successful episodes, provably preserving the optimal policy and speeding up robotic RL finetuning.

SafeDojo: Safe Reinforcement Learning for VLA via Interactive World Model

cs.RO · 2026-06-15 · unverdicted · novelty 6.0 · 2 refs

SafeDojo is a new world model-based safe RL framework for VLA that outperforms baselines on SafeLIBERO and real robot tasks.

UniIntervene: Agentic Intervention for Efficient Real-World Reinforcement Learning

cs.RO · 2026-06-10 · unverdicted · novelty 6.0

UniIntervene uses future-conditioned action-value estimation and a temporal value-risk critic to trigger memory-based recovery interventions, reporting 8.6% higher success rates and 57% fewer human interventions than prior HiL-RL methods on real manipulation tasks.

FlowPRO: Reward-Free Reinforced Fine-Tuning of Flow-Matching VLAs via Proximalized Preference Optimization

cs.RO · 2026-06-03 · unverdicted · novelty 6.0

FlowPRO applies proximalized preference optimization to flow-matching VLAs with intervention-rollback data to reach higher success rates on long-horizon bimanual tasks without rewards or critics.

RePO-VLA: Recovery-Driven Policy Optimization for Vision-Language-Action Models

cs.RO · 2026-05-10 · unverdicted · novelty 6.0

RePO-VLA raises average adversarial success rates in VLA manipulation from 20% to 75% by using recovery-aware initialization, a progress-aware semantic value function, and value-conditioned refinement on success and corrective trajectories.

Learning While Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies

cs.RO · 2026-05-01 · unverdicted · novelty 6.0 · 2 refs

LWD is a fleet-scale offline-to-online RL framework that continually improves pretrained VLA policies using autonomous rollouts and human interventions, reaching 95% average success on real-world manipulation tasks.

LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning

cs.RO · 2026-04-30 · unverdicted · novelty 6.0 · 2 refs

LaST-R1 introduces an RL post-training method called LAPO that optimizes latent Chain-of-Thought reasoning in vision-language-action models, yielding 99.9% success on LIBERO and up to 22.5% real-world gains.

TwinRL: Digital Twin-Driven Reinforcement Learning for Real-World Robotic Manipulation

cs.RO · 2026-02-09 · unverdicted · novelty 6.0

TwinRL expands RL exploration via digital twin reconstruction and twin RL warm-up to guide real-world learning, reaching near-100% success with 20 minutes of on-robot time across four tasks.

$\pi^{*}_{0.6}$: a VLA That Learns From Experience

cs.LG · 2025-11-18 · unverdicted · novelty 6.0

RECAP enables a generalist VLA to self-improve via advantage-conditioned RL on mixed real-world data, more than doubling throughput and halving failure rates on hard manipulation tasks.

DeepThinkVLA: Enhancing Reasoning Capability of Vision-Language-Action Models

cs.LG · 2025-10-31 · unverdicted · novelty 6.0

DeepThinkVLA shows CoT improves VLA models only under decoding and causal alignment, delivering 97% success on LIBERO and 21.7-point gains via hybrid attention and SFT-RL training.

SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning

cs.RO · 2025-09-11 · conditional · novelty 6.0

SimpleVLA-RL applies tailored reinforcement learning to VLA models, reaching SoTA on LIBERO, outperforming π₀ on RoboTwin, and surpassing SFT in real-world tasks while reducing data needs and identifying a 'pushcut' phenomenon.

VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning

cs.RO · 2025-05-24 · conditional · novelty 6.0

VLA-RL applies online RL to pretrained VLAs, yielding a 4.5% gain over strong baselines on 40 LIBERO manipulation tasks and matching commercial models like π₀-FAST.

UniVLA: Learning to Act Anywhere with Task-centric Latent Actions

cs.RO · 2025-05-09 · unverdicted · novelty 6.0

UniVLA trains cross-embodiment vision-language-action policies from unlabeled videos via a latent action model in DINO space, beating OpenVLA on benchmarks with 1/20th pretraining compute and 1/10th downstream data.

AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems

cs.RO · 2025-03-09 · unverdicted · novelty 6.0

AgiBot World supplies over 1 million trajectories enabling GO-1 to deliver 30% average gains over Open X-Embodiment and over 60% success on complex dexterous tasks while open-sourcing everything.

DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control

cs.RO · 2025-02-09 · unverdicted · novelty 6.0

DexVLA combines a scaled diffusion action expert with embodiment curriculum learning to achieve better generalization and performance than prior VLA models on diverse robot hardware and long-horizon tasks.

PAPO-VLA: Planning-Aware Policy Optimization for Vision-Language-Action Models

cs.RO · 2026-05-19 · unverdicted · novelty 5.0

PAPO-VLA identifies planning actions via variation and outcome, estimates their causal importance, and folds that importance into GRPO to emphasize key decisions while still using full-trajectory feedback.

DyGRO-VLA: Cross-Task Scaling of Vision-Language-Action Models via Dynamic Grouped Residual Optimization

cs.RO · 2026-05-17 · unverdicted · novelty 5.0

DyGRO-VLA is a two-stage optimization framework for cross-task scaling of Vision-Language-Action models via dynamic grouped residual optimization in RL.

ProcVLM: Learning Procedure-Grounded Progress Rewards for Robotic Manipulation

cs.RO · 2026-05-09 · unverdicted · novelty 5.0

ProcVLM learns procedure-grounded dense progress rewards for robotic manipulation via a reasoning-before-estimation VLM trained on a 60M-frame synthesized corpus from 30 embodied datasets.

RESample: A Robust Data Augmentation Framework via Exploratory Sampling for Robotic Manipulation

cs.RO · 2025-10-20 · unverdicted · novelty 5.0

RESample uses exploratory sampling guided by a lightweight Coverage Function to expand VLA training data coverage, yielding 12% performance gains on LIBERO and real-world tasks with 10-20% added samples.

citing papers explorer

Showing 26 of 26 citing papers after filters.

Foresight: Iterative Reasoning About Clues that Matter for Navigation
cs.RO · 2026-06-10 · unverdicted · none · ref 21
Foresight uses iterative VLM plan proposal and critique with RL from human feedback to raise navigation success by 37% and cut interventions by 52% in real-world tests.

From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation
cs.RO · 2026-05-12 · unverdicted · none · ref 55
MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.

Freeform Preference Learning for Robotic Manipulation
cs.RO · 2026-06-30 · unverdicted · none · ref 8
Freeform Preference Learning trains language-conditioned multi-axis reward models from human pairwise preferences to produce steerable and compositional robot policies that outperform sparse and binary-preference baselines by 38 percentage points.

Rethinking Foundation Model Collaboration: Enhancing Specialized Models through Proxy Task Reasoning
cs.CV · 2026-06-30 · unverdicted · none · ref 7
FAT decomposes structured prediction into specialist hypothesis generation and foundation-model proxy reasoning, yielding consistent gains over baselines on detection, trajectory, and segmentation tasks.

Trust Your Instincts: Confidence-Driven Test-Time RL for Vision-Language-Action Models
cs.RO · 2026-06-29 · unverdicted · none · ref 47
T^2VLA is a test-time reinforcement learning framework for VLAs that uses internal confidence to define intrinsic rewards via similarity to high-confidence expert demonstrations and a dual-expert bootstrapping mechanism.

Learning Process Rewards via Success Visitation Matching for Efficient RL
cs.LG · 2026-06-22 · unverdicted · none · ref 97
Success Visitation Matching uses a discriminator to turn sparse outcome rewards into dense process rewards by matching visitations of successful episodes, provably preserving the optimal policy and speeding up robotic RL finetuning.

SafeDojo: Safe Reinforcement Learning for VLA via Interactive World Model
cs.RO · 2026-06-15 · unverdicted · none · ref 24 · 2 links
SafeDojo is a new world model-based safe RL framework for VLA that outperforms baselines on SafeLIBERO and real robot tasks.

UniIntervene: Agentic Intervention for Efficient Real-World Reinforcement Learning
cs.RO · 2026-06-10 · unverdicted · none · ref 48
UniIntervene uses future-conditioned action-value estimation and a temporal value-risk critic to trigger memory-based recovery interventions, reporting 8.6% higher success rates and 57% fewer human interventions than prior HiL-RL methods on real manipulation tasks.

FlowPRO: Reward-Free Reinforced Fine-Tuning of Flow-Matching VLAs via Proximalized Preference Optimization
cs.RO · 2026-06-03 · unverdicted · none · ref 20
FlowPRO applies proximalized preference optimization to flow-matching VLAs with intervention-rollback data to reach higher success rates on long-horizon bimanual tasks without rewards or critics.

RePO-VLA: Recovery-Driven Policy Optimization for Vision-Language-Action Models
cs.RO · 2026-05-10 · unverdicted · none · ref 19
RePO-VLA raises average adversarial success rates in VLA manipulation from 20% to 75% by using recovery-aware initialization, a progress-aware semantic value function, and value-conditioned refinement on success and corrective trajectories.

Learning While Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies
cs.RO · 2026-05-01 · unverdicted · none · ref 27 · 2 links
LWD is a fleet-scale offline-to-online RL framework that continually improves pretrained VLA policies using autonomous rollouts and human interventions, reaching 95% average success on real-world manipulation tasks.

LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning
cs.RO · 2026-04-30 · unverdicted · none · ref 38 · 2 links
LaST-R1 introduces an RL post-training method called LAPO that optimizes latent Chain-of-Thought reasoning in vision-language-action models, yielding 99.9% success on LIBERO and up to 22.5% real-world gains.

TwinRL: Digital Twin-Driven Reinforcement Learning for Real-World Robotic Manipulation
cs.RO · 2026-02-09 · unverdicted · none · ref 68
TwinRL expands RL exploration via digital twin reconstruction and twin RL warm-up to guide real-world learning, reaching near-100% success with 20 minutes of on-robot time across four tasks.

$\pi^{*}_{0.6}$: a VLA That Learns From Experience
cs.LG · 2025-11-18 · unverdicted · none · ref 45
RECAP enables a generalist VLA to self-improve via advantage-conditioned RL on mixed real-world data, more than doubling throughput and halving failure rates on hard manipulation tasks.

DeepThinkVLA: Enhancing Reasoning Capability of Vision-Language-Action Models
cs.LG · 2025-10-31 · unverdicted · none · ref 47
DeepThinkVLA shows CoT improves VLA models only under decoding and causal alignment, delivering 97% success on LIBERO and 21.7-point gains via hybrid attention and SFT-RL training.

UniVLA: Learning to Act Anywhere with Task-centric Latent Actions
cs.RO · 2025-05-09 · unverdicted · none · ref 92
UniVLA trains cross-embodiment vision-language-action policies from unlabeled videos via a latent action model in DINO space, beating OpenVLA on benchmarks with 1/20th pretraining compute and 1/10th downstream data.

AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems
cs.RO · 2025-03-09 · unverdicted · none · ref 28
AgiBot World supplies over 1 million trajectories enabling GO-1 to deliver 30% average gains over Open X-Embodiment and over 60% success on complex dexterous tasks while open-sourcing everything.

DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control
cs.RO · 2025-02-09 · unverdicted · none · ref 44
DexVLA combines a scaled diffusion action expert with embodiment curriculum learning to achieve better generalization and performance than prior VLA models on diverse robot hardware and long-horizon tasks.

PAPO-VLA: Planning-Aware Policy Optimization for Vision-Language-Action Models
cs.RO · 2026-05-19 · unverdicted · none · ref 29
PAPO-VLA identifies planning actions via variation and outcome, estimates their causal importance, and folds that importance into GRPO to emphasize key decisions while still using full-trajectory feedback.

DyGRO-VLA: Cross-Task Scaling of Vision-Language-Action Models via Dynamic Grouped Residual Optimization
cs.RO · 2026-05-17 · unverdicted · none · ref 191
DyGRO-VLA is a two-stage optimization framework for cross-task scaling of Vision-Language-Action models via dynamic grouped residual optimization in RL.

ProcVLM: Learning Procedure-Grounded Progress Rewards for Robotic Manipulation
cs.RO · 2026-05-09 · unverdicted · none · ref 74
ProcVLM learns procedure-grounded dense progress rewards for robotic manipulation via a reasoning-before-estimation VLM trained on a 60M-frame synthesized corpus from 30 embodied datasets.

RESample: A Robust Data Augmentation Framework via Exploratory Sampling for Robotic Manipulation
cs.RO · 2025-10-20 · unverdicted · none · ref 27
RESample uses exploratory sampling guided by a lightweight Coverage Function to expand VLA training data coverage, yielding 12% performance gains on LIBERO and real-world tasks with 10-20% added samples.

Reflection-Based Task Adaptation for Self-Improving VLA
cs.RO · 2025-10-14 · unverdicted · none · ref 33
Reflective Self-Adaptation combines failure-reflective reinforcement learning with success-guided imitation learning to enable faster and more reliable task adaptation for pre-trained Vision-Language-Action models.

Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey
cs.RO · 2025-08-18 · unverdicted · none · ref 177
This survey organizes large VLM-based VLA models for robotic manipulation into monolithic and hierarchical paradigms, reviews their integrations and datasets, and outlines future directions.

SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Constrained Learning
cs.RO · 2025-03-05 · unverdicted · none · ref 29
SafeVLA applies constrained reinforcement learning via CMDP min-max optimization to VLAs, cutting safety violation costs by 83.58% while preserving task success on long-horizon mobile manipulation tasks.

Position: Good Embodied Reward Models Need Bad Behavior Data
cs.RO · 2026-05-31 · unverdicted · none · ref 34
Embodied reward models systematically over-reward unsafe, suboptimal, and shortcut robot behaviors due to training on successful data only, and modest inclusion of bad behavior data improves alignment with human preferences.
