VLA-OPD: Bridging Offline SFT and Online RL for Vision-Language-Action Models via On-Policy Distillation

VLA-OPD: Bridging Offline SFT and Online RL for Vision-Language-Action Models via On-Policy Distillation · 2026 · arXiv 2603.26666

3 Pith papers cite this work. Polarity classification is still indexing.


representative citing papers

UniSD: Towards a Unified Self-Distillation Framework for Large Language Models

cs.CL · 2026-05-07 · unverdicted · novelty 6.0 · 2 refs

UniSD unifies self-distillation components for autoregressive LLMs; its fully integrated version improves on base models by 5.4 points and on baselines by 2.8 points across six benchmarks.

Filter, Then Reweight: Rethinking Optimization Granularity in On-Policy Distillation

cs.LG · 2026-06-01 · unverdicted · novelty 5.0

FiRe-OPD introduces a two-stage filter-then-soft-reweight procedure for trajectory- and token-level supervision in on-policy distillation, claiming gains over prior token-level methods.

Coherent Off-Policy Improvement of Large Behavior Models with Learned Rewards

cs.LG · 2026-06-01 · unverdicted · novelty 4.0

Coherent IRL learns dense rewards from demonstrations to enable sample-efficient off-policy improvement of large behavior-cloned policies on sparse-reward robotic manipulation tasks.

citing papers explorer


UniSD: Towards a Unified Self-Distillation Framework for Large Language Models
cs.CL · 2026-05-07 · unverdicted · none · ref 43 · 2 links
Filter, Then Reweight: Rethinking Optimization Granularity in On-Policy Distillation
cs.LG · 2026-06-01 · unverdicted · none · ref 46
Coherent Off-Policy Improvement of Large Behavior Models with Learned Rewards
cs.LG · 2026-06-01 · unverdicted · none · ref 18
