RL-VLM-F: Reinforcement learning from vision language foundation model feedback

Yufei Wang, Zhanyi Sun, Jesse Zhang, Zhou Xian, Erdem Biyik, David Held, Zackory Erickson · 2024 · arXiv 2402.03681

13 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

QVal: Cheaply Evaluating Dense Supervision Signals for Long-Horizon LLM Agents

cs.LG · 2026-06-30 · unverdicted · novelty 7.0

QVal is a new evaluation framework that directly measures dense supervision quality via Q-alignment to a reference policy, showing simple prompting baselines outperform 21 other methods across environments and models.

World Model Self-Distillation: Training World Models to Solve General Tasks

cs.CV · 2026-06-10 · unverdicted · novelty 7.0

Self-distillation from a caption-conditioned video diffusion model to an image-and-prompt-conditioned executor, enhanced by RL from VLM feedback, enables task solving in world models.

CV-Arena: An Open Benchmark for Instructional Computer Vision Problem Solving with Human-AI Collaborative Preferences

cs.CV · 2026-05-30 · unverdicted · novelty 7.0

CV-Arena is a new 12K-pair benchmark for instruction-guided real-image editing with 16 task types, CogRetriever curation, and Active Elo mixed human-AI evaluation that finds gaps in 21 models and presents CV-Agent.

Reasoning Portability: Guiding Continual Learning for MLLMs in the RLVR Era

cs.LG · 2026-05-17 · unverdicted · novelty 7.0

Formalizes Reasoning Portability (RP) and proposes RDB-CL to modulate per-sample KL regularization in RLVR for MLLM continual learning, achieving +12.0% Last accuracy over vanilla RLVR baseline by preserving reusable reasoning on high-RP samples.

Freeform Preference Learning for Robotic Manipulation

cs.RO · 2026-06-30 · unverdicted · novelty 6.0

Freeform Preference Learning trains language-conditioned multi-axis reward models from human pairwise preferences to produce steerable and compositional robot policies that outperform sparse and binary-preference baselines by 38 percentage points.

Learning Process Rewards via Success Visitation Matching for Efficient RL

cs.LG · 2026-06-22 · unverdicted · novelty 6.0

Success Visitation Matching uses a discriminator to turn sparse outcome rewards into dense process rewards by matching visitations of successful episodes, provably preserving the optimal policy and speeding up robotic RL finetuning.

Beyond Pixels: Learning Invariant Rewards for Real-World Robotics From a Few Demonstrations

cs.RO · 2026-05-21 · unverdicted · novelty 6.0

A framework learns invariant symbolic reward functions from few demonstrations that generalize zero-shot to variations in robotic manipulation tasks.

MAGNIFIED: RL Fine-tuning of Multimodal Large Language Models for Motion Planning

cs.RO · 2026-06-02 · unverdicted · novelty 5.0

MAGNIFIED applies RL fine-tuning to MLLMs for autonomous driving motion planning, yielding over 10.5% lower overlap rate and 38.9% lower off-road rate than SFT baseline on Waymo Open Motion Dataset.

AtlasVA: Self-Evolving Visual Skill Memory for Teacher-Free VLM Agents

cs.CV · 2026-05-18 · unverdicted · novelty 5.0

AtlasVA organizes VLM agent memory into spatial heatmaps, visual exemplars, and symbolic skills, evolving atlases from trajectories to act as potential-based shaping rewards in teacher-free reinforcement learning.

Reflection-Based Task Adaptation for Self-Improving VLA

cs.RO · 2025-10-14 · unverdicted · novelty 5.0

Reflective Self-Adaptation combines failure-reflective reinforcement learning with success-guided imitation learning to enable faster and more reliable task adaptation for pre-trained Vision-Language-Action models.

UAV-VL-R1: Generalizing Vision-Language Models via Supervised Fine-Tuning and Multi-Stage GRPO for UAV Visual Reasoning

cs.CV · 2025-08-15 · unverdicted · novelty 5.0

UAV-VL-R1 combines SFT and multi-stage GRPO reinforcement learning on a new 50,019-sample HRVQA-VL dataset to deliver substantially higher zero-shot accuracy on UAV visual reasoning tasks than both its 2B baseline and a 72B-scale model.

Digitizing Coaching Intelligence: An Agentic Framework for Holistic Athlete Profiling using VLM and RAG

cs.CV · 2026-06-26 · unverdicted · novelty 3.0

Presents a hybrid agentic framework using MediaPipe, Llama-4-scout VLM, LangGraph orchestration, and RAG for holistic athlete profiling aligned with SAI protocols.

Meta-Learning and Meta-Reinforcement Learning -- Tracing the Path towards DeepMind's Adaptive Agent

cs.AI · 2026-02-23 · unverdicted · novelty 2.0

A survey provides a task-based formalization of meta-learning and meta-RL while chronicling algorithms that lead to DeepMind's Adaptive Agent.

citing papers explorer

QVal: Cheaply Evaluating Dense Supervision Signals for Long-Horizon LLM Agents · cs.LG · 2026-06-30 · unverdicted · none · ref 25
QVal is a new evaluation framework that directly measures dense supervision quality via Q-alignment to a reference policy, showing simple prompting baselines outperform 21 other methods across environments and models.
World Model Self-Distillation: Training World Models to Solve General Tasks · cs.CV · 2026-06-10 · unverdicted · none · ref 65
Self-distillation from a caption-conditioned video diffusion model to an image-and-prompt-conditioned executor, enhanced by RL from VLM feedback, enables task solving in world models.
CV-Arena: An Open Benchmark for Instructional Computer Vision Problem Solving with Human-AI Collaborative Preferences · cs.CV · 2026-05-30 · unverdicted · none · ref 71
CV-Arena is a new 12K-pair benchmark for instruction-guided real-image editing with 16 task types, CogRetriever curation, and Active Elo mixed human-AI evaluation that finds gaps in 21 models and presents CV-Agent.
Reasoning Portability: Guiding Continual Learning for MLLMs in the RLVR Era · cs.LG · 2026-05-17 · unverdicted · none · ref 52
Formalizes Reasoning Portability (RP) and proposes RDB-CL to modulate per-sample KL regularization in RLVR for MLLM continual learning, achieving +12.0% Last accuracy over vanilla RLVR baseline by preserving reusable reasoning on high-RP samples.
Freeform Preference Learning for Robotic Manipulation · cs.RO · 2026-06-30 · unverdicted · none · ref 42
Freeform Preference Learning trains language-conditioned multi-axis reward models from human pairwise preferences to produce steerable and compositional robot policies that outperform sparse and binary-preference baselines by 38 percentage points.
Learning Process Rewards via Success Visitation Matching for Efficient RL · cs.LG · 2026-06-22 · unverdicted · none · ref 89
Success Visitation Matching uses a discriminator to turn sparse outcome rewards into dense process rewards by matching visitations of successful episodes, provably preserving the optimal policy and speeding up robotic RL finetuning.
Beyond Pixels: Learning Invariant Rewards for Real-World Robotics From a Few Demonstrations · cs.RO · 2026-05-21 · unverdicted · none · ref 25
A framework learns invariant symbolic reward functions from few demonstrations that generalize zero-shot to variations in robotic manipulation tasks.
MAGNIFIED: RL Fine-tuning of Multimodal Large Language Models for Motion Planning · cs.RO · 2026-06-02 · unverdicted · none · ref 29
MAGNIFIED applies RL fine-tuning to MLLMs for autonomous driving motion planning, yielding over 10.5% lower overlap rate and 38.9% lower off-road rate than SFT baseline on Waymo Open Motion Dataset.
AtlasVA: Self-Evolving Visual Skill Memory for Teacher-Free VLM Agents · cs.CV · 2026-05-18 · unverdicted · none · ref 43
AtlasVA organizes VLM agent memory into spatial heatmaps, visual exemplars, and symbolic skills, evolving atlases from trajectories to act as potential-based shaping rewards in teacher-free reinforcement learning.
Reflection-Based Task Adaptation for Self-Improving VLA · cs.RO · 2025-10-14 · unverdicted · none · ref 39
Reflective Self-Adaptation combines failure-reflective reinforcement learning with success-guided imitation learning to enable faster and more reliable task adaptation for pre-trained Vision-Language-Action models.
UAV-VL-R1: Generalizing Vision-Language Models via Supervised Fine-Tuning and Multi-Stage GRPO for UAV Visual Reasoning · cs.CV · 2025-08-15 · unverdicted · none · ref 5
UAV-VL-R1 combines SFT and multi-stage GRPO reinforcement learning on a new 50,019-sample HRVQA-VL dataset to deliver substantially higher zero-shot accuracy on UAV visual reasoning tasks than both its 2B baseline and a 72B-scale model.
Digitizing Coaching Intelligence: An Agentic Framework for Holistic Athlete Profiling using VLM and RAG · cs.CV · 2026-06-26 · unverdicted · none · ref 19
Presents a hybrid agentic framework using MediaPipe, Llama-4-scout VLM, LangGraph orchestration, and RAG for holistic athlete profiling aligned with SAI protocols.
Meta-Learning and Meta-Reinforcement Learning -- Tracing the Path towards DeepMind's Adaptive Agent · cs.AI · 2026-02-23 · unverdicted · none · ref 182
A survey provides a task-based formalization of meta-learning and meta-RL while chronicling algorithms that lead to DeepMind's Adaptive Agent.
