Demo2Reward optimizes VLM reward model language instructions at test time from a few demonstrations to reduce false positives and enable policy learning in simulated and real robotic tasks without manual reward design.
ICPL: Few-shot in-context preference learning via LLMs. arXiv preprint arXiv:2410.17233.
1 Pith paper cites this work. Polarity classification is still being indexed.
Fields: cs.LG (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
From Demonstrations to Rewards: Test-Time Prompt Optimization for VLM Reward Models