OSCAR exploits the generative-discriminative gap in LVLMs to build online preference data with MCTS and dual-granularity rewards for DPO-based calibration, claiming SOTA hallucination reduction and improved multimodal performance.
Direct preference optimization: Your lan- guage model is secretly a reward model.Advances in neural information processing systems, 36:53728–53741
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
fields
cs.CV 2years
2026 2representative citing papers
citing papers explorer
-
Online Self-Calibration Against Hallucination in Vision-Language Models
OSCAR exploits the generative-discriminative gap in LVLMs to build online preference data with MCTS and dual-granularity rewards for DPO-based calibration, claiming SOTA hallucination reduction and improved multimodal performance.
- GRPO-TTA: Test-Time Visual Tuning for Vision-Language Models via GRPO-Driven Reinforcement Learning