OSCAR exploits the generative-discriminative gap in LVLMs to build online preference data with MCTS and dual-granularity rewards for DPO-based calibration, claiming SOTA hallucination reduction and improved multimodal performance.
Direct preference optimization: Your lan- guage model is secretly a reward model.Advances in neural information processing systems, 36:53728–53741
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
fields
cs.CV 2years
2026 2verdicts
UNVERDICTED 2representative citing papers
GRPO-TTA reformulates test-time adaptation for vision-language models as group-wise policy optimization via top-K sampling from CLIP distributions and alignment/dispersion rewards to tune the visual encoder.
citing papers explorer
-
Online Self-Calibration Against Hallucination in Vision-Language Models
OSCAR exploits the generative-discriminative gap in LVLMs to build online preference data with MCTS and dual-granularity rewards for DPO-based calibration, claiming SOTA hallucination reduction and improved multimodal performance.
-
GRPO-TTA: Test-Time Visual Tuning for Vision-Language Models via GRPO-Driven Reinforcement Learning
GRPO-TTA reformulates test-time adaptation for vision-language models as group-wise policy optimization via top-K sampling from CLIP distributions and alignment/dispersion rewards to tune the visual encoder.