
arxiv: 2506.22832 · v3 · submitted 2025-06-28 · 💻 cs.CV · cs.AI

Listener-Rewarded Thinking in VLMs for Image Preferences

Pith reviewed 2026-05-19 07:35 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords listener reward · vision language models · image preferences · reinforcement learning · GRPO · chain of thought · preference alignment · reasoning consistency

The pith

A frozen listener VLM supplies confidence scores that shape RL rewards and improve reasoning consistency for image preferences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to show that adding an independent frozen vision-language model as a listener to evaluate and score a reasoner's chain-of-thought produces better rewards in group relative policy optimization. This listener-shaped reward pushes the reasoner to generate explanations that an outside model finds persuasive, which reduces internal contradictions while lifting accuracy and out-of-distribution performance. A sympathetic reader would care because existing reward models for aligning text-to-image generators often memorize training data and fail on new human preferences. The approach offers a data-efficient alternative that avoids heavy new annotation by reusing the listener's existing calibration. The reported gains are 67.4 percent accuracy on the ImageReward benchmark and up to six percent better results on a 1.2 million vote human preference set.

Core claim

The central claim is that listener-augmented GRPO, in which a frozen VLM re-evaluates the reasoner's chain-of-thought and injects a dense calibrated confidence score into the RL reward, trains vision-language models to answer image-preference questions both correctly and in ways that survive independent scrutiny, yielding 67.4 percent accuracy on ImageReward, up to six percent OOD lift on a 1.2M-vote dataset, and fewer reasoning contradictions than plain GRPO or SFT.

What carries the argument

Listener-shaped reward: an independent frozen VLM evaluates the reasoner's chain-of-thought and returns a calibrated confidence score that augments the GRPO reward signal.
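
A minimal sketch of how a listener confidence score could be folded into a group-relative reward. The additive form, the `weight` coefficient, and the function names are illustrative assumptions; this page works from the abstract only and does not state the paper's actual reward formula. The within-group standardization is the usual GRPO-style advantage.

```python
from statistics import mean, pstdev

def shaped_rewards(correct, listener_conf, weight=0.5):
    """Add a listener confidence in [0, 1] to a 0/1 correctness reward.

    `weight` is a hypothetical mixing coefficient, not a value from the paper.
    """
    return [float(c) + weight * p for c, p in zip(correct, listener_conf)]

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of rollouts (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four rollouts for one image pair (made-up numbers).
correct = [1, 1, 0, 0]
listener_conf = [0.9, 0.4, 0.6, 0.1]
rewards = shaped_rewards(correct, listener_conf)
advantages = group_relative_advantages(rewards)
```

Under these assumptions, two rollouts that are both correct still receive different advantages: the one the listener finds more convincing is ranked higher, which is the dense signal the claim relies on.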

If this is right

  • The listener reward reaches 67.4 percent accuracy on the ImageReward benchmark.
  • The same scheme improves out-of-distribution performance on a 1.2 million vote human preference dataset by as much as six percent over a naive reasoner.
  • Reasoning contradictions between the trained model and an independent evaluator drop compared with both GRPO and supervised fine-tuning baselines.
  • The method supplies a scalable route to aligning vision-language models with nuanced human visual preferences without large new annotation pipelines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same listener-reward pattern could be applied to video preference or multimodal reasoning tasks where explanation consistency is the bottleneck.
  • If the listener and reasoner share the same base model family, the gains might shrink; testing with deliberately mismatched listener architectures would reveal how much independence is required.
  • The reduction in contradictions may improve downstream use of the generated explanations for human debugging or for chaining into larger agent systems.
  • Because the listener is frozen, the method keeps training cost low and could be iterated by swapping in stronger listeners as they become available.

Load-bearing premise

An independent frozen vision-language model can give reliable, calibrated confidence scores on the reasoner's explanations without adding its own contradictions or biases.

What would settle it

Run the same training but replace the listener's confidence scores with random numbers or with scores from a model whose judgments are known to be uncorrelated with human votes; if the accuracy and OOD gains disappear, the claim is falsified.
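
One way to set up this control is sketched below. The helper name and the `shuffled` mode are assumptions added for illustration: the `random` mode matches the random-number replacement described above, while `shuffled` is a swapped-in variant that keeps the listener's score distribution but breaks its link to individual rollouts.

```python
import random

def control_listener_scores(listener_conf, mode="random", seed=0):
    """Return placebo scores to use in place of real listener confidences."""
    rng = random.Random(seed)
    if mode == "random":
        # Uniform noise in [0, 1), unrelated to the reasoning traces.
        return [rng.random() for _ in listener_conf]
    if mode == "shuffled":
        # Same values, assigned to the wrong rollouts.
        shuffled = list(listener_conf)
        rng.shuffle(shuffled)
        return shuffled
    raise ValueError(f"unknown mode: {mode}")

real_scores = [0.9, 0.4, 0.6, 0.1]
noise_scores = control_listener_scores(real_scores, mode="random")
permuted_scores = control_listener_scores(real_scores, mode="shuffled")
```

Training once with each placebo and once with the real scores, under otherwise identical settings, would separate the effect of the listener's judgment from the effect of simply adding a second reward term.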

Figures

Figures reproduced from arXiv: 2506.22832 by Alexander Gambashidze, Andrey Galichin, Andrey Kuznetsov, Anton Gusarov, Ivan Oseledets, Konstantin Sobolev, Li Pengyi, Matvey Skripkin.

Figure 1: While naive GRPO provides good generalization and already outperforms supervised …
Figure 2: Listener–reasoner disagreement is a strong error signal. Each point aggregates ImageReward test pairs whose $\ell_2$ distance $\lVert (s^{\mathrm{instr}}_1, s^{\mathrm{instr}}_2) - (s^{\mathrm{reason}}_1, s^{\mathrm{reason}}_2) \rVert_2$ falls in a bin. Accuracy drops as the two score vectors diverge.

4.2 Soft rewards. Let $C = \{V, P, T, A_{<t}\}$ denote the conditioning context (visual input $V$, prompt $P$, reasoning tokens $T$, and partial answer $A_{<t}$). The policy $\pi_\theta$ outputs log…
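
The binning described in the Figure 2 caption can be sketched as follows. The record layout, the bin edges, and the example numbers are assumptions for illustration; only the $\ell_2$ distance between the two score pairs comes from the caption.

```python
import math

def l2_disagreement(instr_scores, reason_scores):
    """Euclidean distance between the two per-image score pairs."""
    return math.dist(instr_scores, reason_scores)

def accuracy_by_bin(records, edges):
    """records: (instr_scores, reason_scores, is_correct) tuples.
    edges: ascending bin edges; each bin is the half-open range [lo, hi)."""
    hits = [0] * (len(edges) - 1)
    counts = [0] * (len(edges) - 1)
    for instr, reason, ok in records:
        d = l2_disagreement(instr, reason)
        for i in range(len(edges) - 1):
            if edges[i] <= d < edges[i + 1]:
                counts[i] += 1
                hits[i] += int(ok)
                break
    # None marks an empty bin.
    return [h / c if c else None for h, c in zip(hits, counts)]

# Made-up test pairs: two with low disagreement, two with high disagreement.
records = [
    ((0.8, 0.2), (0.8, 0.2), True),
    ((0.7, 0.3), (0.6, 0.4), True),
    ((0.9, 0.1), (0.2, 0.8), False),
    ((0.6, 0.4), (0.1, 0.9), True),
]
per_bin = accuracy_by_bin(records, edges=[0.0, 0.5, 1.5])
```

With these toy records the low-disagreement bin has accuracy 1.0 and the high-disagreement bin 0.5, the same qualitative pattern the caption reports.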
Figure 3: Accuracy on the high-quality [23] modern dataset at different human agreement thresholds. Listener mechanism consistently improves generalization beyond the strong GRPO baseline. Supervised Fine-Tuning and Reasoners are initialized from the same Qwen2.5-VL-7B-Instruct checkpoint.

5 Experiments. We initialize our models with Qwen2.5-VL-7B-Instruct [26] and evaluate on the ImageReward test set and a large, …
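
A sketch of how accuracy at different human agreement thresholds could be computed. It assumes each test pair carries an `agreement` value, taken here to be the fraction of human votes that went to the majority image; that definition and the example numbers are assumptions, not details given on this page.

```python
def accuracy_at_thresholds(examples, thresholds):
    """examples: (agreement, is_correct) tuples.
    Returns accuracy over the pairs whose agreement is at least each threshold."""
    result = {}
    for t in thresholds:
        kept = [ok for agreement, ok in examples if agreement >= t]
        result[t] = sum(kept) / len(kept) if kept else None
    return result

# Made-up pairs: the contested one is wrong, the clear-cut ones are right.
examples = [(0.55, False), (0.60, True), (0.80, True), (0.95, True)]
curve = accuracy_at_thresholds(examples, thresholds=[0.5, 0.7, 0.9])
```

Raising the threshold keeps only pairs where annotators agree more strongly, so the curve shows how a model fares as label noise is filtered out.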
Figure 4: Majority voting across multiple reasoning rollouts improves models insignificantly in OOD.
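
Majority voting over rollouts, as referred to in Figure 4, can be sketched like this. The answer labels and the tie rule are illustrative assumptions; the page does not describe the paper's exact voting procedure.

```python
from collections import Counter

def majority_vote(answers):
    """Most frequent answer; on a tie, the answer that appeared first wins."""
    return Counter(answers).most_common(1)[0][0]

def voted_accuracy(rollouts_per_pair, labels):
    """Accuracy after collapsing each pair's rollouts into a single vote."""
    votes = [majority_vote(r) for r in rollouts_per_pair]
    return sum(v == y for v, y in zip(votes, labels)) / len(labels)

# Three rollouts for each of two image pairs (made-up answers).
rollouts_per_pair = [
    ["image_1", "image_2", "image_1"],
    ["image_2", "image_2", "image_1"],
]
labels = ["image_1", "image_1"]
acc = voted_accuracy(rollouts_per_pair, labels)
```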
Original abstract

Training robust and generalizable reward models for human visual preferences is essential for aligning text-to-image and text-to-video generative models with human intent. However, current reward models often fail to generalize, and supervised fine-tuning leads to memorization, demanding complex annotation pipelines. While reinforcement learning (RL), specifically Group Relative Policy Optimization (GRPO), improves generalization, we uncover a key failure mode: a significant drop in reasoning accuracy occurs when a model's reasoning trace contradicts that of an independent, frozen vision-language model ("listener") evaluating the same output. To address this, we introduce a listener-augmented GRPO framework. Here, the listener re-evaluates the reasoner's chain-of-thought to provide a dense, calibrated confidence score, shaping the RL reward signal. This encourages the reasoner not only to answer correctly, but to produce explanations that are persuasive to an independent model. Our listener-shaped reward scheme achieves best accuracy on the ImageReward benchmark (67.4%), significantly improves out-of-distribution (OOD) performance on a large-scale human preference dataset (1.2M votes, up to +6% over naive reasoner), and reduces reasoning contradictions compared to strong GRPO and SFT baselines. These results demonstrate that listener-based rewards provide a scalable, data-efficient path to aligning vision-language models with nuanced human preferences. We will release our reasoning model here: https://huggingface.co/alexgambashidze/qwen2.5vl_image_preference_reasoner.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a listener-augmented Group Relative Policy Optimization (GRPO) framework for aligning vision-language models with human image preferences. A frozen independent VLM serves as a 'listener' that re-evaluates the reasoner's chain-of-thought and supplies a dense calibrated confidence score incorporated into the RL reward. The approach is reported to achieve 67.4% accuracy on the ImageReward benchmark, up to +6% improvement on out-of-distribution performance using a 1.2M-vote human preference dataset, and fewer reasoning contradictions relative to standard GRPO and SFT baselines. The trained reasoning model is released publicly.

Significance. If the central results hold after addressing independence concerns, the listener-reward mechanism offers a scalable and annotation-light route to improving generalization in preference modeling for text-to-image and video generation. The public model release is a clear strength that enables direct reproducibility and external validation. The method could meaningfully advance RL-based alignment techniques if gains are shown to reflect human preference distributions rather than listener-specific artifacts.

major comments (2)
  1. [Abstract] Abstract and results sections: the reported reduction in reasoning contradictions is evaluated against the same frozen listener model that supplies the reward signal; this renders the metric non-independent and risks circular validation of the reward design.
  2. [Results] Results on OOD performance: the +6% gain on the 1.2M-vote dataset and 67.4% ImageReward accuracy lack reported error bars, explicit dataset splits, and analysis of potential distributional overlap between the OOD test sets and the listener VLM's pretraining corpus, which could explain gains via listener-specific alignment rather than robust preference learning.
minor comments (2)
  1. [Abstract] The abstract states concrete performance numbers but does not reference the corresponding tables or figures that would allow readers to inspect baseline comparisons and ablations.
  2. [Methods] Notation for the listener confidence score and its integration into the GRPO objective could be clarified with an explicit equation in the methods section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, with a focus on clarifying the methodology and strengthening the empirical presentation where possible.

  1. Referee: [Abstract] Abstract and results sections: the reported reduction in reasoning contradictions is evaluated against the same frozen listener model that supplies the reward signal; this renders the metric non-independent and risks circular validation of the reward design.

    Authors: We agree that the contradiction metric is computed with respect to the same frozen listener VLM used to generate the reward signal. This is by design: the listener acts as a fixed, independent evaluator whose judgments define the target behavior we wish the reasoner to internalize. The metric therefore directly quantifies how successfully the training objective has aligned the reasoner with that evaluator. We nevertheless recognize that reporting improvement solely against the training listener can appear circular for external validation. In the revised manuscript we will (i) explicitly state in the abstract and results that the contradiction reduction is measured against the training listener, and (ii) add a supplementary analysis that measures contradictions against a second, architecturally distinct VLM not used during training. These clarifications and the additional check will be incorporated. revision: yes

  2. Referee: [Results] Results on OOD performance: the +6% gain on the 1.2M-vote dataset and 67.4% ImageReward accuracy lack reported error bars, explicit dataset splits, and analysis of potential distributional overlap between the OOD test sets and the listener VLM's pretraining corpus, which could explain gains via listener-specific alignment rather than robust preference learning.

    Authors: We accept that the current results presentation would benefit from greater statistical and distributional detail. In the revised version we will: (a) report error bars for both the ImageReward accuracy and the OOD gains, obtained via multiple random seeds or bootstrap resampling; (b) provide explicit descriptions of the train/validation/test splits used for the 1.2M-vote human preference dataset; and (c) include a short analysis of possible overlap between the OOD test images and the listener VLM's pretraining distribution, using domain labels and semantic similarity checks. These additions will be made in the results section and supplementary material. revision: yes
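
The bootstrap error bars mentioned in response 2 could be obtained along these lines. This is a percentile-bootstrap sketch under assumed settings (2000 resamples, a 95% interval, a fixed seed), and the 67-out-of-100 outcome vector is made up for illustration rather than taken from the paper's test set.

```python
import random

def bootstrap_accuracy_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for accuracy over 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample the test set with replacement and recompute accuracy each time.
    stats = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(outcomes) / n, lower, upper

# Hypothetical per-pair correctness: 67 right, 33 wrong.
outcomes = [1] * 67 + [0] * 33
point, lower, upper = bootstrap_accuracy_ci(outcomes)
```

Resampling test pairs captures evaluation noise only; variation across training runs would still need the multiple-seed option the authors mention.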

Circularity Check

1 step flagged

Contradiction reduction is by construction via listener reward design

specific steps
  1. self-definitional [Abstract]
    "we uncover a key failure mode: a significant drop in reasoning accuracy occurs when a model's reasoning trace contradicts that of an independent, frozen vision-language model (listener) evaluating the same output. To address this, we introduce a listener-augmented GRPO framework. Here, the listener re-evaluates the reasoner's chain-of-thought to provide a dense, calibrated confidence score, shaping the RL reward signal. This encourages the reasoner not only to answer correctly, but to produce explanations that are persuasive to an independent model. [...] and reduces reasoning contradictions"

    The failure mode is defined as contradiction with the listener. The reward is then built to increase the listener's confidence on the CoT (i.e., reduce contradictions with the listener). The reported reduction in contradictions is therefore achieved by construction through the listener-shaped reward rather than emerging as a separate validation.

full rationale

The paper defines a key failure mode explicitly in terms of the reasoner's output contradicting an independent frozen listener VLM. It then constructs the RL reward to optimize for higher listener confidence scores on the CoT, explicitly to make explanations 'persuasive to an independent model.' Reporting reduced contradictions is therefore a direct consequence of the reward objective rather than an independent empirical finding. However, the primary claims—67.4% on the ImageReward benchmark and up to +6% OOD gains on the separate 1.2M-vote human preference dataset—remain external to the listener and supply independent content, so the overall derivation does not collapse entirely to its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review; full training details, hyperparameters, and any fitted parameters in the GRPO objective or listener calibration are unavailable. The core addition is the listener-derived reward term.

axioms (1)
  • domain assumption The independent frozen listener VLM supplies a reliable, calibrated confidence score on reasoning traces that improves the reasoner when used in the RL objective.
    This premise is required for the reward shaping to produce the reported gains.

pith-pipeline@v0.9.0 · 5827 in / 1324 out tokens · 34850 ms · 2026-05-19T07:35:13.605531+00:00 · methodology

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · 15 internal anchors

  1. [1]

    Technical report, DeepSeek-AI, 2024

    Deepseek-r1: Scaling reasoning with reinforced learning. Technical report, DeepSeek-AI, 2024

  2. [2]

    Technical report, Tencent AI Lab, 2024

    Hunyuan-video: Scaling text-to-video generation with large-scale rlhf. Technical report, Tencent AI Lab, 2024

  3. [3]

    Technical report, LTX Lab, 2024

    Ltx-video: Large transformer for controllable video generation. Technical report, LTX Lab, 2024

  4. [4]

    Technical report, OpenAI, 2024

    Sora: Multi-modal video generation at scale. Technical report, OpenAI, 2024

  5. [5]

    Technical report, WanAI, 2024

    Want2v: High-fidelity text-to-video synthesis via direct preference optimization. Technical report, WanAI, 2024

  6. [6]

    Reasoning Models Don't Always Say What They Think

    Yanda Chen, Joe Benton, Ansh Radhakrishnan, Jonathan Uesato, Carson Denison, John Schulman, Arushi Somani, Peter Hase, Misha Wagner, Fabien Roger, Vlad Mikulik, Samuel R. Bowman, Jan Leike, Jared Kaplan, and Ethan Perez. Reasoning models don't always say what they think. arXiv preprint arXiv:2505.05410, 2025

  7. [8]

    SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

    Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V. Le, Sergey Levine, and Yi Ma. SFT memorizes, RL generalizes: A comparative study of foundation model post-training. arXiv preprint arXiv:2501.17161, January 2025. Accessed: March 24, 2025

  8. [9]

    Gemini 2.5 pro: Advanced multimodal reasoning model

    Google DeepMind. Gemini 2.5 pro: Advanced multimodal reasoning model. https://deepmind.google/technologies/gemini/pro/, 2025. Product page and capability demo. Accessed 2025-05-16

  9. [10]

    Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning

    DeepSeek. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. Technical report, DeepSeek, 2023

  10. [11]

    Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model

    Guoqing Ma et al. Step-video-t2v technical report: The practice, challenges, and future of video foundation model. Technical report, StepFun, 2025. arXiv:2502.10248

  11. [12]

    The poison of alignment

    Chun et al. The poison of alignment. arXiv preprint arXiv:2308.13449, August 2023. Accessed: March 24, 2025

  12. [13]

    Training Compute-Optimal Large Language Models

    Jordan Hoffmann et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, March 2022

  13. [14]

    Aligning diffusion models with noise-conditioned perception

    Alexander Gambashidze, Pavel Kulikov, Maxim Sosnin, and Ivan Makarov. Aligning diffusion models with noise-conditioned perception. arXiv preprint arXiv:2406.17636, 2025

  14. [15]

    Scaling Laws for Reward Model Overoptimization

    Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. arXiv preprint arXiv:2210.10760, October 2022. Accessed: March 24, 2025

  15. [16]

    LTX-Video: Realtime Video Latent Diffusion

    Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, and Dudu Moshe et al. Ltx-video: Realtime video latent diffusion. arXiv preprint arXiv:2501.00103, 2024

  16. [17]

    Pick-a-pic: An open dataset of user preferences for text-to-image generation

    Yuval Kirstain, Adam Polyak, Uriel Singer, et al. Pick-a-pic: An open dataset of user preferences for text-to-image generation. In NeurIPS, 2023

  17. [18]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, and Jin Zhou et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024

  18. [19]

    Reasoning models can be effective without thinking

    Wenjie Ma, Jingxuan He, Charlie Snell, Tyler Griggs, Sewon Min, and Matei Zaharia. Reasoning models can be effective without thinking. 2025

  19. [20]

    OpenAI o1: Learning to reason with reinforcement learning

    OpenAI. OpenAI o1: Learning to reason with reinforcement learning. https://openai.com/index/learning-to-reason-with-llms, 2024. System card released Dec 5 2024. Accessed 2025-05-16

  20. [21]

    OpenAI o3: A multimodal model for math, science, coding, and visual reasoning

    OpenAI. OpenAI o3: A multimodal model for math, science, coding, and visual reasoning. https://platform.openai.com/docs/models/o3, 2025. Model announcement Apr 2025. Accessed 2025-05-16

  21. [22]

    Direct Preference Optimization: Your Language Model is Secretly a Reward Model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, et al. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290, 2023

  22. [23]

    Rapidata human style preferences for images

    Rapidata. Rapidata human style preferences for images. https://huggingface.co/datasets/Rapidata/human-style-preferences-images, 2025

  23. [24]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, July 2017

  24. [25]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  25. [26]

    Qwen2.5: A party of foundation models, September 2024

    Qwen Team. Qwen2.5: A party of foundation models, September 2024

  26. [27]

    Diffusion model alignment using direct preference optimization

    Bram Wallace, Meihua Dang, Rafael Rafailov, et al. Diffusion model alignment using direct preference optimization. arXiv preprint arXiv:2311.12908, 2023

  27. [28]

    Xiaoshi Wu et al. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341, June 2023

  28. [29]

    Grok-3: The age of reasoning agents

    xAI. Grok-3: The age of reasoning agents. https://x.ai/blog/grok-3, 2025. System card and model overview. Accessed 2025-05-16

  29. [30]

    LLaVA-CoT: Let Vision Language Models Reason Step-by-Step

    Guowei Xu, Peng Jin, Hao Li, Yibing Song, Lichao Sun, and Li Yuan. Llava-cot: Let vision-language models reason step-by-step. arXiv preprint arXiv:2411.10440, November 2024. Accessed: March 24, 2025

  30. [31]

    VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation

    Jiazheng Xu, Yu Huang, Jiale Cheng, et al. Visionreward: Fine-grained multi-dimensional human preference learning for image and video generation. arXiv preprint arXiv:2412.21059, 2024

  31. [32]

    Imagereward: Learning and evaluating human preferences for text-to-image generation

    Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation. arXiv preprint arXiv:2304.05977, April 2023

  32. [33]

    Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

    Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model? arXiv preprint arXiv:2504.13837, 2025

  33. [34]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Joseph E. Gonzalez, Ion Stoica, and Hao Zhang. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685, June 2023

  34. [35]

    Huaisheng Zhu, Teng Xiao, and Vasant G. Honavar. Dspo: Direct score preference optimization for diffusion model alignment. In International Conference on Learning Representations (ICLR),

  35. [36]

    OpenReview xyfb9HHvMe.