Joint Reward Modeling: Internalizing Chain-of-Thought for Efficient Visual Reward Models

Yankai Yang; Yancheng Long; Hongyang Wei; Wei Chen; Tianke Zhang; Kaiyu Jiang; Haonan Fan; Changyi Liu; Jiankang Chen; Kaiyu Tang; Bin Wen; Fan Yang; Tingting Gao; Han Li; Shuo Yang

arxiv: 2602.07533 · v3 · pith:QAC6KKUDnew · submitted 2026-02-07 · 💻 cs.AI

classification 💻 cs.AI

keywords: reward, models, modeling, semantic, generative, human, joint, learning

Reward models are critical for reinforcement learning from human feedback, as they determine the alignment quality and reliability of generative models. For complex tasks such as image editing, reward models are required to capture global semantic consistency and implicit logical constraints beyond local similarity. Existing reward modeling approaches have clear limitations. Discriminative reward models align well with human preferences but struggle with complex semantics due to limited reasoning supervision. Generative reward models offer stronger semantic understanding and reasoning, but they are costly at inference time and difficult to align directly with human preferences. To this end, we propose Joint Reward Modeling (JRM), which jointly optimizes preference learning and language modeling on a shared vision-language backbone. This approach internalizes the semantic and reasoning capabilities of generative models into efficient discriminative representations, enabling fast and accurate evaluation. JRM achieves state-of-the-art results on MMRB2 and EditReward-Bench, and significantly improves stability and performance in downstream online reinforcement learning. These results show that joint training effectively bridges efficiency and semantic understanding in reward modeling.
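
The joint objective described in the abstract can be illustrated with a minimal sketch. The abstract does not give the loss formulation, so everything here is an assumption: a Bradley-Terry pairwise loss stands in for preference learning, a mean token-level negative log-likelihood stands in for language modeling, and the function names and the `lm_weight` mixing coefficient are hypothetical. In a real system both terms would be computed from a shared vision-language backbone; here they take plain numbers so the arithmetic is easy to follow.

```python
import math


def bradley_terry_loss(score_chosen, score_rejected):
    """Pairwise preference loss: -log(sigmoid(score_chosen - score_rejected)).

    Assumption: the abstract does not name the preference loss;
    Bradley-Terry is used here only as a common choice.
    """
    margin = score_chosen - score_rejected
    # softplus(-margin), written in a numerically stable form
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def lm_loss(token_logprobs):
    """Mean negative log-likelihood over the target (e.g. reasoning) tokens."""
    return -sum(token_logprobs) / len(token_logprobs)


def joint_loss(score_chosen, score_rejected, token_logprobs, lm_weight=0.5):
    """Weighted sum of the two objectives.

    lm_weight is a hypothetical hyperparameter, not taken from the paper.
    """
    preference = bradley_terry_loss(score_chosen, score_rejected)
    return preference + lm_weight * lm_loss(token_logprobs)


if __name__ == "__main__":
    # Scalar scores for a preferred and a rejected edit, plus the
    # log-probabilities the model assigns to two target tokens.
    print(joint_loss(2.0, 0.0, [-0.1, -0.3]))
```

Under this sketch, only the scalar scoring path would be needed at inference time, which is consistent with the abstract's claim of fast discriminative evaluation; the language-modeling term acts purely as a training signal.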

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Will It Go Viral? Grounding Micro-Video Popularity Prediction on the Open Web
cs.MM 2026-05 unverdicted novelty 7.0

WEBSHORTS dataset and SHORTS-CAST framework ground micro-video popularity prediction in structured open-web context collected at upload time and enable selective online adaptation using delayed labels.

SpatialFlow-GRPO: Where Spatial Credit Drives Image Editing
cs.CV 2026-06 unverdicted novelty 6.0

SpatialFlow-GRPO improves image editing quality by converting region-aware rewards into semantic-region-level optimization signals aligned with latent positions during policy updates.

SpatialFlow-GRPO: Where Spatial Credit Drives Image Editing
cs.CV 2026-06 unverdicted novelty 6.0

SpatialFlow-GRPO adds region-level reward feedback and spatial alignment to Flow-GRPO-style RL for image editing, reporting gains on GEdit-Bench, ImgEdit-Bench, and a new MultiEditBench.

Beyond Scalar Rewards by Internalizing Reasoning into Score Distributions
cs.CV 2026-06 unverdicted novelty 6.0

Z-Reward trains a 27B reasoning teacher VLM on score distributions via GDSO and distills it via RISD into a 9B student, reaching 89.6% and 88.6% human preference accuracy with 41.3% optimization gain over SFT baseline.