InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model

Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Ziyu Liu, Shengyuan Ding, Shenxi Wu, Yubo Ma, Haodong Duan, Wenwei Zhang, et al. · 2025 · arXiv 2501.12368

5 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 3

citation-polarity summary

background 3

representative citing papers

Visual-ERM: Reward Modeling for Visual Equivalence

cs.CV · 2026-03-13 · unverdicted · novelty 7.0

Visual-ERM is a new multimodal reward model that supplies fine-grained visual feedback for training vision-language models on chart-to-code, table, and SVG tasks, yielding measurable gains over prior rewards.

Unified Reward Model for Multimodal Understanding and Generation

cs.CV · 2025-03-07 · unverdicted · novelty 7.0

UnifiedReward is the first unified reward model that jointly assesses multimodal understanding and generation to provide better preference signals for aligning vision models via DPO.

VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

cs.CV · 2025-04-10 · unverdicted · novelty 6.0

VLM-R1 applies R1-style RL using rule-based rewards on visual tasks with clear ground truth to achieve competitive performance and superior generalization over SFT in vision-language models.

Visual-RFT: Visual Reinforcement Fine-Tuning

cs.CV · 2025-03-03 · conditional · novelty 6.0

Visual-RFT applies reinforcement learning with verifiable perception rewards to improve large vision-language models on fine-grained classification, few-shot detection, and grounding tasks.

LLaVA-CoT: Let Vision Language Models Reason Step-by-Step

cs.CV · 2024-11-15 · unverdicted · novelty 6.0

LLaVA-CoT adds autonomous multistage reasoning to vision-language models, delivering 9.4% gains over its base model and outperforming larger models like Gemini-1.5-pro on reasoning benchmarks via a 100k annotated dataset and SWIRES test-time scaling.

citing papers explorer

Showing 5 of 5 citing papers.

Visual-ERM: Reward Modeling for Visual Equivalence · cs.CV · 2026-03-13 · unverdicted · none · ref 38
Unified Reward Model for Multimodal Understanding and Generation · cs.CV · 2025-03-07 · unverdicted · none · ref 6
VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model · cs.CV · 2025-04-10 · unverdicted · none · ref 56
Visual-RFT: Visual Reinforcement Fine-Tuning · cs.CV · 2025-03-03 · conditional · none · ref 45
LLaVA-CoT: Let Vision Language Models Reason Step-by-Step · cs.CV · 2024-11-15 · unverdicted · none · ref 64
