InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model

5 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 3

citation-polarity summary

fields

cs.CV 5

roles

background 3

polarities

background 3

representative citing papers

Visual-ERM: Reward Modeling for Visual Equivalence

cs.CV · 2026-03-13 · unverdicted · novelty 7.0

Visual-ERM is a new multimodal reward model that supplies fine-grained visual feedback for training vision-language models on chart-to-code, table, and SVG tasks, yielding measurable gains over prior rewards.

Visual-RFT: Visual Reinforcement Fine-Tuning

cs.CV · 2025-03-03 · conditional · novelty 6.0

Visual-RFT applies reinforcement learning with verifiable perception rewards to improve large vision-language models on fine-grained classification, few-shot detection, and grounding tasks.

LLaVA-CoT: Let Vision Language Models Reason Step-by-Step

cs.CV · 2024-11-15 · unverdicted · novelty 6.0

LLaVA-CoT adds autonomous multistage reasoning to vision-language models, delivering 9.4% gains over its base model and outperforming larger models like Gemini-1.5-pro on reasoning benchmarks via a 100k annotated dataset and SWIRES test-time scaling.

citing papers explorer

Showing 5 of 5 citing papers.

  • Visual-ERM: Reward Modeling for Visual Equivalence cs.CV · 2026-03-13 · unverdicted · none · ref 38

    Visual-ERM is a new multimodal reward model that supplies fine-grained visual feedback for training vision-language models on chart-to-code, table, and SVG tasks, yielding measurable gains over prior rewards.

  • Unified Reward Model for Multimodal Understanding and Generation cs.CV · 2025-03-07 · unverdicted · none · ref 6

    UnifiedReward is the first unified reward model that jointly assesses multimodal understanding and generation to provide better preference signals for aligning vision models via DPO.

  • VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model cs.CV · 2025-04-10 · unverdicted · none · ref 56

    VLM-R1 applies R1-style RL using rule-based rewards on visual tasks with clear ground truth to achieve competitive performance and superior generalization over SFT in vision-language models.

  • Visual-RFT: Visual Reinforcement Fine-Tuning cs.CV · 2025-03-03 · conditional · none · ref 45

    Visual-RFT applies reinforcement learning with verifiable perception rewards to improve large vision-language models on fine-grained classification, few-shot detection, and grounding tasks.

  • LLaVA-CoT: Let Vision Language Models Reason Step-by-Step cs.CV · 2024-11-15 · unverdicted · none · ref 64

    LLaVA-CoT adds autonomous multistage reasoning to vision-language models, delivering 9.4% gains over its base model and outperforming larger models like Gemini-1.5-pro on reasoning benchmarks via a 100k annotated dataset and SWIRES test-time scaling.