pith. sign in

hub Mixed citations

Silkie: Preference distillation for large visual lan- guage models

Mixed citation behavior. Most common role is background (67%).

17 Pith papers citing it
Background 67% of classified citations

hub tools

citation-role summary

background 4 dataset 2

citation-polarity summary

clear filters

representative citing papers

Visual Preference Optimization with Rubric Rewards

cs.CV · 2026-04-14 · unverdicted · novelty 7.0

rDPO uses offline-built rubrics to generate on-policy preference data for DPO, raising benchmark scores in visual tasks over outcome-based filtering and style baselines.

Deep Pre-Alignment for VLMs

cs.CV · 2026-05-14 · unverdicted · novelty 6.0

Deep Pre-Alignment uses a small VLM perceiver instead of ViT to pre-align visual features with LLM text space, yielding 1.9-3.0 point gains on multimodal benchmarks and 32.9% less language forgetting.

Toward Native Multimodal Modeling: A Roadmap

cs.CV · 2026-05-25 · unverdicted · novelty 3.0

A roadmap that defines architectural nativity for multimodal models and categorizes them into Multi-to-Text, Multi-to-Target, and Multi-to-Multi types while outlining an industrial pipeline toward unified transformer-based native multimodal modeling.

A Survey on Multimodal Large Language Models

cs.CV · 2023-06-23 · accept · novelty 3.0

This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.

citing papers explorer

Showing 11 of 11 citing papers after filters.

  • Visual Preference Optimization with Rubric Rewards cs.CV · 2026-04-14 · unverdicted · none · ref 26

    rDPO uses offline-built rubrics to generate on-policy preference data for DPO, raising benchmark scores in visual tasks over outcome-based filtering and style baselines.

  • You Only Judge Once: Multi-response Reward Modeling in a Single Forward Pass cs.CV · 2026-04-13 · unverdicted · none · ref 3

    A multi-response discriminative reward model scores N candidates in one pass via concatenation and cross-entropy, achieving SOTA on multimodal benchmarks and improving RL policies over single-response baselines.

  • Topo-R1: Detecting Topological Anomalies via Vision-Language Models cs.CV · 2026-03-13 · unverdicted · none · ref 41

    Topo-R1 fine-tunes a vision-language model using a topology-aware reward and GRPO to detect anomalies such as broken or spurious connections in tubular segmentation masks, outperforming standard VLMs.

  • Deep Pre-Alignment for VLMs cs.CV · 2026-05-14 · unverdicted · none · ref 74

    Deep Pre-Alignment uses a small VLM perceiver instead of ViT to pre-align visual features with LLM text space, yielding 1.9-3.0 point gains on multimodal benchmarks and 32.9% less language forgetting.

  • Online Self-Calibration Against Hallucination in Vision-Language Models cs.CV · 2026-05-01 · unverdicted · none · ref 15

    OSCAR exploits the generative-discriminative gap in LVLMs to build online preference data with MCTS and dual-granularity rewards for DPO-based calibration, claiming SOTA hallucination reduction and improved multimodal performance.

  • Mitigating Object Hallucinations via Sentence-Level Early Intervention cs.CV · 2025-07-16 · conditional · none · ref 28

    SENTINEL reduces MLLM object hallucinations by over 90% via sentence-level early intervention with detector-bootstrapped preference data and C-DPO loss, outperforming prior SOTA on hallucination and capability benchmarks.

  • PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering cs.CV · 2023-05-17 · conditional · none · ref 31

    PMC-VQA dataset and MedVInT model achieve better generative performance on medical VQA benchmarks by visual instruction tuning on a newly constructed large-scale dataset.

  • InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output cs.CV · 2024-07-03 · conditional · none · ref 73

    InternLM-XComposer-2.5 is a 7B vision-language model supporting up to 96K context that reaches GPT-4V-level performance on image, video, and multi-turn tasks and adds LoRA-driven text-image composition capabilities.

  • Hallucination of Multimodal Large Language Models: A Survey cs.CV · 2024-04-29 · accept · none · ref 105

    The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.

  • Toward Native Multimodal Modeling: A Roadmap cs.CV · 2026-05-25 · unverdicted · none · ref 171

    A roadmap that defines architectural nativity for multimodal models and categorizes them into Multi-to-Text, Multi-to-Target, and Multi-to-Multi types while outlining an industrial pipeline toward unified transformer-based native multimodal modeling.

  • A Survey on Multimodal Large Language Models cs.CV · 2023-06-23 · accept · none · ref 117

    This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.