{"total":17,"items":[{"citing_arxiv_id":"2605.25343","ref_index":171,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Toward Native Multimodal Modeling: A Roadmap","primary_cat":"cs.CV","submitted_at":"2026-05-25T01:57:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A roadmap that defines architectural nativity for multimodal models and categorizes them into Multi-to-Text, Multi-to-Target, and Multi-to-Multi types while outlining an industrial pipeline toward unified transformer-based native multimodal modeling.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15300","ref_index":74,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Deep Pre-Alignment for VLMs","primary_cat":"cs.CV","submitted_at":"2026-05-14T18:14:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Deep Pre-Alignment uses a small VLM perceiver instead of ViT to pre-align visual features with LLM text space, yielding 1.9-3.0 point gains on multimodal benchmarks and 32.9% less language forgetting.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00323","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Online Self-Calibration Against Hallucination in Vision-Language Models","primary_cat":"cs.CV","submitted_at":"2026-05-01T01:03:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OSCAR exploits the generative-discriminative gap in LVLMs to build online preference data with MCTS and dual-granularity rewards for DPO-based calibration, claiming SOTA hallucination reduction and improved multimodal 
performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18034","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SignDPO: Multi-level Direct Preference Optimisation for Skeleton-based Gloss-free Sign Language Translation","primary_cat":"cs.CL","submitted_at":"2026-04-20T09:59:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SignDPO uses hierarchical perturbations, self-guided attention-based sampling, and an automated language-level preference generator to align skeleton trajectories with linguistic semantics, outperforming prior gloss-free methods on CSL-Daily, How2Sign, and OpenASL.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.13029","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Visual Preference Optimization with Rubric Rewards","primary_cat":"cs.CV","submitted_at":"2026-04-14T17:58:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"rDPO uses offline-built rubrics to generate on-policy preference data for DPO, raising benchmark scores in visual tasks over outcome-based filtering and style baselines.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"1 Reward Models for Multimodal Alignment Aligning VLMs with complex human intent has motivated the curation of large-scale multimodal preference datasets. Early datasets relied on extensive human feedback, including LLaVA-RLHF [23], RLHF-V [24], MM-RLHF [28], WildVision [25], and VisionArena [41]. To reduce annotation costs, AI feedback mechanisms followed, as seen in VLFeedback [26], RLAIF-V [13], MIA-DPO [27], and LLaVA-Critic [31]. 
In parallel, specialized visual RMs have been developed as automated evaluators. Early explorations like Prometheus-Vision [29], SIMA [30], and LLaVA-Critic [31] established the foundation. Recent work has further refined RM capabilities: CAREVL [42] distills language reward knowledge into visual RMs, SVIP-Reward [43] incorporates stepwise visual programs,"},{"citing_arxiv_id":"2604.10966","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"You Only Judge Once: Multi-response Reward Modeling in a Single Forward Pass","primary_cat":"cs.CV","submitted_at":"2026-04-13T04:02:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A multi-response discriminative reward model scores N candidates in one pass via concatenation and cross-entropy, achieving SOTA on multimodal benchmarks and improving RL policies over single-response baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.13054","ref_index":41,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Topo-R1: Detecting Topological Anomalies via Vision-Language Models","primary_cat":"cs.CV","submitted_at":"2026-03-13T15:05:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Topo-R1 fine-tunes a vision-language model using a topology-aware reward and GRPO to detect anomalies such as broken or spurious connections in tubular segmentation masks, outperforming standard VLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.12455","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Mitigating Object Hallucinations via Sentence-Level Early 
Intervention","primary_cat":"cs.CV","submitted_at":"2025-07-16T17:55:43+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SENTINEL reduces MLLM object hallucinations by over 90% via sentence-level early intervention with detector-bootstrapped preference data and C-DPO loss, outperforming prior SOTA on hallucination and capability benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2411.10442","ref_index":48,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization","primary_cat":"cs.CL","submitted_at":"2024-11-15T18:59:27+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Mixed Preference Optimization with the MMPR dataset boosts multimodal CoT reasoning, lifting InternVL2-8B to 67.0 accuracy on MathVista (+8.7 points) and matching the 76B model.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"ing remains under-explored. Building on these insights, we conduct a systematic study on using PO to strengthen the multimodal reasoning capabilities of MLLMs. Enhancing the multimodal reasoning abilities of MLLMs through PO presents several challenges: (1) Limited multimodal reasoning preference data and high annotation cost. Existing multimodal preference datasets [48, 88, 110, 111, 117] primarily address hallucination issues and focus on natural images and perception data, lacking scientific images and reasoning data. Annotating these types of data requires human annotators to carefully compare the given reasoning processes, making it both time-consuming and costly. 
(2) Lack of open-source methods for improving multimodal reasoning via PO."},{"citing_arxiv_id":"2407.03320","ref_index":73,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output","primary_cat":"cs.CV","submitted_at":"2024-07-03T17:59:21+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"InternLM-XComposer-2.5 is a 7B vision-language model supporting up to 96K context that reaches GPT-4V-level performance on image, video, and multi-turn tasks and adds LoRA-driven text-image composition capabilities.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"standing benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. 2, 9 [72] Lei Li, Zhihui Xie, Mukai Li, Shunian Chen, Peiyi Wang, Liang Chen, Yazheng Yang, Benyou Wang, and Lingpeng Kong. Silkie: Preference distillation for large visual language models. arXiv preprint arXiv:2312.10665, 2023. 6 [73] Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models. arXiv preprint arXiv:2311.17043, 2023. 2 [74] Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, Ruihang Chu, Shaoteng Liu, and Jiaya Jia. Mini-Gemini: Mining the potential of multi-modality vision language models. 
arXiv preprint arXiv:2403."},{"citing_arxiv_id":"2404.18930","ref_index":105,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Hallucination of Multimodal Large Language Models: A Survey","primary_cat":"cs.CV","submitted_at":"2024-04-29T17:59:41+00:00","verdict":"ACCEPT","verdict_confidence":"UNKNOWN","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"Summary of most relevant benchmarks and metrics of object hallucination in MLLMs. The order is based on chronological order on arxiv. In the metric column, Acc/P/R/F1 denotes Accuracy/Precision/Recall/F1-Score. Benchmark Venue Underlying Data Source Size Task Type Metric Hallucination Type Category Attribute Relation Others CHAIR [137] EMNLP'18 MSCOCO [105] 5,000 Gen CHAIR ✓ ✗ ✗ ✗ POPE [103] EMNLP'23 MSCOCO [105] 3,000 Dis Acc/P/R/F1 ✓ ✗ ✗ ✗ MME [42] arXiv'23 Jun MSCOCO [105] 1457 Dis Acc/Score ✓ ✓ ✗ ✓ MMBench [116] ECCV'24 Not Specified 3217 Dis Acc ✓ ✓ ✓ Reasoning CIEM [63] NeurIPS-W'23 MSCOCO [105] 78120 Dis Acc ✓ ✗ ✗ ✗ M-HalDetect [52] AAAI'24 MSCOCO [105] 4,000 Dis Reward Model Score ✓ ✗ ✗ ✗ MMHal-Bench [148] arXiv'23 Sep."},{"citing_arxiv_id":"2404.07972","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments","primary_cat":"cs.AI","submitted_at":"2024-04-11T17:56:05+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM 
agents.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[24] Bowen Li, Wenhan Wu, Ziwei Tang, Lin Shi, John Yang, Jinyang Li, Shunyu Yao, Chen Qian, Binyuan Hui, Qicheng Zhang, et al. Devbench: A comprehensive benchmark for software development. arXiv preprint arXiv:2403.08604, 2024. [25] Hongxin Li, Jingran Su, Yuntao Chen, Qing Li, and Zhaoxiang Zhang. Sheetcopilot: Bringing software productivity to the next level through large language models. arXiv preprint arXiv:2305.19308, 2023. [26] Lei Li, Zhihui Xie, Mukai Li, Shunian Chen, Peiyi Wang, Liang Chen, Yazheng Yang, Benyou Wang, and Lingpeng Kong. Silkie: Preference distillation for large visual language models. arXiv preprint arXiv:2312.10665, 2023. [27] Yang Li, Jiacong He, Xin Zhou, Yuan Zhang, and Jason Baldridge. Mapping natural language instructions to mobile ui action sequences."},{"citing_arxiv_id":"2402.13116","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Survey on Knowledge Distillation of Large Language Models","primary_cat":"cs.CL","submitted_at":"2024-02-20T16:17:37+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A comprehensive survey of knowledge distillation for LLMs structured around algorithms, skill enhancement, and vertical applications, highlighting data augmentation as a key enabler.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2402.11684","ref_index":106,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ALLaVA: Harnessing GPT4V-Synthesized Data for Lite Vision-Language Models","primary_cat":"cs.CL","submitted_at":"2024-02-18T19:26:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ALLaVA creates 1.3M GPT4V-synthesized samples enabling 4B VLMs to 
achieve competitive results on 17 benchmarks and match 7B/13B models on some tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2402.11411","ref_index":161,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Aligning Modalities in Vision Large Language Models via Preference Fine-tuning","primary_cat":"cs.LG","submitted_at":"2024-02-18T00:56:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"POVID generates AI-created preference data to fine-tune vision-language models with DPO, reducing hallucinations and improving benchmark scores.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2306.13549","ref_index":117,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Survey on Multimodal Large Language Models","primary_cat":"cs.CV","submitted_at":"2023-06-23T15:21:52+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"possible answer options are predefined and limited to a finite set. The evaluation is usually performed on task-specific datasets. In this case, the responses can be naturally judged by benchmark metrics [20], [60], [70], [76], [101], [102], [103], [104]. For example, InstructBLIP [60] reports the accuracy on ScienceQA [116], as well as the CIDEr score [117] on NoCaps [118] and Flickr30K [119]. The evaluation settings are typically zero-shot [60], [102], [104], [105] or finetuning [20], [35], [60], [70], [76], [101], [103], [105]. 
The first setting often selects a wide range of datasets covering different general tasks and splits them into held-in and held-out datasets. After tuning on the former, zero-shot"},{"citing_arxiv_id":"2305.10415","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering","primary_cat":"cs.CV","submitted_at":"2023-05-17T17:50:16+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PMC-VQA dataset and MedVInT model achieve better generative performance on medical VQA benchmarks by visual instruction tuning on a newly constructed large-scale dataset.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}