{"total":15,"items":[{"citing_arxiv_id":"2605.15951","ref_index":37,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From Failure to Feedback: Group Revision Unlocks Hard Cases in Object-Level Grounding","primary_cat":"cs.CV","submitted_at":"2026-05-15T13:41:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A group-revision paradigm for GRPO-based RL fine-tuning of VLMs converts failure responses into improvement signals that refine rewards and advantages, yielding gains on referring segmentation, REC, and counting benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15300","ref_index":146,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Deep Pre-Alignment for VLMs","primary_cat":"cs.CV","submitted_at":"2026-05-14T18:14:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Deep Pre-Alignment uses a small VLM perceiver instead of ViT to pre-align visual features with LLM text space, yielding 1.9-3.0 point gains on multimodal benchmarks and 32.9% less language forgetting.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12960","ref_index":69,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DiM\\textsuperscript{3}: Bridging Multilingual and Multimodal Models via Direction- and Magnitude-Aware Merging","primary_cat":"cs.CL","submitted_at":"2026-05-13T03:50:54+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DiM3 is a direction- and magnitude-aware merging method that composes heterogeneous multilingual and multimodal updates in LLM backbones, outperforming baselines on 57-language benchmarks while retaining multimodal 
performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.20328","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Hybrid Latent Reasoning with Decoupled Policy Optimization","primary_cat":"cs.CV","submitted_at":"2026-04-22T08:22:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"HyLaR with DePO enables effective RL in hybrid discrete-continuous spaces for multimodal models, outperforming prior MLLMs on perception and understanding benchmarks.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"We report results on three high-resolution benchmarks targeting ultra-high-definition perception and visual search: HRBench (4K & 8K) [35] and V* [39]. Additionally, we evaluate on a robust suite of general and diagnostic VQA benchmarks to cover various specialized capabilities: MMStar [4] (visual dependency), MMVP [29] (fine-grained visual patterns), SeedBench2Plus [19] (multimodal reasoning), BLINK [7] (core visual perception), and HallusionBench [9] (illusion and hallucination diagnostics). 
Baselines. To rigorously evaluate our approach, we benchmark it against four categories of representative baselines: (1) Proprietary Frontier MLLMs, including GPT-4o [15] and Gemini-3-Flash [8]; (2) Open-source General MLLMs, including LLaVA-OneVision-7B [18] and Qwen2."},{"citing_arxiv_id":"2604.15809","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Aligning What Vision-Language Models See and Perceive with Adaptive Information Flow","primary_cat":"cs.CV","submitted_at":"2026-04-17T08:07:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"An inference-time technique that uses token activation dynamics to adaptively restrict text attention to important visual tokens, improving VLM accuracy on VQA, grounding, counting, OCR, and hallucination benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.10500","ref_index":29,"ref_count":3,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Visual Enhanced Depth Scaling for Multimodal Latent Reasoning","primary_cat":"cs.CV","submitted_at":"2026-04-12T07:14:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Visual replay module and adaptive depth scaling improve multimodal latent reasoning, reaching SOTA benchmarks with faster inference than explicit chain-of-thought methods.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"Bench [59], GQA [1]); (3) Multimodal Composition Reasoning (MMStar [4], BLINK [11], ScienceQA [38], M3CoT [5]). Additional descriptive details can be found in the Appendix. 
Baselines. We evaluate our method against a comprehensive set of state-of-the-art baselines, which can be categorized into four paradigms: (1) Zero-shot VLMs (GPT-4o [42], LLaVA-OneVision [29], InternVL3.5-8B [60], Qwen2.5-VL-7B [2]), (2) Explicit CoT-based methods (SCAFFOLD [27], ICoT [13], Multimodal-CoT [77], CCoT [40], Chain-of-Focus [75]), (3) Visual Enhanced 6 M3CoT ScienceQA GQA Method Model Acc.(%)↑ # AR Steps↓ Avg. Time↓ Acc.(%)↑ # AR Steps↓ Avg. Time↓ Acc.(%)↑ # AR Steps↓ Avg. Time↓ No-CoT 45.4 - - 64.4 - - - - - Multimodal CoT [77] 42."},{"citing_arxiv_id":"2604.08545","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models","primary_cat":"cs.CV","submitted_at":"2026-04-09T17:59:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"HDPO reframes tool efficiency as a conditional objective within accurate trajectories, enabling Metis to reduce tool invocations by orders of magnitude while raising reasoning accuracy.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"multimodal models, including Pixel-Reasoner [34], DeepEyes [56], Thyme [51], DeepEyesV2 [8], Mini-o3 [12], and Skywork-R1V4-30B-A3B [53]. Benchmarks. We evaluate Metis across two broad groups of benchmarks covering complementary cognitive capabilities. Perception and Document Understanding: V*Bench [42], HRBench-4K/8K [37], TreeBench [], MME-RealWorld [52], SEEDBench2-Plus [15], and CharXiv (descriptive and reasoning questions) [39]. Mathematical and Logical Reasoning: MathVista mini [19], MathVerse mini [50], WeMath [21], DynaMath [58], and LogicVista [43]. 4.2 Main Results We present a comprehensive evaluation of Metis across perception, document understanding, and mathematical reasoning benchmarks. 
As shown in Table 1 and Table 2, Metis establishes new"},{"citing_arxiv_id":"2601.06803","ref_index":43,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Forest Before Trees: Latent Superposition for Efficient Visual Reasoning","primary_cat":"cs.CL","submitted_at":"2026-01-11T08:30:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Laser reformulates visual reasoning via Dynamic Windowed Alignment Learning to maintain latent superposition of global features, delivering 5.03% average gains over Monet and over 97% fewer inference tokens on six benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.14998","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FinCriticalED: A Visual Benchmark for Financial Fact-Level OCR","primary_cat":"cs.CV","submitted_at":"2025-11-19T00:41:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"FinCriticalED benchmark reveals that OCR and MLLM systems frequently fail to preserve critical financial facts such as numbers and monetary units even when lexical accuracy is high.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.05271","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DeepEyesV2: Toward Agentic Multimodal Model","primary_cat":"cs.CV","submitted_at":"2025-11-07T14:31:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DeepEyesV2 uses a two-stage cold-start plus reinforcement learning pipeline to produce an agentic multimodal model that adaptively invokes tools and outperforms direct RL on real-world reasoning 
benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.18265","ref_index":59,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency","primary_cat":"cs.CV","submitted_at":"2025-08-25T17:58:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and agentic tasks.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"overall score is the average score of all benchmarks. 3.3 OCR, Chart, and Document Understanding To evaluate the comprehensive capabilities of the model across tasks related to text, document, and chart comprehension, we conduct an extensive assessment on nine benchmarks: AI2D [55], ChartQA [87], TextVQA [112], DocVQA [89], InfoVQA [88], OCRBench [72], SEED-2-Plus [59], CharXiv [148], and VCR [176]. As shown in Table 4, InternVL3.5 achieves competitive results on these benchmarks, outperforming other open-source and closed-source models. At the lightweight scale, InternVL3.5 demonstrates significant potential. 
For instance, InternVL3.5-2B attains an overall average score of 76.7 across nine benchmarks, surpassing"},{"citing_arxiv_id":"2504.10479","ref_index":61,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models","primary_cat":"cs.CV","submitted_at":"2025-04-14T17:59:25+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"This improvement underscores the effectiveness of test-time scaling. 3.3 OCR, Chart, and Document Understanding To assess the model's integrated vision-language understanding in tasks involving text, document, and chart comprehension, we perform a comprehensive evaluation over nine benchmarks, including AI2D [57], ChartQA [91], TextVQA [107], DocVQA [93], InfoVQA [92], OCRBench [76], SEED-2-Plus [61], CharXiv [128], and VCR [148]. As illustrated in Table 3, the InternVL3 series not only maintains robust performance across these benchmarks but also demonstrates competitive or superior results when compared to other open-source and closed-source counterparts. 
At the 1B scale, InternVL3-1B achieves performance that is roughly on par with previous lower-scale models."},{"citing_arxiv_id":"2501.00321","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning","primary_cat":"cs.CV","submitted_at":"2024-12-31T07:32:35+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"OCRBench v2 is a new benchmark with four times more tasks than prior versions that reveals most large multimodal models score below 50 out of 100 on visual text tasks and share five specific weaknesses.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"better evaluation quality, we further collect and label 1,500 additional text-images from scratch, reserved as the private test set. This private data serves as an independently curated test set to validate model generalization. 
In summary, the contributions of this work are three-fold: 2 Benchmark #Scenario #Task #Image #Instruction OCRbench [14] ∼ 14 5 0.9k 1k Seed-bench-2-plus [15] ∼ 8 1 0.6k 2.3k CONTEXTUAL [16] ∼ 11 1 0.5k 0.5k Fox [17] 2 9 0.7k 2.2k MMTab-eval [28] 1 9 23k 49k ComTQA [19] 1 4 1.6k 9k ChartX [20] 1 7 6k 6k MMC [29] 1 9 1.7k 2.9k OmniDocBench [25] 9 5 1k 1k MMLONGBENCH-DOC [27] 7 2 6.4k 1.1k OCRBench v2 (Ours) 31 23 9.5k 10k Table 1: Comparison between the proposed benchmark and existing text-centric datasets."},{"citing_arxiv_id":"2412.05271","ref_index":125,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling","primary_cat":"cs.CV","submitted_at":"2024-12-06T18:57:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"0 839 69.7 38.9 / 75.2 83.2 / 91.3 InternVL2.5-78B 89.1 / 95.7 88.3 83.4 95.1 84.1 854 71.3 42.4 / 82.3 95.7 / 94.5 Table 7: Comparison of OCR, chart, and document understanding performance. We evaluate OCR-related capabilities across 9 benchmarks, including AI2D [109], ChartQA [181], TextVQA [212], DocVQA [184], InfoVQA [183], OCRBench [158], SEED-2-Plus [125], CharXiv [257], and VCR [302]. Part of results are collected from [64, 54, 8, 257, 302] and the OpenCompass leaderboard [46]. Mathematical reasoning reflects a higher-level reasoning capability and enhances the potential of MLLMs in scientific and engineering applications. 
In the right-hand section of Table 6, we present InternVL 2.5's performance across four multimodal mathematical benchmarks."},{"citing_arxiv_id":"2408.13257","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?","primary_cat":"cs.CV","submitted_at":"2024-08-23T17:59:51+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"MME-RealWorld is the largest manually annotated high-resolution benchmark for MLLMs, where even the best models achieve less than 60% accuracy on challenging real-world tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}