{"total":19,"items":[{"citing_arxiv_id":"2605.18621","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CrossView Suite: Harnessing Cross-view Spatial Intelligence of MLLMs with Dataset, Model and Benchmark","primary_cat":"cs.CV","submitted_at":"2026-05-18T16:31:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"CrossView Suite supplies a 1.6M-sample dataset, scene-disjoint benchmark, and explicit-alignment framework to advance MLLMs from single-view perception to cross-view spatial intelligence.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11462","ref_index":52,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SpatialForge: Bootstrapping 3D-Aware Spatial Reasoning from Open-World 2D Images","primary_cat":"cs.CV","submitted_at":"2026-05-12T03:20:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SpatialForge synthesizes 10 million spatial QA pairs from in-the-wild 2D images to train VLMs for better depth ordering, layout, and viewpoint-dependent reasoning.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"6 24.6 31.7 41.2 40.3 33.2 Qwen2.5-VL-7B [3] 75.0 83.1 79.0 39.2 37.4 45.0 39.2 38.8 Qwen3-VL-2B [3] 73.7 83.4 78.6 42.6 35.6 41.2 35.7 32.2 Open-sourced Specialized Models SpaceQwen2.5-VL-3B-Instruct [29] 54.9 60.7 57.8 36.9 32.0 47.4 40.3 33.3 Spatial-MLLM-4B [9] - - - 31.5 - - - 32.1 SpaceR-7B [30] 49.9 36.4 43.2 37.6 33.3 32.1 30.3 37.9 SpaceMantis-8B [52] - - - 41.0 26.3 42.3 36.4 22.8 SpatialBot-3B [12] - - - - - 40.2 35.7 - SpatialLadder-3B [10] 72.4 74.9 73.7 34.4 27.1 27.5 26.9 32.5 Ours SpatialForge Qwen3-VL-2B 72.2 (-1.5) 85.2 (+1.8) 78.7 (+0.1) 50.6 (+8.0) 38.0 (+2.4) 44.2 (+3.0) 43.1 (+7.4) 42.3 (+10.1) spatial understanding. 
As shown in Table 3, SpatialForge yields the notable improvements on relation-"},{"citing_arxiv_id":"2604.27393","ref_index":45,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction","primary_cat":"cs.CL","submitted_at":"2026-04-30T04:05:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MiniCPM-o 4.5 uses the Omni-Flow streaming framework to deliver real-time full-duplex omni-modal interaction with proactive behavior in a 9B model that approaches Gemini 2.5 Flash performance.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"evaluates the ability to recognize, extract, and reason over text in visually rich documents and scene images. We use OCRBench [41], TextVQA [42], DocVQA [43], and OmniDocBench [44], which require joint modeling of textual content, visual layout, and document structure. (3) Multi-image understanding. This domain measures the ability to aggregate and compare information across multiple images. We adopt Mantis-Eval [45], MUIRBench [46], and MMSI-Bench [47], which evaluate cross-image reasoning, visual comparison, and multi-image information integration. (4) Hallucination. This domain evaluates whether model responses remain faithful to the visual input. 
We use HallusionBench [48] and MMHal-Bench [49], which measure visual consistency and hallucination in multimodal generation."},{"citing_arxiv_id":"2604.22498","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding","primary_cat":"cs.CV","submitted_at":"2026-04-24T12:26:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CGC improves fine-grained multi-image understanding in MLLMs by constructing contrastive training instances from existing single-image annotations and adding a rule-based spatial reward, achieving SOTA on MIG-Bench and VLM2-Bench with transfer gains to other multimodal tasks.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"Table 1: Performance comparison on MIG-Bench. Best non-human results are shown in bold. Models Base MLLM Spontaneous Grounding Referential Grounding AVE Difference Similarity Visual Reference Textual Visual+Textual Static Robust Common OT MV Region Refer GG Reason Co-Re Human - 99.50 97.87 98.00 100.00 96.88 100.00 98.99 91.06 92.08 97.44 97.18 Large-Scale Models (≥70B) LLaVA-OV-72B [22] - 13.26 5.34 26.84 12.91 7.64 2.14 17.83 21.60 11.88 8.55 13.65 InternVL2-76B [40] - 15.91 10.64 36.40 30.73 20.83 5.74 46.46 41.28 32.67 26.50 26.72 InternVL3-78B [61] - 10.04 9.57 24.12 27.08 14.58 10.44 50.51 38.08 45.54 17.09 24.71 Qwen2-VL-72B [42] - 46.12 46.81 64.46 26.73 22.57 18.62 33.33 62.53 50.50 17.09 38.88 Qwen2.5-VL-72B [3] - 43.75 46."},{"citing_arxiv_id":"2603.25120","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DFLOP: A Data-driven Framework for Multimodal LLM Training Pipeline 
Optimization","primary_cat":"cs.DC","submitted_at":"2026-03-26T07:45:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"DFLOP is a data-driven framework that profiles data-induced computation variance and uses predictive scheduling to balance workloads in multimodal LLM training pipelines, claiming up to 3.6x faster training than existing frameworks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.04676","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Decoding the Pulse of Reasoning VLMs in Multi-Image Understanding Tasks","primary_cat":"cs.CV","submitted_at":"2026-03-04T23:34:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PulseFocus improves multi-image reasoning in VLMs by interleaving planning and attention-gated focus blocks during chain-of-thought, achieving gains on BLINK and MuirBench.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.16518","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MiMo-Embodied: X-Embodied Foundation Model Technical Report","primary_cat":"cs.RO","submitted_at":"2025-11-20T16:34:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MiMo-Embodied is a single foundation model that achieves state-of-the-art results on 17 embodied AI benchmarks and 12 autonomous driving benchmarks through multi-stage learning, curated data, and CoT/RL fine-tuning that produces positive cross-domain 
transfer.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.18154","ref_index":62,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe","primary_cat":"cs.LG","submitted_at":"2025-09-16T19:41:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"An 8B MLLM reaches state-of-the-art efficiency and performance under 30B by combining a unified 3D resampler, joint document-text training, and hybrid RL for reasoning modes.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.18265","ref_index":51,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency","primary_cat":"cs.CV","submitted_at":"2025-08-25T17:58:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and agentic tasks.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"larger model variants also achieve measurable improvements in text and document understanding. 3.4 Multi-Image Understanding To assess InternVL3's ability to understand and reason over multiple images -- a key aspect of multimodal interaction -- we conduct comprehensive evaluations on a suite of widely recognized benchmarks, including BLINK [36], Mantis-Eval [51], MMIU [91], MuirBench [132], MMT-Bench [166], and MIRB [181]. 
These benchmarks evaluate critical skills such as cross-image reasoning and context integration, which are essential for effective multimodal systems. 12 Model Name BLINK (val) Mantis Eval MMIU Muir Bench MMT (val) MIRB (avg) Overall RealWorld QA MME-RW (EN) WildVision (win rate) R-Bench"},{"citing_arxiv_id":"2507.00748","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Improving the Reasoning of Multi-Image Grounding in MLLMs via Reinforcement Learning","primary_cat":"cs.CV","submitted_at":"2025-07-01T13:48:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A pipeline of chain-of-thought data synthesis, LoRA-based supervised fine-tuning, rejection sampling, and rule-based reinforcement learning raises multi-image grounding accuracy by 9.04% on MIG-Bench and 4.41% on average across seven other benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2504.10479","ref_index":51,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models","primary_cat":"cs.CV","submitted_at":"2025-04-14T17:59:25+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"InternVL3-78B attains a remarkable OCRBench score of 906 and VCR scores of 96.0/98.6, clearly surpassing the corresponding metrics of comparable models. 
3.4 Multi-Image Understanding We evaluate the multi-image relation perception and understanding capabilities of InternVL3 across a suite of widely recognized benchmarks, including BLINK [39], Mantis-Eval [51], MMIU [95], MuirBench [118], MMT-Bench [137], and MIRB [153], as presented in Table 4. These benchmarks comprehensively assess skills such as cross-image reasoning and context integration, all of which are crucial for effective multimodal interaction. InternVL3 consistently outperforms its earlier counterparts across different parameter scales. For instance, at"},{"citing_arxiv_id":"2501.13918","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Improving Video Generation with Human Feedback","primary_cat":"cs.CV","submitted_at":"2025-01-23T18:55:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A human preference dataset and VideoReward model enable Flow-DPO and Flow-NRG to produce smoother, better-aligned videos from text prompts in flow-based generators.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807-21818, 2024. [27] Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024. [28] Dongfu Jiang, Xuan He, Huaye Zeng, Cong Wei, Max Ku, Qian Liu, and Wenhu Chen. Mantis: Interleaved multi-image instruction tuning. arXiv preprint arXiv:2405.01483, 2024. [29] Dongfu Jiang, Max Ku, Tianle Li, Yuansheng Ni, Shizhuo Sun, Rongqi Fan, and Wenhu Chen. 
Genai arena: An open evaluation platform for generative models.arXiv preprint arXiv:2406."},{"citing_arxiv_id":"2412.05271","ref_index":100,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling","primary_cat":"cs.CV","submitted_at":"2024-12-06T18:57:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"Screen2Words (en) [240], WebSight (en) [122], Widget-Caption (en) [136], RICOSCA (en) [55], Seeclick (en) [37], ScreenQA (en) [92], AMEX (en) [22], AITW (en) [198], Odyssey (en) [168],GUI UIBert (en) [12], AndroidControl (en) [135], Mind2Web (en) [57], OmniACT (en) [106], WaveUI (en) [4] Type: Multi-Image Datasets Img-Diff (en) [101], Birds-to-Words (en) [100], Spot-the-Diff (en) [100], MultiVQA (en) [100], NLVR2 (en) [216],General QA ContrastiveCaption (en) [100], DreamSim (en) [100], InternVL-SA-1B-Caption (en & zh) [36] Document MP-DocVQA (en) [233], MP-Docmatix (en) [121] Type: Video Datasets Vript (en & zh) [269], OpenVid (en) [190], Mementos (en) [254], ShareGPT4o-Video (en & zh) [35],Captioning ShareGPT4Video (en & zh) [30], VideoGPT+ (en) [174]"},{"citing_arxiv_id":"2409.17146","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models","primary_cat":"cs.CV","submitted_at":"2024-09-25T17:59:51+00:00","verdict":"ACCEPT","verdict_confidence":"UNKNOWN","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Molmo VLMs trained on newly collected PixMo open datasets 
achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"Zhang, Chen Li, Ji Zhang, Qin Jin, Fei Huang, and Jingren Zhou. mplug-docowl 1.5: Unified structure learning for ocr-free document understanding. In Findings of EMNLP, 2024. [43] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, 2021. 18 [44] Dongfu Jiang, Xuan He, Huaye Zeng, Cong Wei, Max Ku, Qian Liu, and Wenhu Chen. Mantis: Interleaved multi-image instruction tuning. arXiv preprint arXiv:2405.01483, 2024. 20 [45] Roy Jonker and Ton Volgenant. A shortest augmenting path algorithm for dense and sparse linear assignment problems. Computing, 1987. 14 [46] Kushal Kafle, Brian Price, Scott Cohen, and Christopher Kanan."},{"citing_arxiv_id":"2409.02813","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark","primary_cat":"cs.CL","submitted_at":"2024-09-04T15:31:26+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2408.04840","ref_index":215,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language 
Models","primary_cat":"cs.CV","submitted_at":"2024-08-09T03:25:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"mPLUG-Owl3 introduces hyper attention blocks to integrate vision and language for long image-sequence understanding and reports SOTA results on single-image, multi-image, and video benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2408.03326","ref_index":47,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LLaVA-OneVision: Easy Visual Task Transfer","primary_cat":"cs.CV","submitted_at":"2024-08-06T17:59:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"8 % 51.7 % 50.7 % 57.9 % 63.1 % MM-LiveBench [161] (2406) Internet Content Understanding 49.9 % 77.1 % 81.5 % - 92.4 % LLaVA-Wilder [65] (small) Realworld Chat 55.0 % 67.8 % 72.0 % 81.0 % 85.9 % Multi-Image LLaVA-Interleave [68] Out-domain 33.3 % 64.2 % 79.9 % 60.3 % - MuirBench [135] Comprehensive Multi-image 25.5 % 41.8 % 54.8 % 62.3 % - Mantis [47] Multi-image in the Wild 39.6 % 64.2 % 77.6 % 62.7 % - BLINK [31] Unusual Visual Scenarios 52.1 % 48.2 % 55.4 % 51.1 % - †Text-rich VQA [84] OCR, Webpage, Document 65.0 % 80.1 % 83.7 % 54.5 % - Video ActivityNetQA [155] Spatio-Temporal Reasoning 50.5 % 56.6 % 62.3 % 57.0 % - EgoSchema [98] Egocentric Video Understanding 26.8 % 60.1 % 62.0 % - - PerceptionTest [115]"},{"citing_arxiv_id":"2407.07895","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LLaVA-NeXT-Interleave: Tackling Multi-image, 
Video, and 3D in Large Multimodal Models","primary_cat":"cs.CV","submitted_at":"2024-07-10T17:59:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LLaVA-NeXT-Interleave unifies multi-image, video, and 3D capabilities in large multimodal models via a new 1.18M-sample interleaved dataset and benchmark, achieving leading results across those tasks while preserving single-image performance.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"plicitly enable the ICL capability, MIMIC-IT [25] proposes an automatic pipeline to create 2.8M multimodal samples in the instruction-tuning stage. On the other hand, the latter multi-image scenarios aim to tackle diverse real-world applications scenarios that involve multi-images. The training data of VPG-C [27] collected 4 new datasets with ChatGPT. Mantis-Instruct [19] compiles existing 11 interleaved datasets and creates 4 new datasets. The proposed M4-Instruct [19] compiles existing 41 interleaved datasets and creates 6 new datasets, covering a much higher scenarios diversity than Mantis-Instruct. Interleaved LMMs. 
As representative closed-source LMMs, both GPT-4V [42] and Gemini [12] support real-world multi-image application scenarios with leading per-"},{"citing_arxiv_id":"2406.09411","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding","primary_cat":"cs.CV","submitted_at":"2024-06-13T17:59:52+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MuirBench is a new benchmark showing that top multimodal LLMs struggle with robust multi-image understanding, with GPT-4o at 68% and open-source models below 33% accuracy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}