{"total":17,"items":[{"citing_arxiv_id":"2605.19461","ref_index":37,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Beyond Mode Collapse: Distribution Matching for Diverse Reasoning","primary_cat":"cs.AI","submitted_at":"2026-05-19T07:13:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DMPO approximates forward KL minimization in on-policy RL by aligning the policy to a group-level reward-proportional target distribution, yielding 9-12% relative gains over GRPO on NP-Bench and smaller gains on math reasoning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18641","ref_index":36,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Leveraging Latent Visual Reasoning in Silence","primary_cat":"cs.CV","submitted_at":"2026-05-18T16:46:02+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Latent visual reasoning improves multimodal models via training effects even without using latent tokens at inference, enabled by an attention-based RL reward that promotes interaction with text tokens.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18445","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"What's Holding Back Latent Visual Reasoning?","primary_cat":"cs.CV","submitted_at":"2026-05-18T14:14:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Latent visual reasoning fails in current models because standard datasets make oracle latents uninformative and inference-time latents collapse away from useful 
representations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09883","ref_index":38,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"The Cartesian Shortcut: Re-evaluate Vision Reasoning in Polar Coordinate Space","primary_cat":"cs.CV","submitted_at":"2026-05-11T02:16:48+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"questions whether current models genuinely reason over visual content. One line of work reveals that MLLMs fail at perceptual tasks humans find trivial [8, 12, 26, 31, 39]. Another line questions whether strong benchmark scores genuinely reflect visual understanding [4, 7, 38]: evaluation practices may overestimate true capability [7], reasoning modes can amplify hallucination rather than improve visual grounding [38], and frontier models generate elaborate reasoning for images never provided [4]. These concerns connect to shortcut learning [13, 36]: models exploiting superficial regularities that generalize within benchmarks but collapse under distribution shift. 
However, the specific mechanism by which models circumvent genuine visual reasoning remains unidentified."},{"citing_arxiv_id":"2605.07825","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Anisotropic Modality Align","primary_cat":"cs.MM","submitted_at":"2026-05-08T14:53:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Modality representations share dominant semantic geometry but have an anisotropic residual gap; AnisoAlign corrects source representations boundedly using target geometry for unpaired alignment.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05045","ref_index":14,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"When Relations Break: Analyzing Relation Hallucination in Vision-Language Model Under Rotation and Noise","primary_cat":"cs.CV","submitted_at":"2026-05-06T15:41:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Mild rotations and noise significantly increase relation hallucinations in VLMs across models and datasets, with prompt and preprocessing fixes providing only partial relief.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.24763","ref_index":47,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation","primary_cat":"cs.CV","submitted_at":"2026-04-27T17:59:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Tuna-2 shows that direct pixel embeddings can replace vision encoders in unified multimodal models, achieving competitive generation and stronger understanding at 
scale.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.20012","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"EmbodiedMidtrain: Bridging the Gap between Vision-Language Models and Vision-Language-Action Models via Mid-training","primary_cat":"cs.CV","submitted_at":"2026-04-21T21:40:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"EmbodiedMidtrain mid-trains VLMs on curated VLA-aligned data subsets to improve downstream performance on robot manipulation benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18320","ref_index":46,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations","primary_cat":"cs.CV","submitted_at":"2026-04-20T14:20:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Interaction Learning for Enhancing Visual Perception and Reasoning in Vision-Language Models. arXiv preprint arXiv:2510.01304 (2025). [45] Juntian Zhang, Song Jin, Chuanqi Cheng, Yuhan Liu, Yankai Lin, Xun Zhang, Yufei Zhang, Fei Jiang, Guojun Yin, Wei Lin, et al. 2025. Viper: Empowering the self-evolution of visual perception abilities in vision-language model. arXiv preprint arXiv:2510.24285 (2025). [46] Yi-Fan Zhang, Xingyu Lu, Shukang Yin, Chaoyou Fu, Wei Chen, Xiao Hu, Bin Wen, Kaiyu Jiang, Changyi Liu, Tianke Zhang, et al. 2025. 
Thyme: Think Beyond Images. arXiv preprint arXiv:2508.11630 (2025). [47] Andrew Zhao, Yiran Wu, Yang Yue, Tong Wu, Quentin Xu, Matthieu Lin, Shenzhi Wang, Qingyun Wu, Zilong Zheng, and Gao Huang. 2025. Absolute zero: Reinforced self-play reasoning with zero data."},{"citing_arxiv_id":"2604.11025","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Test-time Scaling over Perception: Resolving the Grounding Paradox in Thinking with Images","primary_cat":"cs.CV","submitted_at":"2026-04-13T05:49:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"TTSP resolves the Grounding Paradox by treating perception as a scalable test-time process that generates, filters, and iteratively refines multiple visual exploration traces, outperforming baselines on high-resolution and multimodal reasoning tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Recent work on large language models has shown that increasing inference-time computation can substantially improve reasoning performance [10, 14]. A major line of research achieves this through test-time scaling, where models generate multiple reasoning trajectories and aggregate them via self-consistency [2, 20, 32], reranking [34], verifier-based selection [5], or search [39]. These approaches have been especially successful in language-only settings, where the input is fixed and fully observed, and where diversity across trajectories mainly reflects alternative latent reasoning paths rather than uncertainty in evidence acquisition. 
More broadly, this line of work has established that model performance is shaped not"},{"citing_arxiv_id":"2604.10219","ref_index":100,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models","primary_cat":"cs.AI","submitted_at":"2026-04-11T13:59:05+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.03893","ref_index":52,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FeynmanBench: Benchmarking Multimodal LLMs on Diagrammatic Physics Reasoning","primary_cat":"cs.AI","submitted_at":"2026-04-04T23:18:58+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"dataset","top_context_polarity":"background","context_text":"21979 [51] Yecheng Wu, Zhuoyang Zhang, Junyu Chen, Haotian Tang, Dacheng Li, Yunhao Fang, Ligeng Zhu, Enze Xie, Hongxu Yin, Li Yi, Song Han, and Yao Lu. 2025. VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation. In International Conference on Learning Representations (ICLR). https://openreview.net/forum?id=02haSpO453 [52] Weiye Xu, Jiahao Wang, Weiyun Wang, Zhe Chen, Wengang Zhou, Aijun Yang, Lewei Lu, Houqiang Li, Xiaohua Wang, Xizhou Zhu, Wenhai Wang, Jifeng Dai, and Jinguo Zhu. 2025. VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models. 
arXiv:2504.15279 https://arxiv.org/abs/2504.15279 [53] Yan Yang, Haochen Tian, Yang Shi, Wulin Xie, Yi-Fan Zhang, Yuhao Dong,"},{"citing_arxiv_id":"2602.18600","ref_index":87,"ref_count":4,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MapTab: Are MLLMs Ready for Multi-Criteria Route Planning in Heterogeneous Graphs?","primary_cat":"cs.LG","submitted_at":"2026-02-20T20:22:18+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MapTab is a new multimodal benchmark with 328 images and nearly 200k queries that shows current MLLMs have substantial difficulty with multi-criteria route planning when visual and tabular information must be combined.","context_count":2,"top_context_role":"background","top_context_polarity":"background","context_text":"Geonav: Empowering mllms with explicit geospatial reasoning abilities for language-goal aerial navigation. arXiv preprint arXiv:2504.09587, 2025. [86] Weiye Xu, Jiahao Wang, Weiyun Wang, Zhe Chen, Wengang Zhou, Aijun Yang, Lewei Lu, Houqiang Li, Xiaohua Wang, Xizhou Zhu, et al. Visulogic: A benchmark for evaluating visual reasoning in multi-modal large language models. arXiv preprint arXiv:2504.15279, 2025. [87] Yunzhe Xu, Yiyuan Pan, Zhe Liu, and Hesheng Wang. Flame: Learning to navigate with multimodal llm in urban environments. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 9005-9013, 2025. [88] Hongyu Yan and Zhiqiang Lv. 
A survey of sustainable development of intelligent transportation system based on urban travel demand."},{"citing_arxiv_id":"2602.07026","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Modality Gap-Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models","primary_cat":"cs.CV","submitted_at":"2026-02-02T13:59:39+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.13606","ref_index":42,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ChartVerse: Scaling Chart Reasoning via Reliable Programmatic Synthesis from Scratch","primary_cat":"cs.CV","submitted_at":"2026-01-20T05:11:44+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ChartVerse uses Rollout Posterior Entropy and truth-anchored inverse QA synthesis to produce 640K high-quality chart reasoning samples, training an 8B model that surpasses its 30B teacher.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.20814","ref_index":62,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SPHINX: A Synthetic Environment for Visual Perception and Reasoning","primary_cat":"cs.CV","submitted_at":"2025-11-25T20:00:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SPHINX generates synthetic visual puzzles for benchmarking LVLMs, where GPT-5 scores 51.1% and RLVR training improves both in-domain and external visual reasoning performance.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"reinforcement learning has been applied to strengthen LVLMs [27, 39], progress is 
constrained by benchmarks that emphasize perception over reasoning, such as referring to expression comprehension or math-with-diagram datasets, where models frequently reduce visual inputs to text and rely on language reasoning [62, 71]. More recently, several works have begun to investigate abstract visual reasoning (AVR) in LVLMs [6, 12, 23, 25, 32, 62], yet these efforts still fall short of systematically evaluating core perceptual primitives such as symmetry detection, mental rotation, and structured pattern matching. Cognitive science has long established that these abilities underpin fluid intelligence and matrix"},{"citing_arxiv_id":"2505.07062","ref_index":155,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Seed1.5-VL Technical Report","primary_cat":"cs.CV","submitted_at":"2025-05-11T17:28:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"By doing so, we ensure that the reward model remains reliable and adaptable to changing requirements. This approach helps in improving the generalization capability of the model and maintaining high-quality performance over time. 4.2.3 Data Curation for Reinforcement Learning Our online reinforcement learning implementation employs a variant of the Proximal Policy Optimization (PPO) algorithm [155]. In this approach, the reward signal is derived from the probability assigned by a reward model to the generated answer tokens. In addition, the ground truth response or the best-of-N responses from an SFT model are given as the reference answer to the reward model during PPO training. 
Prompts utilized for RL training were derived from the preference dataset."}],"limit":50,"offset":0}