{"total":26,"items":[{"citing_arxiv_id":"2605.22012","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning","primary_cat":"cs.CL","submitted_at":"2026-05-21T05:18:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LatentOmni proposes a latent-space cross-modal reasoning framework that uses feature-level supervision and Omni-Sync Position Embedding to align and synchronize audio-visual latents, supported by a new 35K interleaved reasoning dataset and showing gains over text CoT baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21260","ref_index":96,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"On the Cost and Benefit of Chain of Thought: A Learning-Theoretic Perspective","primary_cat":"cs.LG","submitted_at":"2026-05-20T14:51:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Chain of Thought risk decomposes into oracle-trajectory benefit and trajectory-mismatch cost, with stability determining bounded, linear, or exponential error growth.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21008","ref_index":26,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Survey of Audio Reasoning in Multimodal Foundation Models","primary_cat":"eess.AS","submitted_at":"2026-05-20T10:44:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"A survey that provides a unified formulation of audio reasoning and reviews advances across Audio-to-Text, Audio-to-Speech, Audio-Visual, and Agentic paradigms while discussing challenges and future 
directions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19524","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SafeAlign-VLA: A Negative-Enhanced Safe Alignment Framework for Risk-Aware Autonomous Driving","primary_cat":"cs.RO","submitted_at":"2026-05-19T08:26:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SafeAlign-VLA uses counterfactual safety pairing and anchor-based group relative policy optimization to incorporate negative data for safer VLA-based autonomous driving.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07106","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Retrieve, Integrate, and Synthesize: Spatial-Semantic Grounded Latent Visual Reasoning","primary_cat":"cs.CL","submitted_at":"2026-05-08T01:33:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"UNKNOWN","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RIS improves MLLM latent visual reasoning by retrieving spatial-semantic evidence, integrating it via attention bottlenecks, and synthesizing it with language transition tokens, yielding gains on V*, HRBench, MMVP, and BLINK benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[14] You Li, Chi Chen, Yanghao Li, Fanhu Zeng, Kaiyu Huang, Jinan Xu, and Maosong Sun. Imagination helps visual reasoning, but not yet in latent space.arXiv preprint arXiv:2602.22766, 2026. [15] Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, and Alex Smola. Mul- timodal chain-of-thought reasoning in language models.arXiv preprint arXiv:2302.00923, 2023. [16] Yaoting Wang, Shengqiong Wu, Yuecheng Zhang, Shuicheng Yan, Ziwei Liu, Jiebo Luo, and Hao Fei. 
Multimodal chain-of-thought reasoning: A comprehensive survey.arXiv preprint arXiv:2503.12605, 2025. [17] Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong Tian. Training large language models to reason in a continuous latent space."},{"citing_arxiv_id":"2605.02735","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Visual Latents Know More Than They Say: Unsilencing Latent Reasoning in MLLMs","primary_cat":"cs.LG","submitted_at":"2026-05-04T15:36:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Visual latents in MLLMs are systematically silenced by autoregressive training but can be unsilenced at inference via query-guided contrastive alignment followed by a confidence-progression reward.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.02130","ref_index":67,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"From Where Things Are to What They Are For: Benchmarking Spatial-Functional Intelligence in Multimodal LLMs","primary_cat":"cs.CV","submitted_at":"2026-05-04T01:19:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SFI-Bench shows current multimodal LLMs struggle to integrate spatial memory with functional reasoning and external knowledge in video tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"real2sim2real transfer for policy training via embodied world modeling.arXiv preprint arXiv:2507.05198, 2025. 8 [66] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model's perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024. 
8 [67] Yaoting Wang, Shengqiong Wu, Yuecheng Zhang, Shuicheng Yan, Ziwei Liu, Jiebo Luo, and Hao Fei. Multimodal chain-of-thought reasoning: A comprehensive survey.arXiv preprint arXiv:2503.12605, 2025. 6 [68] Yi Xin, Qi Qin, Siqi Luo, Kaiwen Zhu, Juncheng Yan, Yan Tai, Jiayi Lei, Yuewen Cao, Keqi Wang, Yibin Wang, et al. Lumina-dimoo: An omni diffusion large language model for"},{"citing_arxiv_id":"2604.24339","ref_index":51,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"See Further, Think Deeper: Advancing VLM's Reasoning Ability with Low-level Visual Cues and Reflection","primary_cat":"cs.CV","submitted_at":"2026-04-27T11:31:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ForeSight lets VLMs use low-level visual cues and mask-based visual feedback within an RL loop to reason more accurately, with the 7B model beating same-scale peers and some closed-source SOTA on a new benchmark.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"reasoning, yet they are commonly task-specific and gener- alize poorly in open-domain settings. GThinker [59] pro- posed a more flexible, prompt-driven approach with vision- guided reflection to improve cross-task generalization and interpretability. Recent MCoT [25, 44, 48] methods have begun to explore tool-assisted visual reasoning. However, most MCoT [51] methods are still limited by their reliance on the model's internal knowledge or passive processing of visual inputs, as they currently lack a general-purpose, model-autonomous tool-calling mechanism. 
Reinforcement Learning for VLM. In recent years, RL has gradually emerged as a key approach for enhancing the reasoning capabilities and behavioral alignment of"},{"citing_arxiv_id":"2604.22280","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Beyond Chain-of-Thought: Rewrite as a Universal Interface for Generative Multimodal Embeddings","primary_cat":"cs.CV","submitted_at":"2026-04-24T06:50:11+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems 35 (2022), 24824-24837. [43] Yebo Wu, Feng Liu, Ziwei Xie, Zhiyuan Liu, Changwang Zhang, Jun Wang, and Li Li. 2026. TSEmbed: Unlocking Task Scaling in Universal Multimodal Embeddings. arXiv preprint arXiv:2603.04772 (2026). [44] Yang Xu, Gareth JF Jones, and Bin Wang. 2009. Query dependent pseudo-relevance feedback based on wikipedia. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval. 59-66. Peixi Wu, Ke Mei et al. 
[45] Hao Yu, Zhuokai Zhao, Shen Yan, Lukasz Korycki, Jianyu Wang, Baosheng He, Jiayi Liu, Lizhu Zhang, Xiangjun Fan, and Hanchao Yu."},{"citing_arxiv_id":"2604.21027","ref_index":133,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering","primary_cat":"cs.AI","submitted_at":"2026-04-22T19:18:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.20806","ref_index":61,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model","primary_cat":"cs.CV","submitted_at":"2026-04-22T17:37:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OMIBench benchmark reveals that current LVLMs achieve at most 50% on Olympiad problems requiring reasoning across multiple images.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Recent advances in large vision-language models (LVLMs) have enabled strong performance on demanding reasoning tasks, from elementary arithmetic to Olympiad-level problems that require deep domain knowledge and multi-step inference [39, 40, 11, 61, 35, 24]. A central driver of this progress is chain-of-thought (CoT) prompting [63], which elicits explicit intermediate reasoning steps in natural language [61, 10, 12]. 
In multimodal settings, these techniques enable LVLMs to fuse visual cues with textual information, yielding substantial gains on single-image Olympiad-level benchmarks [74, 16]. However, as illustrated in Figure 1(a), existing multimodal Olympiad benchmarks largely remain restricted to single-image question settings [75, 18, 20]. In real scientific and technical settings, however, problems often"},{"citing_arxiv_id":"2604.20755","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"V-tableR1: Process-Supervised Multimodal Table Reasoning with Critic-Guided Policy Optimization","primary_cat":"cs.AI","submitted_at":"2026-04-22T16:44:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"V-tableR1 uses a critic VLM for dense step-level feedback and a new PGPO algorithm to shift multimodal table reasoning from pattern matching to verifiable logical steps, achieving SOTA accuracy with a 4B open-source model.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Rather than operating in raw pixel space, the policy VLM is supervised during an initial SFT phase to ground its reasoning in relative grid/logical coordinates (e.g., <cell: Row 2, Col 3>). By explicitly outputting these coordinates and their corresponding cell values before executing any arithmetic or comparative logic, the model produces a V-CoT trajectory that is mathematically and structurally verifiable [33,40,45]. 6 Authors Suppressed Due to Excessive Length Fig. 2: Overview of the V-tableR1 framework. The policy VLM generates an explicit Visual Chain-of-Thought (V-CoT) over the table image. 
The critic VLM verifies the visual anchors to distinguish between rigorous inference (Path 1), visual hallucination (Path 2),and shortcutguessing (Path 3).This denseprocessfeedback isthen integrated"},{"citing_arxiv_id":"2604.19083","ref_index":183,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ProjLens: Unveiling the Role of Projectors in Multimodal Model Safety","primary_cat":"cs.CR","submitted_at":"2026-04-21T04:52:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ProjLens shows that backdoor parameters in MLLMs are encoded in low-rank subspaces of the projector and that embeddings shift toward the target direction with magnitude linear in input norm, activating only on poisoned samples.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18946","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Reasoning Structure Matters for Safety Alignment of Reasoning Models","primary_cat":"cs.AI","submitted_at":"2026-04-21T00:50:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Changing the internal reasoning structure of large reasoning models through simple supervised fine-tuning on 1K examples produces strong safety alignment that generalizes across tasks and languages.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.15705","ref_index":41,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Towards Robust Endogenous Reasoning: Unifying Drift Adaptation in Non-Stationary Tuning","primary_cat":"cs.LG","submitted_at":"2026-04-17T05:24:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"CPO++ adapts reinforcement fine-tuning of 
MLLMs to endogenous multi-modal concept drift through counterfactual reasoning and preference optimization, yielding better coherence and cross-domain robustness in safety-critical settings.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"In parallel, multi-modal Chain-of-Thought (MM-CoT) rea- soning [39]-[43] has emerged as an important paradigm for enabling LLMs to align and reason across heteroge- neous modalities such as vision and language. For instance, M3CoT [40] and Ddcot [42] introduce multi-domain, multi- step reasoning frameworks that emphasize structured cross- modal inference, while the survey in [41] provides a compre- hensive taxonomy of MM-CoT paradigms and benchmarks. Compared with these inference-time reasoning approaches, the proposed method introduces a proactive causal interven- tion during training through counterfactual sample generation, which aims to enhance causal robustness within multi-modal reasoning processes. 
This approach is complementary to both"},{"citing_arxiv_id":"2604.14888","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Reasoning Dynamics and the Limits of Monitoring Modality Reliance in Vision-Language Models","primary_cat":"cs.CL","submitted_at":"2026-04-16T11:28:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VLMs show answer inertia in CoT reasoning and remain influenced by misleading textual cues even with sufficient visual evidence, making CoT an incomplete window into modality reliance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.11741","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Collaborative Multi-Agent Scripts Generation for Enhancing Imperfect-Information Reasoning in Murder Mystery Games","primary_cat":"cs.AI","submitted_at":"2026-04-13T17:16:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A multi-agent system creates role-specific murder mystery scripts and applies chain-of-thought fine-tuning plus GRPO reinforcement learning to improve VLMs' multi-hop reasoning under uncertainty and deception.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.10973","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"CFMS: A Coarse-to-Fine Multimodal Synthesis Framework for Enhanced Tabular Reasoning","primary_cat":"cs.AI","submitted_at":"2026-04-13T04:21:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CFMS is a coarse-to-fine framework that uses MLLMs to create a multi-perspective knowledge tuple as a reasoning map for symbolic table operations, yielding competitive 
accuracy on WikiTQ and TabFact.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.10517","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"From Perception to Planning: Evolving Ego-Centric Task-Oriented Spatiotemporal Reasoning via Curriculum Learning","primary_cat":"cs.AI","submitted_at":"2026-04-12T08:14:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EgoTSR applies a three-stage curriculum on a 46-million-sample dataset to build egocentric spatiotemporal reasoning, reaching 92.4% accuracy on long-horizon tasks and reducing chronological biases.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.05497","ref_index":36,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Thinking Diffusion: Penalize and Guide Visual-Grounded Reasoning in Diffusion Multimodal Language Models","primary_cat":"cs.AI","submitted_at":"2026-04-07T06:41:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Position and step penalty plus visual reasoning guidance fix premature answering and weak visual grounding in diffusion MLLMs, delivering up to 7.5% accuracy gains and over 3x speedup.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Qwen2-vl: Enhancing vision-language model's perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024. 1 [35] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reason- ing in language models.arXiv preprint arXiv:2203.11171, 2022. 3 [36] Yaoting Wang, Shengqiong Wu, Yuecheng Zhang, Shuicheng Yan, Ziwei Liu, Jiebo Luo, and Hao Fei. 
Multimodal chain-of-thought reasoning: A comprehensive survey.arXiv preprint arXiv:2503.12605, 2025. 3 [37] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large lan-"},{"citing_arxiv_id":"2604.00013","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"C2F-Thinker: Coarse-to-Fine Reasoning with Hint-Guided Reinforcement Learning for Multimodal Sentiment Analysis","primary_cat":"cs.CL","submitted_at":"2026-03-10T12:48:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"C2F-Thinker combines structured coarse-to-fine chain-of-thought reasoning with hint-guided GRPO reinforcement learning to achieve competitive fine-grained sentiment regression and superior cross-domain generalization in multimodal analysis.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.03944","ref_index":47,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"SCP: Spatial Causal Prediction in Video","primary_cat":"cs.CV","submitted_at":"2026-03-04T11:09:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SCP defines a new benchmark task for predicting spatial causal outcomes beyond direct observation and shows that 23 leading models lag far behind humans on it.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.23253","ref_index":44,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AgroCoT: A Chain-of-Thought Benchmark for Evaluating Reasoning in Vision-Language Models for 
Agriculture","primary_cat":"cs.AI","submitted_at":"2025-11-28T15:02:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"AgroCoT is a new Chain-of-Thought VQA benchmark with 4759 samples to evaluate reasoning capabilities of vision-language models in agriculture.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"AgriCoT (Ours)4,535 MCQ, TFQ, OEQ! ! ! main adaptation in specialized applications. Adapting VLMs to agriculture shows promise but also reveals gaps. Domain-specific initiatives include AgroGPT [2] for agricultural VQA, Agri-LLaV A [39] for crop disease diagnosis, and vision-language pipelines for crop health monitoring, disease recognition [49] or parcel segmentation [44]. However, these models are often trained and evaluated on crop- or symptom-specific datasets with limited scene and sensors [26, 49], and current evaluations rarely cover practical skills such as counting, planting rec- ommendations or environmental management. 
Therefore, there is an urgent need for a benchmark that spans a broader range of agricultural tasks and scenes across multiple sens-"},{"citing_arxiv_id":"2509.23322","ref_index":39,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Mitigating Visual Context Degradation in Large Multimodal Models: A Training-Free Decoupled Agentic Framework","primary_cat":"cs.CV","submitted_at":"2025-09-27T14:13:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DRP decouples reasoning from perception in LMMs by using an LLM reasoner to query an LMM observer for visual details as needed, reducing visual grounding loss.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.20490","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RadAgents: Multimodal Agentic Reasoning for Chest X-ray Interpretation with Radiologist-like Workflows","primary_cat":"cs.MA","submitted_at":"2025-09-24T19:08:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"RadAgents is a multi-agent framework coupling clinical priors with task-aware multimodal reasoning and radiologist-like workflows, plus grounding and retrieval-augmentation for conflict resolution in chest X-ray interpretation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.08128","ref_index":109,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models","primary_cat":"cs.SD","submitted_at":"2025-07-10T19:40:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Audio Flamingo 3 introduces an open large audio-language model 
achieving new state-of-the-art results on over 20 audio understanding and reasoning benchmarks using a unified encoder and curriculum training on open data.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"V oxpopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation. arXiv preprint arXiv:2101.00390, 2021. [108] D. Wang, J. Wu, J. Li, D. Yang, X. Chen, T. Zhang, and H. Meng. Mmsu: A massive multi-task spoken language understanding and reasoning benchmark. arXiv preprint arXiv:2506.04779, 2025. [109] Y . Wang, S. Wu, Y . Zhang, S. Yan, Z. Liu, J. Luo, and H. Fei. Multimodal chain-of-thought reasoning: A comprehensive survey. arXiv preprint arXiv:2503.12605, 2025. [110] B. Weck, I. Manco, E. Benetos, E. Quinton, G. Fazekas, and D. Bogdanov. Muchomu- sic: Evaluating music understanding in multimodal audio-language models. arXiv preprint arXiv:2408."}],"limit":50,"offset":0}