{"total":16,"items":[{"citing_arxiv_id":"2606.22873","ref_index":299,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SingGuard: A Policy-Adaptive Multimodal LLM Guardrail with Dynamic Reasoning","primary_cat":"cs.CV","submitted_at":"2026-06-22T05:37:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SingGuard introduces a policy-adaptive multimodal LLM guardrail with dynamic reasoning regimes and SingGuard-Bench, reporting SOTA F1 scores across 35 datasets and improved policy-following accuracy under runtime shifts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.07643","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AVI-Bench: Toward Human-like Audio-Visual Intelligence of Omni-MLLMs","primary_cat":"cs.CV","submitted_at":"2026-06-01T19:12:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AVI-Bench is a cognitively inspired benchmark that evaluates Omni-MLLMs on joint audio-visual tasks and reveals substantial limitations in current models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.25343","ref_index":51,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Toward Native Multimodal Modeling: A Roadmap","primary_cat":"cs.CV","submitted_at":"2026-05-25T01:57:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A roadmap that defines architectural nativity for multimodal models and categorizes them into Multi-to-Text, Multi-to-Target, and Multi-to-Multi types while outlining an industrial pipeline toward unified transformer-based native multimodal 
modeling.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"5-VL [24], Qwen3-VL [5], and InternVL-3.5 [7], culminating in scaling attempts like GLM-5V-Turbo [22] and Kimi K2.5 [21]. Yet, early-fusion represents a native convergent regime where all modalities are modeled within a unified embedding space via one unified backbone. This born-native design, explored by Transfusion [49], Chameleon [50], and AnyGPT [51], achieves omnipresent synergy by treating all modalities equivalently. Building upon this structural taxonomy, we organize the existing NMM ecosystem through the lens of input-output duality into three functional categories to capture the full spectrum of modality flows. (i) The first category, Multi-to- Text (M2T) unimodal generation, leverages native scaling to ground cross-modal inputs into purely linguistic responses"},{"citing_arxiv_id":"2605.11605","ref_index":59,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Keep What Audio Cannot Say: Context-Preserving Token Pruning for Omni-LLMs","primary_cat":"cs.CV","submitted_at":"2026-05-12T06:35:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ContextGuard prunes 55% of tokens in Qwen2.5-Omni 7B while matching full performance on five of six audio-visual benchmarks by preserving audio-irrecoverable visual context.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Fit and Prune: Fast and Training-free Visual Token Pruning for Multi-modal Large Language Models. InProc. AAAI, 2025. [58] Tianyu Yu, Yuan Yao, Haoye Zhang, Taiwen He, Yifeng Han, Ganqu Cui, Jinyi Hu, Zhiyuan Liu, Hai-Tao Zheng, Maosong Sun, et al. RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback. InProc. CVPR, 2024. 
[59] Jun Zhan, Junqi Dai, Jiasheng Ye, Yunhua Zhou, Dong Zhang, Zhigeng Liu, Xin Zhang, Ruibin Yuan, Ge Zhang, Linyang Li, et al. AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling.arXiv:2402.12226, 2024. [60] Hang Zhang, Xin Li, and Lidong Bing. Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding. InProc. EMNLP, 2023."},{"citing_arxiv_id":"2604.21921","ref_index":48,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Context Unrolling in Omni Models","primary_cat":"cs.CV","submitted_at":"2026-04-23T17:58:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Omni is a multimodal model whose native training on diverse data types enables context unrolling, allowing explicit reasoning across modalities to better approximate shared knowledge and improve downstream performance.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"multi-grained video editing. InThe Thirteenth International Conference on Learning Representations, 2025. [47] Danah Yatim, Rafail Fridman, Omer Bar-Tal, Yoni Kasten, and Tali Dekel. Space-time diffusion features for zero-shot text-driven motion transfer. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8466-8476, 2024. [48] Jun Zhan and collaborators. Anygpt: Unified multimodal llm with discrete sequence modeling.arXiv preprint arXiv:2402.12226, 2024. URLhttps://arxiv.org/abs/2402.12226. [49] Shangzhan Zhang, Jianyuan Wang, Yinghao Xu, Nan Xue, Christian Rupprecht, Xiaowei Zhou, Yujun Shen, and Gordon Wetzstein. 
Flare: Feed-forward geometry, appearance and camera estimation from uncalibrated sparse"},{"citing_arxiv_id":"2604.08125","ref_index":87,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PolySLGen: Online Multimodal Speaking-Listening Reaction Generation in Polyadic Interaction","primary_cat":"cs.CV","submitted_at":"2026-04-09T11:46:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PolySLGen generates contextually appropriate and temporally coherent multimodal speaking and listening reactions for polyadic interactions by fusing group motion and social cues.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.20901","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Benchmarking and Enhancing VLM for Compressed Image Understanding","primary_cat":"cs.CV","submitted_at":"2025-12-24T02:59:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces a benchmark for VLMs on compressed images and a universal adaptor to improve performance across codecs and bitrates.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.14234","ref_index":130,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body","primary_cat":"cs.CV","submitted_at":"2025-12-16T09:41:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ViBES introduces a speech-language-behavior model using modality-specific transformer experts that jointly generates dialogue and 3D body actions, showing gains over separate co-speech and text-to-motion baselines on multi-turn 
metrics.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"We evaluate on the YouTube test split of Conver-3D using standard metrics-Lip Vertex Error (LVE) for lip syn- chronization [94], Upper-Face Dynamic Deviation (FDD) Table 3. Quantitative results on speech metrics. \"↑\" indicates higher-is-better; best values are inbold. Methods Context Relevance↑ Character Consistency↑ SynMSI [46] (GroundTruth) 4.838 4.893 LLM+Speech (Llama2) 3.859 3.157 AnyGPT [130] (fine-tune) 3.803 - DLP [8] (MotionGPT) 3.577 3.785 SOLAMI(LoRA) [46] 0.824 3.634 SOLAMI(full params) [46] 0.824 3.634 Ours4.584 4.376 Hi, how are you? Howareyou?When was the last time you laughed so hard your stomach hurt, and why? I'm good! Could you perform a walking- in-a-circle motion? Interesting! Could you perform a jumping motion? Hey!"},{"citing_arxiv_id":"2512.01537","ref_index":80,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Two-Dimensional Quantization for Geometry-Aware Audio Coding","primary_cat":"cs.SD","submitted_at":"2025-12-01T11:06:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Q2D2 uses 2D geometric grid projections to quantize feature pairs in neural audio codecs, yielding implicit codebooks that improve efficiency and utilization over RVQ, VQ, and FSQ while maintaining reconstruction quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.20215","ref_index":42,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Qwen2.5-Omni Technical Report","primary_cat":"cs.CL","submitted_at":"2025-03-26T04:17:55+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Qwen2.5-Omni presents a multimodal model with block-wise encoders, TMRoPE position embeddings, and a Thinker-Talker 
architecture that enables simultaneous text and streaming speech generation while matching text performance on reasoning benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.12605","ref_index":213,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey","primary_cat":"cs.CV","submitted_at":"2025-03-16T18:39:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"The paper provides the first comprehensive survey of multimodal chain-of-thought reasoning, including foundational concepts, a taxonomy of methodologies, application analyses, challenges, and future directions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Comprehension&Generation Multi- Modal outputs ··· embeddings/tokensembeddings/tokens tokens embeddings/tokens Figure 4: Common architectures for comprehension-only and comprehension-generation MLLMs. Prior works, such as NExT-GPT [14] advances this objective for the first time by integrating mul- timodal adapters with various diffusion models. AnyGPT [213] utilizes multimodal discrete tokens to facilitate the generation of diverse multimodal content. Subsequently, Mini-Omni2 [214, 215] in- troduces a command-based interruption mechanism, enhancing user interaction and aligning further with GPT-4o's capabilities. 
Compared to MLLMs that only support comprehension, as shown in Figure 4, MLLMs that integrate both comprehension and generation either utilize an autoregressive"},{"citing_arxiv_id":"2501.01957","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction","primary_cat":"cs.CV","submitted_at":"2025-01-03T18:59:52+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":4.0,"formal_verification":"none","one_line_summary":"VITA-1.5 integrates vision and speech into a single LLM through multi-stage training, delivering competitive benchmark results on image, video, and speech tasks with near real-time response speed.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2409.07825","ref_index":77,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Deep Multimodal Learning with Missing Modality: A Survey","primary_cat":"cs.CV","submitted_at":"2024-09-12T08:15:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"This survey provides the first comprehensive overview of deep multimodal learning methods designed to remain robust when some input modalities are absent.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2409.04429","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation","primary_cat":"cs.CV","submitted_at":"2024-09-06T17:49:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VILA-U unifies visual understanding and generation inside one autoregressive next-token prediction model, removing separate diffusion components while claiming near 
state-of-the-art results.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2403.18814","ref_index":52,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models","primary_cat":"cs.CV","submitted_at":"2024-03-27T17:59:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Mini-Gemini enhances VLMs via high-resolution visual refinement, curated reasoning data, and self-guided generation to reach leading zero-shot benchmark results across 2B-34B LLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2306.13549","ref_index":149,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Survey on Multimodal Large Language Models","primary_cat":"cs.CV","submitted_at":"2023-06-23T15:21:52+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.","context_count":1,"top_context_role":"method","top_context_polarity":"background","context_text":"Increased support for modalities is a tendency for MLLM studies. On the one hand, researchers have explored adapting MLLMs to support the input of more multimodal content, such as 3D point cloud [41], [143], [144], [145]. On the other hand, MLLMs are also extended to generate responses of more modalities, such as image [32], [146], [147], [148], audio [32], [147], [149], [150], and video [32], [151]. 
For example, NExT-GPT [32] proposes a framework that supports inputs and outputs of mixed modalities, specifically, combinations of text, image, audio, and video, with the help of diffusion models [152], [153] attached to the MLLM. The framework applies an"}],"limit":50,"offset":0}