Chain of Modality dynamically orchestrates multimodal input topologies and bifurcates cognitive execution to overcome static fusion biases in Omni-MLLMs.
hub Canonical reference
Ming-omni: A unified multimodal model for perception and generation
Canonical reference. 100% of citing Pith papers cite this work as background.
hub tools
citation-role summary
citation-polarity summary
roles
background 8polarities
background 8representative citing papers
UniVL unifies vision and language into one mask-rendered input processed by an OCR backbone to condition diffusion models for spatially grounded image generation without a standalone text encoder.
Video MLLMs show an audio-visual Clever Hans effect relying on visual-acoustic correlations rather than audio verification; Thud interventions diagnose it and a 10K-sample preference alignment improves intervention performance by 28 points.
Maestro accelerates compound LLM training via section graphs for per-component configuration and wavefront scheduling for dynamic execution, reducing GPU consumption by ~40% in real deployments.
SMoES improves MoE-VLM performance and efficiency via soft modality-guided expert routing and inter-bin mutual information regularization, yielding 0.9-4.2% task gains and 56% communication reduction.
Omni-modal LLMs exhibit visual preference that emerges in mid-to-late layers, enabling hallucination detection without task-specific training.
Dual Tuning is a data curation method that jointly scores training examples for benefit and for reasoning-gain to choose between reasoning and direct-answer post-training modes for multimodal LLMs.
OmniZip introduces an audio-guided dynamic token compression framework that achieves 3.42X inference speedup and 1.4X memory reduction for omnimodal LLMs without any training.
OmniRefine introduces alignment-aware chunk refinement via similarity and dynamic programming followed by modality-cooperative token compression, achieving near-baseline accuracy at 44% token retention on WorldSense.
Omni-Fake delivers a unified multimodal deepfake benchmark dataset and RL-driven detector that reports gains in accuracy, cross-modal generalization, and explainability over prior baselines.
Omni is a multimodal model whose native training on diverse data types enables context unrolling, allowing explicit reasoning across modalities to better approximate shared knowledge and improve downstream performance.
Audio-Cogito is an open-source LALM using Cogito-pipe data curation and self-distillation to achieve leading open-source performance on audio reasoning benchmarks.
OmniFysics is an omni-modal network using a dynamic physical data engine and evolutive tuning to improve performance on multimodal benchmarks and physics-oriented tasks.
citing papers explorer
-
Chain of Modality: From Static Fusion to Dynamic Orchestration in Omni-MLLMs
Chain of Modality dynamically orchestrates multimodal input topologies and bifurcates cognitive execution to overcome static fusion biases in Omni-MLLMs.
-
UniVL: Unified Vision-Language Embedding for Spatially Grounded Contextual Image Generation
UniVL unifies vision and language into one mask-rendered input processed by an OCR backbone to condition diffusion models for spatially grounded image generation without a standalone text encoder.
-
When Vision Speaks for Sound
Video MLLMs show an audio-visual Clever Hans effect relying on visual-acoustic correlations rather than audio verification; Thud interventions diagnose it and a 10K-sample preference alignment improves intervention performance by 28 points.
-
Accelerating Compound LLM Training Workloads with Maestro
Maestro accelerates compound LLM training via section graphs for per-component configuration and wavefront scheduling for dynamic execution, reducing GPU consumption by ~40% in real deployments.
-
SMoES: Soft Modality-Guided Expert Specialization in MoE-VLMs
SMoES improves MoE-VLM performance and efficiency via soft modality-guided expert routing and inter-bin mutual information regularization, yielding 0.9-4.2% task gains and 56% communication reduction.
-
Beyond Text-Dominance: Understanding Modality Preference of Omni-modal Large Language Models
Omni-modal LLMs exhibit visual preference that emerges in mid-to-late layers, enabling hallucination detection without task-specific training.
-
Dual Tuning for Reasoning Efficacy-Driven Data Curation in Multimodal LLM Training
Dual Tuning is a data curation method that jointly scores training examples for benefit and for reasoning-gain to choose between reasoning and direct-answer post-training modes for multimodal LLMs.
-
OmniZip: Audio-Guided Dynamic Token Compression for Fast Omnimodal Large Language Models
OmniZip introduces an audio-guided dynamic token compression framework that achieves 3.42X inference speedup and 1.4X memory reduction for omnimodal LLMs without any training.
-
OmniRefine: Alignment-Aware Cooperative Compression for Efficient Omnimodal Large Language Models
OmniRefine introduces alignment-aware chunk refinement via similarity and dynamic programming followed by modality-cooperative token compression, achieving near-baseline accuracy at 44% token retention on WorldSense.
-
Omni-Fake: Benchmarking Unified Multimodal Social Media Deepfake Detection
Omni-Fake delivers a unified multimodal deepfake benchmark dataset and RL-driven detector that reports gains in accuracy, cross-modal generalization, and explainability over prior baselines.
-
Context Unrolling in Omni Models
Omni is a multimodal model whose native training on diverse data types enables context unrolling, allowing explicit reasoning across modalities to better approximate shared knowledge and improve downstream performance.
-
Audio-Cogito: Towards Deep Audio Reasoning in Large Audio Language Models
Audio-Cogito is an open-source LALM using Cogito-pipe data curation and self-distillation to achieve leading open-source performance on audio reasoning benchmarks.
-
OmniFysics: Towards Physical Intelligence Evolution via Omni-Modal Signal Processing and Network Optimization
OmniFysics is an omni-modal network using a dynamic physical data engine and evolutive tuning to improve performance on multimodal benchmarks and physics-oriented tasks.
- Multimodal Large Language Model-Enabled Video Translation: A Role-Oriented Survey