hub Mixed citations

Vita: Towards open-source interactive omni multimodal llm

Chaoyou Fu, Haojia Lin, Zuwei Long, Yunhang Shen, Meng Zhao, Yifan Zhang, Shaoqi Dong, Xiong Wang, Di Yin, Long Ma, et al · 2024 · arXiv 2408.05211

Mixed citation behavior. Most common role is background (60%).

19 Pith papers citing it

Background 60% of classified citations

read on arXiv browse 19 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 4 method 1

citation-polarity summary

background 3 unclear 1 use method 1

representative citing papers

TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

cs.CV · 2026-05-08 · unverdicted · novelty 8.0

TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

cs.CV · 2024-09-25 · accept · novelty 8.0

Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?

cs.CV · 2024-08-23 · conditional · novelty 8.0

MME-RealWorld is the largest manually annotated high-resolution benchmark for MLLMs, where even the best models achieve less than 60% accuracy on challenging real-world tasks.

Omni-DuplexEval: Evaluating Real-time Duplex Omni-modal Interaction

cs.CV · 2026-05-17 · conditional · novelty 7.0

Omni-DuplexEval creates a new benchmark and LLM-as-a-Judge framework for real-time duplex omni-modal interaction, revealing that current models score below 40% overall and struggle especially with proactive responses.

How Should LLMs Listen While Speaking? A Study of User-Stream Routing in Full-Duplex Spoken Dialogue

cs.CL · 2026-05-11 · unverdicted · novelty 7.0

Channel fusion gives better semantic grounding and QA performance in full-duplex LLM dialogue but is vulnerable to context corruption during interruptions, while cross-attention routing is more robust at the cost of weaker integration.

OmniGUI: Benchmarking GUI Agents in Omni-Modal Smartphone Environments

cs.HC · 2026-04-03 · unverdicted · novelty 7.0

OmniGUI is the first step-level benchmark supplying interleaved image, audio, and video inputs across 709 expert episodes in 29 smartphone apps to evaluate multimodal GUI agents.

See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models

cs.CV · 2025-12-01 · unverdicted · novelty 7.0

AV-SpeakerBench is a new speaker-centered benchmark showing that top multimodal models still struggle with fine-grained audiovisual speech understanding, with Gemini 2.5 Pro leading but open models lagging on fusion.

VoiceBench: Benchmarking LLM-Based Voice Assistants

cs.CL · 2024-10-22 · unverdicted · novelty 7.0

VoiceBench is the first benchmark for multi-faceted evaluation of LLM voice assistants using real and synthetic spoken instructions with speaker, environmental, and content variations.

MiniMind-O Technical Report: An Open Small-Scale Speech-Native Omni Model

cs.SD · 2026-05-05 · accept · novelty 6.0

MiniMind-O delivers a working 0.1B-scale open omni model with speech-native output, Thinker-Talker split, frozen encoders, and full release of code, checkpoints, and training data.

EmoMM: Benchmarking and Steering MLLM for Multimodal Emotion Recognition under Conflict and Missingness

cs.CV · 2026-05-01 · unverdicted · novelty 6.0

EmoMM benchmark reveals Video Contribution Collapse in MLLMs for emotion recognition under modality conflict and missingness, mitigated by CHASE head-level attention steering.

UAF: A Unified Audio Front-end LLM for Full-Duplex Speech Interaction

cs.AI · 2026-04-21 · unverdicted · novelty 6.0

UAF is the first unified audio front-end LLM that turns multiple front-end tasks into one sequence prediction model processing streaming audio chunks and reference prompts to output semantic and control tokens for full-duplex interaction.

Beyond Text-Dominance: Understanding Modality Preference of Omni-modal Large Language Models

cs.AI · 2026-04-18 · unverdicted · novelty 6.0

Omni-modal LLMs exhibit visual preference that emerges in mid-to-late layers, enabling hallucination detection without task-specific training.

RaTA-Tool: Retrieval-based Tool Selection with Multimodal Large Language Models

cs.CV · 2026-04-16 · unverdicted · novelty 6.0

RaTA-Tool retrieves suitable external tools for multimodal queries by matching generated task descriptions against tool metadata, supported by a new Hugging Face-derived dataset and DPO optimization.

OmniZip: Audio-Guided Dynamic Token Compression for Fast Omnimodal Large Language Models

cs.CV · 2025-11-18 · conditional · novelty 6.0

OmniZip introduces an audio-guided dynamic token compression framework that achieves 3.42X inference speedup and 1.4X memory reduction for omnimodal LLMs without any training.

DeepEyesV2: Toward Agentic Multimodal Model

cs.CV · 2025-11-07 · unverdicted · novelty 6.0

DeepEyesV2 uses a two-stage cold-start plus reinforcement learning pipeline to produce an agentic multimodal model that adaptively invokes tools and outperforms direct RL on real-world reasoning benchmarks.

Training-Free Multimodal Large Language Model Orchestration

cs.CL · 2025-08-06 · unverdicted · novelty 6.0 · 2 refs

LLM Orchestration integrates modality experts via an LLM controller, cross-modal memory, and interaction layer to enable multimodal input-output without gradient-based training.

OmniRefine: Alignment-Aware Cooperative Compression for Efficient Omnimodal Large Language Models

cs.AI · 2026-05-12 · unverdicted · novelty 5.0

OmniRefine introduces alignment-aware chunk refinement via similarity and dynamic programming followed by modality-cooperative token compression, achieving near-baseline accuracy at 44% token retention on WorldSense.

VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

cs.CV · 2025-01-03 · conditional · novelty 4.0

VITA-1.5 integrates vision and speech into a single LLM through multi-stage training, delivering competitive benchmark results on image, video, and speech tasks with near real-time response speed.

A Survey of Audio Reasoning in Multimodal Foundation Models

eess.AS · 2026-05-20 · unverdicted · novelty 2.0

A survey that provides a unified formulation of audio reasoning and reviews advances across Audio-to-Text, Audio-to-Speech, Audio-Visual, and Agentic paradigms while discussing challenges and future directions.

citing papers explorer

Showing 19 of 19 citing papers.

TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos cs.CV · 2026-05-08 · unverdicted · none · ref 49
TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models cs.CV · 2024-09-25 · accept · none · ref 34
Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans? cs.CV · 2024-08-23 · conditional · none · ref 18
MME-RealWorld is the largest manually annotated high-resolution benchmark for MLLMs, where even the best models achieve less than 60% accuracy on challenging real-world tasks.
Omni-DuplexEval: Evaluating Real-time Duplex Omni-modal Interaction cs.CV · 2026-05-17 · conditional · none · ref 24
Omni-DuplexEval creates a new benchmark and LLM-as-a-Judge framework for real-time duplex omni-modal interaction, revealing that current models score below 40% overall and struggle especially with proactive responses.
How Should LLMs Listen While Speaking? A Study of User-Stream Routing in Full-Duplex Spoken Dialogue cs.CL · 2026-05-11 · unverdicted · none · ref 12
Channel fusion gives better semantic grounding and QA performance in full-duplex LLM dialogue but is vulnerable to context corruption during interruptions, while cross-attention routing is more robust at the cost of weaker integration.
OmniGUI: Benchmarking GUI Agents in Omni-Modal Smartphone Environments cs.HC · 2026-04-03 · unverdicted · none · ref 7
OmniGUI is the first step-level benchmark supplying interleaved image, audio, and video inputs across 709 expert episodes in 29 smartphone apps to evaluate multimodal GUI agents.
See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models cs.CV · 2025-12-01 · unverdicted · none · ref 10
AV-SpeakerBench is a new speaker-centered benchmark showing that top multimodal models still struggle with fine-grained audiovisual speech understanding, with Gemini 2.5 Pro leading but open models lagging on fusion.
VoiceBench: Benchmarking LLM-Based Voice Assistants cs.CL · 2024-10-22 · unverdicted · none · ref 71
VoiceBench is the first benchmark for multi-faceted evaluation of LLM voice assistants using real and synthetic spoken instructions with speaker, environmental, and content variations.
MiniMind-O Technical Report: An Open Small-Scale Speech-Native Omni Model cs.SD · 2026-05-05 · accept · none · ref 6
MiniMind-O delivers a working 0.1B-scale open omni model with speech-native output, Thinker-Talker split, frozen encoders, and full release of code, checkpoints, and training data.
EmoMM: Benchmarking and Steering MLLM for Multimodal Emotion Recognition under Conflict and Missingness cs.CV · 2026-05-01 · unverdicted · none · ref 8
EmoMM benchmark reveals Video Contribution Collapse in MLLMs for emotion recognition under modality conflict and missingness, mitigated by CHASE head-level attention steering.
UAF: A Unified Audio Front-end LLM for Full-Duplex Speech Interaction cs.AI · 2026-04-21 · unverdicted · none · ref 12
UAF is the first unified audio front-end LLM that turns multiple front-end tasks into one sequence prediction model processing streaming audio chunks and reference prompts to output semantic and control tokens for full-duplex interaction.
Beyond Text-Dominance: Understanding Modality Preference of Omni-modal Large Language Models cs.AI · 2026-04-18 · unverdicted · none · ref 13
Omni-modal LLMs exhibit visual preference that emerges in mid-to-late layers, enabling hallucination detection without task-specific training.
RaTA-Tool: Retrieval-based Tool Selection with Multimodal Large Language Models cs.CV · 2026-04-16 · unverdicted · none · ref 11
RaTA-Tool retrieves suitable external tools for multimodal queries by matching generated task descriptions against tool metadata, supported by a new Hugging Face-derived dataset and DPO optimization.
OmniZip: Audio-Guided Dynamic Token Compression for Fast Omnimodal Large Language Models cs.CV · 2025-11-18 · conditional · none · ref 12
OmniZip introduces an audio-guided dynamic token compression framework that achieves 3.42X inference speedup and 1.4X memory reduction for omnimodal LLMs without any training.
DeepEyesV2: Toward Agentic Multimodal Model cs.CV · 2025-11-07 · unverdicted · none · ref 15
DeepEyesV2 uses a two-stage cold-start plus reinforcement learning pipeline to produce an agentic multimodal model that adaptively invokes tools and outperforms direct RL on real-world reasoning benchmarks.
Training-Free Multimodal Large Language Model Orchestration cs.CL · 2025-08-06 · unverdicted · none · ref 14 · 2 links
LLM Orchestration integrates modality experts via an LLM controller, cross-modal memory, and interaction layer to enable multimodal input-output without gradient-based training.
OmniRefine: Alignment-Aware Cooperative Compression for Efficient Omnimodal Large Language Models cs.AI · 2026-05-12 · unverdicted · none · ref 13
OmniRefine introduces alignment-aware chunk refinement via similarity and dynamic programming followed by modality-cooperative token compression, achieving near-baseline accuracy at 44% token retention on WorldSense.
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction cs.CV · 2025-01-03 · conditional · none · ref 12
VITA-1.5 integrates vision and speech into a single LLM through multi-stage training, delivering competitive benchmark results on image, video, and speech tasks with near real-time response speed.
A Survey of Audio Reasoning in Multimodal Foundation Models eess.AS · 2026-05-20 · unverdicted · none · ref 91
A survey that provides a unified formulation of audio reasoning and reviews advances across Audio-to-Text, Audio-to-Speech, Audio-Visual, and Agentic paradigms while discussing challenges and future directions.

Vita: Towards open-source interactive omni multimodal llm

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer