Qwen2.5-Omni Technical Report
Pith reviewed 2026-05-10 17:50 UTC · model grok-4.3
The pith
Qwen2.5-Omni processes text, images, audio, and video inputs while generating text and streaming speech in one end-to-end system.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Qwen2.5-Omni achieves state-of-the-art performance on multimodal benchmarks like Omni-Bench. Its end-to-end speech instruction following is comparable to its text capabilities on MMLU and GSM8K, and its streaming Talker outperforms most existing alternatives in robustness and naturalness.
What carries the argument
The Thinker-Talker architecture, in which the Thinker operates as a large language model for text generation while the Talker directly consumes the Thinker's hidden representations to autoregressively produce audio tokens, together with Time-aligned Multimodal RoPE (TMRoPE), which interleaves audio and video tokens so their timestamps stay synchronized.
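To make the handoff concrete, here is a minimal sketch in PyTorch with toy dimensions. Everything here is illustrative, not the paper's implementation: causal masks are omitted, the Talker's dual-track design is reduced to a single decoder, and all class and variable names are invented for this sketch.

```python
# Minimal sketch of the Thinker-Talker handoff, with toy dimensions.
# Illustrative only: causal masks and the Talker's dual-track design are
# omitted, and none of these class names come from the paper.
import torch
import torch.nn as nn

class ToyThinker(nn.Module):
    def __init__(self, vocab=1000, d=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True),
            num_layers=2)
        self.lm_head = nn.Linear(d, vocab)

    def forward(self, ids):
        h = self.backbone(self.embed(ids))  # hidden states, shape (B, T, d)
        return self.lm_head(h), h           # text logits AND the raw states

class ToyTalker(nn.Module):
    """Predicts audio codec tokens by cross-attending to Thinker states."""
    def __init__(self, audio_vocab=512, d=64):
        super().__init__()
        self.embed = nn.Embedding(audio_vocab, d)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model=d, nhead=4, batch_first=True),
            num_layers=2)
        self.audio_head = nn.Linear(d, audio_vocab)

    def forward(self, audio_ids, thinker_hidden):
        # Conditions on the Thinker's hidden representations directly,
        # not on its sampled text, so both streams can run concurrently.
        h = self.decoder(self.embed(audio_ids), memory=thinker_hidden)
        return self.audio_head(h)

thinker, talker = ToyThinker(), ToyTalker()
text_logits, hidden = thinker(torch.randint(0, 1000, (1, 8)))
audio_logits = talker(torch.randint(0, 512, (1, 4)), hidden)
print(audio_logits.shape)  # torch.Size([1, 4, 512])
```

The design choice the sketch isolates: the Talker never waits for decoded text, only for hidden states, which is what lets text and speech emerge in parallel.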
If this is right
- Text and speech can be produced at the same time because the Talker reads the Thinker's states directly.
- Video and audio inputs stay time-aligned through sequential interleaving and the new position embedding.
- Streaming speech decoding uses a sliding-window diffusion transformer that limits the initial delay (see the mask sketch after this list).
- The single model matches or exceeds the performance of prior separate audio and vision systems on shared benchmarks.
- End-to-end training of both components becomes possible without modality-specific post-processing stages.
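The restricted receptive field is what caps the first-packet delay: if each code frame may attend only to a fixed window of neighbors, decoding can begin as soon as that window is available rather than after the whole sequence. A minimal sketch of such a window mask (NumPy; the window sizes are illustrative, not the report's values):

```python
# Minimal sketch of a sliding-window attention mask, as used to restrict a
# DiT decoder's receptive field. Window sizes here are illustrative only.
import numpy as np

def sliding_window_mask(seq_len, lookback, lookahead):
    """mask[i, j] is True where position i may attend to position j."""
    idx = np.arange(seq_len)
    rel = idx[None, :] - idx[:, None]   # j - i for every (i, j) pair
    return (rel >= -lookback) & (rel <= lookahead)

mask = sliding_window_mask(seq_len=10, lookback=4, lookahead=2)
# Position 0 only needs positions 0..2 to be available, so the first audio
# chunk can be decoded after 2 future frames arrive instead of all 9.
print(mask[0])  # [ True  True  True False False ... ]
```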
Where Pith is reading between the lines
- The same hidden-state handoff could extend to additional output modalities such as video or code if the Talker module is swapped.
- Real-time conversational systems would gain lower latency because one forward pass supplies both text and speech tokens.
- Training data requirements might decrease if the shared Thinker representations transfer across modalities more efficiently than isolated encoders.
- Deployment on edge devices could simplify because only one set of weights needs quantization and serving.
Load-bearing premise
The Thinker-Talker split and TMRoPE fully remove interference between modalities and timestamp misalignment without hidden costs in training stability or generalization that only show up on wider or out-of-distribution tests.
What would settle it
The claim would be falsified by a side-by-side evaluation on out-of-distribution multimodal tasks in which the unified model's accuracy falls noticeably below that of separately trained modality specialists; that outcome would show the interference-avoidance claim does not hold.
Original abstract
In this report, we present Qwen2.5-Omni, an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner. To enable the streaming of multimodal information inputs, both audio and visual encoders utilize a block-wise processing approach. To synchronize the timestamps of video inputs with audio, we organize the audio and video sequentially in an interleaved manner and propose a novel position embedding approach, named TMRoPE (Time-aligned Multimodal RoPE). To concurrently generate text and speech while avoiding interference between the two modalities, we propose the Thinker-Talker architecture. In this framework, Thinker functions as a large language model tasked with text generation, while Talker is a dual-track autoregressive model that directly utilizes the hidden representations from the Thinker to produce audio tokens as output. Both the Thinker and Talker models are designed to be trained and inferred in an end-to-end manner. For decoding audio tokens in a streaming manner, we introduce a sliding-window DiT that restricts the receptive field, aiming to reduce the initial package delay. Qwen2.5-Omni is comparable with the similarly sized Qwen2.5-VL and outperforms Qwen2-Audio. Furthermore, Qwen2.5-Omni achieves state-of-the-art performance on multimodal benchmarks like Omni-Bench. Notably, Qwen2.5-Omni's performance in end-to-end speech instruction following is comparable to its capabilities with text inputs, as evidenced by benchmarks such as MMLU and GSM8K. As for speech generation, Qwen2.5-Omni's streaming Talker outperforms most existing streaming and non-streaming alternatives in robustness and naturalness.
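To illustrate the interleaving and time alignment the abstract describes, a minimal sketch in plain Python; the 40 ms temporal grid, the token format, and the function name are assumptions for illustration, not parameters stated in the abstract:

```python
# Minimal sketch of time-aligned interleaving. The 40 ms grid and the token
# representation are assumed for illustration, not taken from the report.
TIME_PER_ID = 0.040  # seconds represented by one temporal position ID (assumed)

def interleave_by_time(audio_tokens, video_tokens):
    """Each token is (timestamp_seconds, payload). Returns tokens sorted by
    time, each tagged with a time-derived temporal position ID so that audio
    and video events at the same instant get the same temporal index."""
    merged = sorted(
        [("audio", t, p) for t, p in audio_tokens] +
        [("video", t, p) for t, p in video_tokens],
        key=lambda x: x[1])
    return [(mod, round(t / TIME_PER_ID), p) for mod, t, p in merged]

audio = [(0.00, "a0"), (0.04, "a1"), (0.08, "a2")]
video = [(0.00, "v0"), (0.08, "v1")]
for mod, pos_id, payload in interleave_by_time(audio, video):
    print(mod, pos_id, payload)
# The audio and video tokens at t=0.08 both receive temporal ID 2, keeping
# the two streams aligned under a RoPE-style position embedding.
```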
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Qwen2.5-Omni, an end-to-end multimodal model that processes text, image, audio, and video inputs while generating text and streaming natural speech. It employs block-wise encoders for audio/visual streams, interleaves audio and video with the proposed TMRoPE position embedding to align timestamps, and uses a Thinker-Talker architecture in which the Thinker (an LLM) produces text and hidden states that feed a dual-track autoregressive Talker for audio tokens. A sliding-window DiT decoder enables low-latency streaming speech. The report claims the model matches similarly sized Qwen2.5-VL, outperforms Qwen2-Audio, reaches SOTA on Omni-Bench, shows speech instruction-following performance comparable to text on MMLU and GSM8K, and delivers more robust and natural streaming speech than prior alternatives.
Significance. If the performance numbers hold and can be attributed to the architectural choices, the work would be significant for advancing unified multimodal models that support real-time streaming generation. The Thinker-Talker decoupling and TMRoPE alignment mechanism address practical challenges in modality interference and temporal synchronization, providing a concrete design pattern that could be adopted or extended by the community. The end-to-end training claim and the streaming DiT component are also useful reference points for latency-sensitive applications.
Major comments (3)
- [Abstract / §3 (Architecture)] Abstract and architecture description: The central claims that the Thinker-Talker split fully eliminates interference between text and speech modalities and that TMRoPE resolves timestamp misalignment rest on the assertion that these mechanisms succeed without hidden costs; however, the manuscript provides no ablation tables, stability metrics, or OOD evaluations that isolate their contributions versus scale or data effects.
- [Abstract / Results section] Results claims: The SOTA performance on Omni-Bench and the statement that end-to-end speech instruction following matches text performance on MMLU/GSM8K are reported without error bars, multiple runs, or explicit controls for data contamination, making it difficult to verify that the gains derive from the proposed block-wise encoders, interleaved sequencing, and dual-track Talker rather than training data or model size.
- [Abstract / Talker description] Streaming Talker evaluation: The claim that the sliding-window DiT Talker outperforms existing streaming and non-streaming alternatives in robustness and naturalness lacks quantitative latency measurements, robustness tests on out-of-distribution audio, or direct comparisons that control for the Thinker hidden-state input quality.
Minor comments (2)
- [§3.1] The mathematical definition of TMRoPE (Time-aligned Multimodal RoPE) is described only in prose; adding an explicit equation would improve reproducibility and allow readers to verify the timestamp alignment logic (one plausible form is sketched after this list).
- [Abstract / Evaluation] Benchmark names such as Omni-Bench are used without a brief definition or citation in the main text; a short footnote or reference would aid readers unfamiliar with the suite.
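One plausible form such an equation could take, assuming TMRoPE factors the rotary position into temporal, height, and width components with the temporal index tied to absolute time. This is an illustrative reconstruction of the kind of equation the comment asks for, not the paper's stated definition:

```latex
% Illustrative reconstruction only -- not the paper's equation. Assume each
% token i carries a timestamp tau_i and (for visual tokens) a spatial
% location, giving a factored position p_i; the temporal index is tied to
% wall-clock time rather than sequence order.
\[
  t_i = \left\lfloor \tau_i / \Delta \right\rfloor, \qquad
  p_i = (t_i,\, h_i,\, w_i), \qquad
  \theta_{i,k} = p_i^{(a(k))} \cdot 10000^{-2k/d},
\]
\[
  \mathrm{TMRoPE}(x_i)_{2k:2k+2} =
  \begin{pmatrix}
    \cos\theta_{i,k} & -\sin\theta_{i,k} \\
    \sin\theta_{i,k} & \cos\theta_{i,k}
  \end{pmatrix}
  x_{i,\,2k:2k+2},
\]
% where Delta is the time spanned by one temporal index and a(k) assigns
% each channel pair k to one axis of p_i, so temporal, height, and width
% each rotate their own slice of the dimensions.
```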
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below, acknowledging where additional evidence is needed and outlining targeted revisions to the manuscript.
Point-by-point responses
-
Referee: [Abstract / §3 (Architecture)] Abstract and architecture description: The central claims that the Thinker-Talker split fully eliminates interference between text and speech modalities and that TMRoPE resolves timestamp misalignment rest on the assertion that these mechanisms succeed without hidden costs; however, the manuscript provides no ablation tables, stability metrics, or OOD evaluations that isolate their contributions versus scale or data effects.
Authors: We agree that explicit ablations would strengthen the attribution of gains to the Thinker-Talker decoupling and TMRoPE. The current results rely on end-to-end comparisons against Qwen2.5-VL and Qwen2-Audio. In the revised manuscript we will add ablation tables that disable the dual-track Talker (forcing joint text-speech generation) and remove TMRoPE (replacing it with standard RoPE), reporting effects on modality interference, timestamp alignment accuracy, and downstream benchmark scores. We will also include training stability metrics (loss variance across seeds) and a small OOD test set for temporal misalignment. revision: yes
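A minimal sketch of how such an ablation grid with seed-variance reporting could be organized (plain Python; the configuration flags and the train_and_eval stub are hypothetical stand-ins for full training runs):

```python
# Hypothetical ablation grid: toggle the dual-track Talker and swap TMRoPE
# for standard RoPE, then report mean and spread of the final loss across
# seeds. train_and_eval is a stub standing in for a full training run.
import itertools, statistics, random

def train_and_eval(use_dual_track, position_embedding, seed):
    random.seed(seed)
    base = 1.0 + (0.05 if not use_dual_track else 0.0) \
               + (0.03 if position_embedding == "rope" else 0.0)
    return base + random.random() * 0.01  # fake final loss

SEEDS = [0, 1, 2]
for dual_track, pos_emb in itertools.product([True, False], ["tmrope", "rope"]):
    losses = [train_and_eval(dual_track, pos_emb, s) for s in SEEDS]
    print(f"dual_track={dual_track} pos={pos_emb} "
          f"mean={statistics.mean(losses):.3f} "
          f"std={statistics.stdev(losses):.3f}")
```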
-
Referee: [Abstract / Results section] Results claims: The SOTA performance on Omni-Bench and the statement that end-to-end speech instruction following matches text performance on MMLU/GSM8K are reported without error bars, multiple runs, or explicit controls for data contamination, making it difficult to verify that the gains derive from the proposed block-wise encoders, interleaved sequencing, and dual-track Talker rather than training data or model size.
Authors: We acknowledge the value of statistical reporting. All numbers are from single training runs given the scale of end-to-end multimodal training. In revision we will report inference-time variance (multiple decoding seeds) with error bars on MMLU, GSM8K, and Omni-Bench. We will also add a paragraph detailing our data decontamination pipeline (exact overlap checks against benchmark test sets) and note that full multi-run training ablations are computationally prohibitive. These changes clarify the evaluation protocol without altering the reported point estimates. revision: partial
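An exact-overlap check of the kind this response describes can be sketched as n-gram matching of training documents against benchmark test items (plain Python; the 13-gram window is an assumption borrowed from common decontamination practice, not a protocol stated here):

```python
# Sketch of an exact n-gram overlap check between training documents and
# benchmark test items. The 13-gram window is an assumed convention, not a
# protocol from the report.
def ngrams(text, n=13):
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(train_doc, test_items, n=13):
    doc_grams = ngrams(train_doc, n)
    return any(doc_grams & ngrams(item, n) for item in test_items)

test_set = ["janet's ducks lay 16 eggs per day she eats three for breakfast "
            "every morning and bakes muffins for her friends every day"]
print(is_contaminated("unrelated pretraining text " * 10, test_set))        # False
print(is_contaminated(test_set[0] + " plus trailing context", test_set))    # True
```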
-
Referee: [Abstract / Talker description] Streaming Talker evaluation: The claim that the sliding-window DiT Talker outperforms existing streaming and non-streaming alternatives in robustness and naturalness lacks quantitative latency measurements, robustness tests on out-of-distribution audio, or direct comparisons that control for the Thinker hidden-state input quality.
Authors: We will expand the Talker evaluation section with concrete latency metrics (initial package delay, real-time factor, and end-to-end latency under streaming conditions). We will add robustness results on an OOD audio test set (noisy, accented, and code-switched samples) and include controlled comparisons that feed identical Thinker hidden states to both our sliding-window DiT and baseline decoders. These quantitative additions will be placed in a new subsection of the results. revision: yes
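The latency quantities named in this response are directly computable from wall-clock timestamps; a minimal sketch (plain Python; the field and function names are illustrative):

```python
# Sketch of the streaming-latency metrics named above: initial package delay
# (time to the first audio chunk) and real-time factor. Names are illustrative.
from dataclasses import dataclass

@dataclass
class StreamTrace:
    request_time: float    # wall-clock time the request was issued (s)
    chunk_times: list      # wall-clock arrival time of each audio chunk (s)
    audio_duration: float  # total seconds of audio produced

def initial_package_delay(t: StreamTrace) -> float:
    return t.chunk_times[0] - t.request_time

def real_time_factor(t: StreamTrace) -> float:
    # RTF < 1 means generation runs faster than playback.
    return (t.chunk_times[-1] - t.request_time) / t.audio_duration

trace = StreamTrace(request_time=0.0, chunk_times=[0.18, 0.42, 0.65],
                    audio_duration=1.5)
print(initial_package_delay(trace))  # 0.18
print(real_time_factor(trace))       # 0.433...
```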
Circularity Check
No circularity: empirical benchmark claims with no derivations or self-referential predictions
Full rationale
The paper is a technical report presenting Qwen2.5-Omni's architecture (block-wise encoders, TMRoPE, Thinker-Talker split, sliding-window DiT) and its observed performance on benchmarks like Omni-Bench, MMLU, and GSM8K. No equations, first-principles derivations, or 'predictions' are claimed that could reduce to fitted inputs or self-citations by construction. Self-references to prior Qwen models (e.g., Qwen2.5-VL, Qwen2-Audio) are standard comparisons and not load-bearing for the new elements, which are validated directly via empirical results rather than internal definitions. The central claims rest on external benchmark evaluations rather than on the report's own constructions, so there is no circular reduction.
Axiom & Free-Parameter Ledger
Axioms (1)
- standard math: Standard transformer attention and autoregressive generation assumptions hold for the interleaved multimodal inputs.
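For reference, the standard definitions this ledger entry assumes, written out; these are textbook formulas, not contributions of the report:

```latex
% Scaled dot-product attention and the autoregressive factorization assumed
% by the ledger; textbook definitions, applied unchanged over the
% interleaved multimodal token sequence.
\[
  \mathrm{Attention}(Q, K, V)
    = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V,
  \qquad
  p(x_{1:T}) = \prod_{t=1}^{T} p\left(x_t \mid x_{<t}\right).
\]
```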
Forward citations
Cited by 60 Pith papers
-
Omni-DeepSearch: A Benchmark for Audio-Driven Omni-Modal Deep Search
Omni-DeepSearch is a 640-sample benchmark for audio-driven omni-modal search where the best model reaches only 43.44% accuracy, exposing bottlenecks in audio inference, tool use, and cross-modal reasoning.
-
TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos
TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.
-
HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models
HalluAudio is the first large-scale benchmark spanning speech, environmental sound, and music that uses human-verified QA pairs, adversarial prompts, and mixed-audio tests to measure hallucinations in large audio-lang...
-
Benign Fine-Tuning Breaks Safety Alignment in Audio LLMs
Benign fine-tuning on audio data breaks safety alignment in Audio LLMs by raising jailbreak success rates up to 87%, with the dominant risk axis depending on model architecture and embedding proximity to harmful content.
-
DialBGM: A Benchmark for Background Music Recommendation from Everyday Multi-Turn Dialogues
DialBGM is a new benchmark dataset revealing that existing AI models fall far short of human performance when recommending fitting background music for open-domain conversations.
-
Do Audio-Visual Large Language Models Really See and Hear?
AVLLMs encode audio semantics in middle layers but suppress them in final text outputs when audio conflicts with vision, due to training that largely inherits from vision-language base models.
-
Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
Visual debiasing of omni-modal benchmarks combined with staged post-training lets a 3B model match or exceed a 30B model without a stronger teacher.
-
TB-AVA: Text as a Semantic Bridge for Audio-Visual Parameter Efficient Finetuning
TB-AVA uses text as a semantic anchor with a new Text-Bridged Audio-Visual Adapter and Gated Semantic Modulation to achieve state-of-the-art results on audio-visual benchmarks through parameter-efficient fine-tuning.
-
AffectCodec: Emotion-Preserving Neural Speech Codec for Expressive Speech Modeling
AffectCodec is an emotion-guided neural speech codec that preserves emotional cues during quantization while maintaining semantic fidelity and prosodic naturalness.
-
How Should LLMs Listen While Speaking? A Study of User-Stream Routing in Full-Duplex Spoken Dialogue
Channel fusion gives better semantic grounding and QA performance in full-duplex LLM dialogue but is vulnerable to context corruption during interruptions, while cross-attention routing is more robust at the cost of w...
-
Omni-Persona: Systematic Benchmarking and Improving Omnimodal Personalization
Omni-Persona benchmark with 18 tasks shows open-source models have audio-visual grounding gaps, RLVR narrows them but leads to conservative outputs, and scale or recall alone fail as diagnostics.
-
jina-embeddings-v5-omni: Geometry-preserving Embeddings via Locked Aligned Towers
Jina-embeddings-v5-omni creates multimodal embeddings for text, image, audio, and video by freezing the text and media encoders and training only 0.35% of the weights via a VLM-style connector.
-
VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing
VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conve...
-
Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM
TextPro-SLM minimizes the speech-text modality gap from the input side via a prosody-aware unified encoder, delivering the lowest gap and strong performance at 3B/7B scales with only ~1000 hours of audio.
-
Sparse Tokens Suffice: Jailbreaking Audio Language Models via Token-Aware Gradient Optimization
Sparse selection of high-gradient-energy audio tokens suffices for effective jailbreaking of audio language models with minimal drop in attack success rate.
-
TMD-Bench: A Multi-Level Evaluation Paradigm for Music-Dance Co-Generation
TMD-Bench is a multi-level benchmark that measures music-dance co-generation quality including beat-level rhythmic synchronization, supported by a new dataset and Music Captioner, and shows commercial models lag in rh...
-
Walking Through Uncertainty: An Empirical Study of Uncertainty Estimation for Audio-Aware Large Language Models
Semantic-level and verification-based uncertainty methods outperform token-level baselines for audio reasoning in ALLMs, but their relative performance on hallucination and unanswerable-question benchmarks is model- a...
-
Toward Efficient Membership Inference Attacks against Federated Large Language Models: A Projection Residual Approach
ProjRes achieves near-100% accuracy in membership inference on FedLLMs by measuring projection residuals of hidden embeddings on gradient subspaces, outperforming prior methods by up to 75.75% even under differential privacy.
-
SpeechParaling-Bench: A Comprehensive Benchmark for Paralinguistic-Aware Speech Generation
SpeechParaling-Bench is a new evaluation framework for paralinguistic-aware speech generation that reveals major limitations in current large audio-language models.
-
ATIR: Towards Audio-Text Interleaved Contextual Retrieval
Defines ATIR task and benchmark for mixed audio-text queries; MLLM model with token compression shows substantial gains over strong baselines.
-
ProjLens: Unveiling the Role of Projectors in Multimodal Model Safety
ProjLens shows that backdoor parameters in MLLMs are encoded in low-rank subspaces of the projector and that embeddings shift toward the target direction with magnitude linear in input norm, activating only on poisone...
-
Watching Movies Like a Human: Egocentric Emotion Understanding for Embodied Companions
Creates the first egocentric screen-view movie emotion benchmark and demonstrates that cinematic models drop sharply in Macro-F1 on realistic robot-like viewing conditions while domain-specific training improves robustness.
-
From Reactive to Proactive: Assessing the Proactivity of Voice Agents via ProVoice-Bench
ProVoice-Bench is the first framework to evaluate proactive voice agents, revealing that state-of-the-art multimodal LLMs struggle with over-triggering and context-aware reasoning.
-
Chain of Modality: From Static Fusion to Dynamic Orchestration in Omni-MLLMs
Chain of Modality dynamically orchestrates multimodal input topologies and bifurcates cognitive execution to overcome static fusion biases in Omni-MLLMs.
-
Don't Let the Video Speak: Audio-Contrastive Preference Optimization for Audio-Visual Language Models
Audio-Contrastive Preference Optimization (ACPO) mitigates audio hallucination in AVLMs via output-contrastive and input-contrastive objectives that enforce faithful audio grounding.
-
Character Beyond Speech: Leveraging Role-Playing Evaluation in Audio Large Language Models via Reinforcement Learning
RoleJudge is a multidimensional evaluation framework for speech-character alignment in audio LLMs, backed by the RoleChat dataset and multi-stage RL training with standard alignment to reduce reward issues.
-
HumDial-EIBench: A Human-Recorded Multi-Turn Emotional Intelligence Benchmark for Audio Language Models
HumDial-EIBench is a new benchmark using real human dialogues to evaluate audio language models on emotional intelligence tasks including multi-turn tracking, causal reasoning, empathy generation, and acoustic-semanti...
-
TiCo: Time-Controllable Spoken Dialogue Model
TiCo enables spoken dialogue models to follow explicit time constraints in generated responses using Spoken Time Markers and reinforcement learning with verifiable rewards, cutting duration error by 2.7x over its backbone.
-
OmniTrace: A Unified Framework for Generation-Time Attribution in Omni-Modal LLMs
OmniTrace converts token-level signals into span-level cross-modal attributions for open-ended generation in omni-modal LLMs via generation-time tracing.
-
Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models
Audio Flamingo 3 introduces an open large audio-language model achieving new state-of-the-art results on over 20 audio understanding and reasoning benchmarks using a unified encoder and curriculum training on open data.
-
FSD50K-Solo: Automated Curation of Single-Source Sound Events
The authors present a scalable curation method that combines diffusion-based mixture synthesis with a discriminative classifier to automatically extract single-source sound events from FSD50K and release the cleaned F...
-
SyncDPO: Enhancing Temporal Synchronization in Video-Audio Joint Generation via Preference Learning
SyncDPO improves temporal synchronization in video-audio joint generation using DPO with efficient on-the-fly negative sample construction and curriculum learning.
-
Towards Fine-Grained Multi-Dimensional Speech Understanding: Data Pipeline, Benchmark, and Model
A data pipeline, 14-dimension benchmark, and decoupled fine-tuning model are presented to advance fine-grained multi-dimensional speech understanding in LLMs.
-
Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
Staged post-training with self-distillation lets a 3B omni-modal model match or slightly exceed a 30B model on a visually debiased benchmark.
-
Keep What Audio Cannot Say: Context-Preserving Token Pruning for Omni-LLMs
ContextGuard prunes 55% of tokens in Qwen2.5-Omni 7B while matching full performance on five of six audio-visual benchmarks by preserving audio-irrecoverable visual context.
-
TB-AVA: Text as a Semantic Bridge for Audio-Visual Parameter Efficient Finetuning
TB-AVA uses text-mediated gated semantic modulation to enable efficient audio-visual alignment, achieving state-of-the-art results on AVE, AVS, and AVVP benchmarks.
-
Probing Cross-modal Information Hubs in Audio-Visual LLMs
AVLLMs encode integrated audio-visual information primarily in specialized cross-modal sink tokens, which enables a training-free hallucination mitigation approach.
-
Probing Cross-modal Information Hubs in Audio-Visual LLMs
AVLLMs store integrated audio-visual information mainly in a distinct subset of sink tokens called cross-modal sink tokens, which can be leveraged for training-free hallucination mitigation.
-
Accelerating Compound LLM Training Workloads with Maestro
Maestro accelerates compound LLM training via section graphs for per-component configuration and wavefront scheduling for dynamic execution, reducing GPU consumption by ~40% in real deployments.
-
jina-embeddings-v5-omni: Geometry-preserving Embeddings via Locked Aligned Towers
GELATO extends frozen text embedding models with locked image and audio encoders, training minimal connectors to produce a single semantic embedding space for text, image, audio, and video while keeping original text ...
-
KARMA-MV: A Benchmark for Causal Question Answering on Music Videos
KARMA-MV is a new benchmark showing that causal knowledge graphs improve VLMs on causal audio-visual reasoning in music videos.
-
EmoMM: Benchmarking and Steering MLLM for Multimodal Emotion Recognition under Conflict and Missingness
EmoMM benchmark reveals Video Contribution Collapse in MLLMs for emotion recognition under modality conflict and missingness, mitigated by CHASE head-level attention steering.
-
All That Glitters Is Not Audio: Rethinking Text Priors and Audio Reliance in Audio-Language Evaluation
Audio-language models retain 60-72% of benchmark scores without audio, and most audio-dependent items can be solved from short fragments rather than full clips.
-
Exploring Audio Hallucination in Egocentric Video Understanding
AV-LLMs hallucinate audio from visuals in egocentric videos, scoring only 27.3% accuracy on foreground sounds and 39.5% on background sounds in a 1000-question evaluation.
-
HeadRouter: Dynamic Head-Weight Routing for Task-Adaptive Audio Token Pruning in Large Audio Language Models
HeadRouter prunes audio tokens more effectively by dynamically routing based on per-head importance for semantic versus acoustic tasks, exceeding baseline performance at 70% token retention on Qwen2.5-Omni models.
-
DM-ASR: Diarization-aware Multi-speaker ASR with Large Language Models
DM-ASR reformulates multi-speaker ASR as multi-turn dialogue generation conditioned on diarization results, achieving competitive benchmark performance with relatively small models and limited data.
-
MMControl: Unified Multi-Modal Control for Joint Audio-Video Generation
MMControl adds multi-modal controls for identity, timbre, pose, and layout to unified audio-video diffusion models via dual-stream injection and adjustable guidance scaling.
-
Omni-Embed-Audio: Leveraging Multimodal LLMs for Robust Audio-Text Retrieval
Omni-Embed-Audio uses multimodal LLMs to match CLAP on standard audio retrieval while improving text-to-text retrieval by 22% relative and hard negative discrimination by 4.3 points HNSR@10 on user-intent queries.
-
Hard to Be Heard: Phoneme-Level ASR Analysis of Phonologically Complex, Low-Resource Endangered Languages
Phoneme-level analysis of ASR on Archi and Rutul shows data scarcity explains recognition errors better than phonological complexity, with language-specific adaptations improving wav2vec2 performance.
-
VIBE: Voice-Induced open-ended Bias Evaluation for Large Audio-Language Models via Real-World Speech
VIBE evaluates generative biases in large audio-language models with real-world speech and open-ended tasks, showing that gender cues produce larger distributional shifts than accent cues across 11 tested models.
-
Beyond Text-Dominance: Understanding Modality Preference of Omni-modal Large Language Models
Omni-modal LLMs exhibit visual preference that emerges in mid-to-late layers, enabling hallucination detection without task-specific training.
-
AVRT: Audio-Visual Reasoning Transfer through Single-Modality Teachers
AVRT transfers reasoning to audio-visual models by distilling traces from single-modality teachers via LLM merger followed by SFT cold-start and RL, achieving SOTA on OmniBench, DailyOmni, and MMAR with 3B/7B models.
-
RaTA-Tool: Retrieval-based Tool Selection with Multimodal Large Language Models
RaTA-Tool retrieves suitable external tools for multimodal queries by matching generated task descriptions against tool metadata, supported by a new Hugging Face-derived dataset and DPO optimization.
-
Why Your Tokenizer Fails in Information Fusion: A Timing-Aware Pre-Quantization Fusion for Video-Enhanced Audio Tokenization
A timing-aware pre-quantization fusion approach integrates visual cues into audio tokenizers along the temporal axis, maintaining reconstruction quality while outperforming audio-only and prior multimodal baselines on...
-
POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs
POINTS-Long is a dual-mode multimodal large language model that uses dynamic visual token scaling to retain 97.7-99.7% accuracy on long-form tasks with 1/40 to 1/10th the tokens and supports streaming via detachable KV-cache.
-
Bridging What the Model Thinks and How It Speaks: Self-Aware Speech Language Models for Expressive Speech Generation
SA-SLM uses variational information bottleneck for intent-aware bridging and self-criticism for realization-aware alignment to close the semantic-acoustic gap, outperforming open-source models and nearing GPT-4o-Audio...
-
EdgeRazor: A Lightweight Framework for Large Language Models via Mixed-Precision Quantization-Aware Distillation
EdgeRazor delivers 1.58-1.88 bit quantized LLMs that outperform 2-3 bit baselines by up to 11.3 points while using 4-10x less training compute than leading QAT methods.
-
GRM: Utility-Aware Jailbreak Attacks on Audio LLMs via Gradient-Ratio Masking
GRM ranks Mel bands by attack contribution versus utility sensitivity, perturbs a subset, and learns a universal perturbation to reach 88.46% average jailbreak success rate with improved attack-utility trade-off on fo...
-
Noise-Aware In-Context Learning for Hallucination Mitigation in ALLMs
NAICL reduces hallucination rates in ALLMs from 26.53% to 16.98% via noise priors in context and introduces the Clotho-1K benchmark with four hallucination types.
-
FoleyDirector: Fine-Grained Temporal Steering for Video-to-Audio Generation via Structured Scripts
FoleyDirector introduces structured temporal scripts and a fusion module to enable precise timing control in DiT-based video-to-audio generation while preserving audio fidelity.
Reference graph
Works this paper leans on
-
[1]
Seed-TTS: A Family of High-Quality Versatile Speech Generation Models
Philip Anastassiou, Jiawei Chen, Jitong Chen, Yuanzhe Chen, Zhuo Chen, Ziyi Chen, Jian Cong, Lelai Deng, Chuang Ding, Lu Gao, et al. Seed-tts: A family of high-quality versatile speech generation models. arXiv preprint arXiv:2406.02430,
-
[2]
Program Synthesis with Large Language Models
Jacob Austin, Augustus Odena, Maxwell I. Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie J. Cai, Michael Terry, Quoc V. Le, and Charles Sutton. Program synthesis with large language models. CoRR, abs/2108.07732.
-
[3]
Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, et al. Qwen technical report. arXiv:2309.16609.
-
[4]
Seed-ASR: Understanding diverse speech and contexts with LLM-based speech recognition
Ye Bai, Jingping Chen, Jitong Chen, Wei Chen, Zhuo Chen, Chuang Ding, Linhao Dong, Qianqian Dong, Yujiao Du, Kepan Gao, et al. Seed-asr: Understanding diverse speech and contexts with llm-based speech recognition. arXiv preprint arXiv:2407.04675,
-
[5]
Are We on the Right Way for Evaluating Large Vision-Language Models?
Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision-language models? arXiv:2403.20330, 2024. Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé de Oliveira Pinto, Jared Kaplan, Harrison Edwards, Yuri Burda, Nicholas Joseph, et al. Evaluating large language models trained on code. arXiv:2107.03374.
-
[6]
Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Michael Zeng, Xiangzhan Yu, and Furu Wei. WavLM: Large-scale self-supervised pre-training for full stack speech processing. IEEE Journal of Selected Topics in Signal Processing, 2022.
-
[7]
Yiming Chen, Xianghu Yue, Chen Zhang, Xiaoxue Gao, Robby T. Tan, and Haizhou Li. VoiceBench: Benchmarking LLM-based voice assistants. arXiv preprint arXiv:2410.17196, 2024. Yushen Chen, Zhikang Niu, Ziyang Ma, Keqi Deng, Chunhui Wang, Jian Zhao, Kai Yu, and Xie Chen. F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching. arXiv preprint.
-
[8]
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. arXiv:2305.06500,
-
[9]
Nilaksh Das, Saket Dingliwal, Srikanth Ronanki, Rohit Paturi, David Huang, Prashant Mathur, Jie Yuan, Dhanush Bekal, Xing Niu, Sai Muralidhar Jayanthi, et al. Speechverse: A large-scale generalizable audio language model. arXiv preprint arXiv:2405.08295,
-
[10]
LP-MusicCaps: LLM-based pseudo music captioning
SeungHeon Doh, Keunwoo Choi, Jongpil Lee, and Juhan Nam. Lp-musiccaps: Llm-based pseudo music captioning. arXiv preprint arXiv:2307.16372,
-
[11]
CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models
Zhihao Du, Yuxuan Wang, Qian Chen, Xian Shi, Xiang Lv, Tianyu Zhao, Zhifu Gao, Yexin Yang, Changfeng Gao, Hui Wang, et al. Cosyvoice 2: Scalable streaming speech synthesis with large language models. arXiv preprint arXiv:2412.10117,
-
[12]
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurélien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Rozière, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.
-
[13]
E2 tts: Embarrassingly easy fully non-autoregressive zero-shot tts
Sefik Emre Eskimez, Xiaofei Wang, Manthan Thakker, Canrun Li, Chung-Hsien Tsai, Zhen Xiao, Hemin Yang, Zirun Zhu, Min Tang, Xu Tan, et al. E2 TTS: Embarrassingly easy fully non-autoregressive zero-shot TTS. In 2024 IEEE Spoken Language Technology Workshop (SLT), pp. 682–689. IEEE, 2024.
-
[14]
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv:2306.13394,
-
[15]
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-MME: The first-ever comprehensive evaluation benchmark of multi-modal LLMs in video analysis. arXiv:2405.21075, 2024.
-
[16]
Are we done with MMLU?
Aryo Pradipta Gema, Joshua Ong Jun Leang, Giwon Hong, Alessio Devoto, Alberto Carlo Maria Mancino, Rohit Saxena, Xuanli He, Yu Zhao, Xiaotang Du, Mohammad Reza Ghasemi Madani, et al. Are we done with mmlu? CoRR, abs/2406.04127,
-
[17]
Kaixiong Gong, Kaituo Feng, Bohao Li, Yibing Wang, Mofan Cheng, Shijia Yang, Jiaming Han, Benyou Wang, Yutong Bai, Zhuoran Yang, et al. AV-Odyssey Bench: Can your multimodal LLMs really understand audio-visual information? arXiv preprint arXiv:2412.02611.
-
[18]
MERaLiON-AudioLLM: Technical report
Yingxu He, Zhuohan Liu, Shuo Sun, Bin Wang, Wenyu Zhang, Xunlong Zou, Nancy F. Chen, and Ai Ti Aw. MERaLiON-AudioLLM: Technical report. arXiv preprint arXiv:2412.09818.
-
[19]
Language is not all you need: Aligning perception with language models
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In ICLR, 2021. Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. In NeurIPS Datasets and Benchmarks, 2021.
-
[20]
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
Infinigence. Infini-Megrez-Omni. URL https://github.com/infinigence/Infini-Megrez-Omni. Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and contamination free evaluation of large language models for code. CoRR, abs/2403.07974.
-
[21]
ReferItGame: Referring to objects in photographs of natural scenes
Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. ReferItGame: Referring to objects in photographs of natural scenes. In Alessandro Moschitti, Bo Pang, and Walter Daelemans (eds.), Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 787–798, Doha, Qatar, October 2014.
-
[22]
A diagram is worth a dozen images
Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In ECCV, 2016.
-
[23]
Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv:2301.12597,
-
[24]
Baichuan-Omni-1.5 technical report
Yadong Li, Jun Liu, Tao Zhang, Song Chen, Tianpeng Li, Zehuan Li, Lijun Liu, Lingfeng Ming, Guosheng Dong, Da Pan, et al. Baichuan-Omni-1.5 technical report. arXiv preprint arXiv:2501.15368.
-
[25]
Omnibench: Towards the future of universal omni-language models
Yizhi Li, Ge Zhang, Yinghao Ma, Ruibin Yuan, Kang Zhu, Hangyu Guo, Yiming Liang, Jiaheng Liu, Zekun Wang, Jian Yang, et al. OmniBench: Towards the future of universal omni-language models. arXiv preprint arXiv:2409.15272, 2024. Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In ICLR, 2023.
-
[26]
Improved Baselines with Visual Instruction Tuning
Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv:2310.03744, 2023. Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv:2304.08485, 2023.
-
[27]
Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection
URL https://arxiv.org/abs/2303.05499. Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, and Dahua Lin. MMBench: Is your multi-modal model an all-around player? arXiv:2307.06281, 2023.
-
[28]
ChartQA: A benchmark for question answering about charts with visual and logical reasoning
Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. arXiv:2203.10244.
-
[29]
OpenAI. ChatML. URL https://github.com/openai/openai-python/blob/e389823ba013a24b4c32ce38fa0bd87e6bccae94/chatml.md. OpenAI. GPT-4 technical report. CoRR, abs/2303.08774.
-
[30]
Robust speech recognition via large-scale weak supervision
Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, 2023.
-
[31]
David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level Google-proof Q&A benchmark. CoRR, abs/2311.12022,
-
[32]
MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark
URL https://arxiv.org/abs/2410.19168. Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards VQA models that can read. In CVPR, 2019.
-
[33]
Guangzhi Sun, Wenyi Yu, Changli Tang, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Yuxuan Wang, and Chao Zhang. video-salmonn: Speech-enhanced audio-visual large language models. arXiv preprint arXiv:2406.15704,
-
[34]
SALMONN: towards generic hearing abilities for large language models
Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, and Chao Zhang. SALMONN: Towards generic hearing abilities for large language models. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024.
-
[35]
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530,
-
[36]
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv:2302.13971, 2023. Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv:2307.09288, 2023.
-
[37]
On decoder-only architecture for speech-to-text and large language model integration
Jian Wu, Yashesh Gaur, Zhuo Chen, Long Zhou, Yimeng Zhu, Tianrui Wang, Jinyu Li, Shujie Liu, Bo Ren, Linquan Liu, and Yu Wu. On decoder-only architecture for speech-to-text and large language model integration. abs/2307.03917,
-
[38]
Mini-omni: Language models can hear, talk while thinking in streaming
Zhifei Xie and Changqiao Wu. Mini-Omni: Language models can hear, talk while thinking in streaming. arXiv preprint arXiv:2408.16725.
-
[39]
An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. Qwen2 technical report. arXiv:2407.10671, 2024. An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv:2412.15115.
-
[40]
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. arXiv:2311.16502,
-
[41]
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Ming Yin, Botao Yu, Ge Zhang, et al. MMMU-Pro: A more robust multi-discipline multimodal understanding benchmark. arXiv preprint arXiv:2409.02813.
-
[42]
Anygpt: Unified multimodal llm with discrete sequence modeling
Jun Zhan, Junqi Dai, Jiasheng Ye, Yunhua Zhou, Dong Zhang, Zhigeng Liu, Xin Zhang, Ruibin Yuan, Ge Zhang, Linyang Li, et al. Anygpt: Unified multimodal llm with discrete sequence modeling. arXiv preprint arXiv:2402.12226,
-
[43]
Yi-Fan Zhang, Huanyu Zhang, Haochen Tian, Chaoyou Fu, Shuangqing Zhang, Junfei Wu, Feng Li, Kun Wang, Qingsong Wen, Zhang Zhang, et al. Mme-realworld: Could your multimodal llm challenge high-resolution real-world scenarios that are difficult for humans? arXiv preprint arXiv:2408.13257,
-
[44]
Lyra: An efficient and speech-centric framework for omni-cognition
Zhisheng Zhong, Chengyao Wang, Yuqi Liu, Senqiao Yang, Longxiang Tang, Yuechen Zhang, Jingyao Li, Tianyuan Qu, Yanwei Li, Yukang Chen, et al. Lyra: An efficient and speech-centric framework for omni-cognition. arXiv preprint arXiv:2412.09501,
-
[45]
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv:2304.10592.