TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.
Kimi-Audio Technical Report
Canonical reference. 73% of citing Pith papers cite this work as background.
abstract
We present Kimi-Audio, an open-source audio foundation model that excels in audio understanding, generation, and conversation. We detail the practices in building Kimi-Audio, including model architecture, data curation, training recipe, inference deployment, and evaluation. Specifically, we leverage a 12.5Hz audio tokenizer, design a novel LLM-based architecture with continuous features as input and discrete tokens as output, and develop a chunk-wise streaming detokenizer based on flow matching. We curate a pre-training dataset that consists of more than 13 million hours of audio data covering a wide range of modalities including speech, sound, and music, and build a pipeline to construct high-quality and diverse post-training data. Initialized from a pre-trained LLM, Kimi-Audio is continually pre-trained on both audio and text data with several carefully designed tasks, and then fine-tuned to support a diverse range of audio-related tasks. Extensive evaluation shows that Kimi-Audio achieves state-of-the-art performance on a range of audio benchmarks including speech recognition, audio understanding, audio question answering, and speech conversation. We release the code, model checkpoints, and evaluation toolkits at https://github.com/MoonshotAI/Kimi-Audio.
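To give a sense of scale for the 12.5Hz tokenizer rate quoted in the abstract, the following is a minimal arithmetic sketch. It assumes one token per frame, which the abstract does not state; the function names are hypothetical and are not part of the Kimi-Audio codebase.

```python
import math
from fractions import Fraction

# 12.5 Hz, the tokenizer rate quoted in the abstract.
TOKEN_RATE_HZ = Fraction(25, 2)


def frame_duration_ms(rate_hz=TOKEN_RATE_HZ):
    """Length of one token frame in milliseconds."""
    return float(1000 / rate_hz)


def frames_for_duration(seconds, rate_hz=TOKEN_RATE_HZ):
    """Frames needed to cover `seconds` of audio.

    Assumption (not from the source): one token per frame, with a
    partial trailing frame rounded up.
    """
    return math.ceil(seconds * rate_hz)


if __name__ == "__main__":
    print(frame_duration_ms())        # milliseconds per frame
    print(frames_for_duration(10))    # frames in 10 seconds
    print(frames_for_duration(3600))  # frames in one hour
```

Under that assumption, each frame spans 80 ms and an hour of audio corresponds to 45,000 frames.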
representative citing papers
HalluAudio is the first large-scale benchmark spanning speech, environmental sound, and music that uses human-verified QA pairs, adversarial prompts, and mixed-audio tests to measure hallucinations in large audio-language models.
Benign fine-tuning on audio data breaks safety alignment in Audio LLMs by raising jailbreak success rates up to 87%, with the dominant risk axis depending on model architecture and embedding proximity to harmful content.
VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.
FlexiSLM is the first spoken language model supporting dynamic and controllable frame rates on speech input and output, outperforming fixed-rate 7B models at high quality and enabling faster inference at lower rates like 6.25 Hz.
FD-SLMs exhibit state inertia during abrupt interruptions that a training-free perception-vector steering intervention mitigates, lifting correctness from 28% to 45% and IWOR from 40% to 72% on the Zero-Buffer Benchmark.
AVI-Bench is a cognitively inspired benchmark that evaluates Omni-MLLMs on joint audio-visual tasks and reveals substantial limitations in current models.
SpeechEditBench provides seven atomic editing tasks, compositional multi-operation instructions, and an anchor-based protocol yielding target success, preservation success, and joint success metrics; evaluations show that no model excels across all dimensions and that compositional editing is especially difficult.
PolySpeech-100 is a new benchmark for native-level speech comprehension across 110 linguistic variants that evaluates 22 models and reports E2E advantages on dialects, robustness gaps on low-resource languages, and degradation from Chain-of-Thought prompting.
CBT-Audio dataset shows that adding audio input improves distress intensity estimation over transcripts alone for 8 of 10 audio language models, with clearest gains when verbal content and vocal delivery diverge.
AffectCodec is an emotion-guided neural speech codec that preserves emotional cues during quantization while maintaining semantic fidelity and prosodic naturalness.
Channel fusion gives better semantic grounding and QA performance in full-duplex LLM dialogue but is vulnerable to context corruption during interruptions, while cross-attention routing is more robust at the cost of weaker integration.
VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conversational benchmarks.
LWS is a text-first paradigm for full-duplex speech LLMs that treats visible writing as a primary output channel alongside audio input and spoken response, implemented via token schema and synthetic per-second annotations.
SpeechParaling-Bench is a new evaluation framework for paralinguistic-aware speech generation that reveals major limitations in current large audio-language models.
ICLAD combines in-context learning and comparison guidance in audio language models with a routing detector to boost generalization and explanations for audio deepfake detection, achieving up to 2x F1 gains on wild data.
ProVoice-Bench is the first framework to evaluate proactive voice agents, revealing that state-of-the-art multimodal LLMs struggle with over-triggering and context-aware reasoning.
AudioHijack generates imperceptible adversarial audio via gradient estimation, attention supervision, and reverberation blending to hijack 13 LALMs with 79-96% success on unseen contexts and real commercial agents.
HumDial-EIBench is a new benchmark using real human dialogues to evaluate audio language models on emotional intelligence tasks including multi-turn tracking, causal reasoning, empathy generation, and acoustic-semantic conflict resolution.
CapTalk unifies single-utterance and dialogue voice design via utterance- and speaker-level captions plus a hierarchical variational module for stable timbre with adaptive expression.
Speaker-Reasoner is an end-to-end speech LLM that iteratively analyzes audio structure, predicts temporal boundaries, and jointly models speaker identity, gender, timestamps, and transcription using a speaker-aware cache for long audio.
TiCo enables spoken dialogue models to follow explicit time constraints in generated responses using Spoken Time Markers and reinforcement learning with verifiable rewards, cutting duration error by 2.7x over its backbone.
M³KG-RAG improves multimodal reasoning in large language models by constructing multi-hop knowledge graphs and selectively pruning retrieved context with GRASP.
MECAT is a multi-expert benchmark for audio AI offering fine-grained captions and QA pairs generated via expert models and LLM reasoning, paired with the DATE metric that combines semantic similarity and cross-sample discriminability to favor detailed outputs.