Audio flamingo 2: An audio-language model with long- audio understanding and expert reasoning abili- ties

· 2025 · arXiv 2503.03983

7 Pith papers cite this work. Polarity classification is still indexing.

7 Pith papers citing it

read on arXiv browse 7 citing papers

citation-role summary

background 3

citation-polarity summary

background 3

representative citing papers

TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

cs.CV · 2026-05-08 · unverdicted · novelty 8.0

TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

Character Beyond Speech: Leveraging Role-Playing Evaluation in Audio Large Language Models via Reinforcement Learning

cs.LG · 2026-04-15 · unverdicted · novelty 7.0

RoleJudge is a multidimensional evaluation framework for speech-character alignment in audio LLMs, backed by the RoleChat dataset and multi-stage RL training with standard alignment to reduce reward issues.

HeadRouter: Dynamic Head-Weight Routing for Task-Adaptive Audio Token Pruning in Large Audio Language Models

cs.SD · 2026-04-26 · unverdicted · novelty 6.0

HeadRouter prunes audio tokens more effectively by dynamically routing based on per-head importance for semantic versus acoustic tasks, exceeding baseline performance at 70% token retention on Qwen2.5-Omni models.

SpotSound: Enhancing Large Audio-Language Models with Fine-Grained Temporal Grounding

cs.SD · 2026-04-14 · unverdicted · novelty 6.0

SpotSound adds a hallucination-suppressing objective and a needle-in-haystack benchmark to audio-language models, reaching state-of-the-art temporal grounding while keeping general task performance.

Step-Audio 2 Technical Report

cs.CL · 2025-07-22 · unverdicted · novelty 6.0

Step-Audio 2 integrates a latent audio encoder, reasoning-centric reinforcement learning, and discrete audio token generation into language modeling to deliver state-of-the-art performance on audio understanding and conversational benchmarks.

A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook

cs.SD · 2026-05-18 · unverdicted · novelty 5.0

A survey of Large Audio Language Models that establishes a taxonomy of trustworthiness vulnerabilities and proposes a Defense-in-Depth roadmap for audio intelligence.

OmniFysics: Towards Physical Intelligence Evolution via Omni-Modal Signal Processing and Network Optimization

cs.CV · 2026-02-05 · unverdicted · novelty 4.0

OmniFysics is an omni-modal network using a dynamic physical data engine and evolutive tuning to improve performance on multimodal benchmarks and physics-oriented tasks.

citing papers explorer

Showing 7 of 7 citing papers.

TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos cs.CV · 2026-05-08 · unverdicted · none · ref 44
TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.
Character Beyond Speech: Leveraging Role-Playing Evaluation in Audio Large Language Models via Reinforcement Learning cs.LG · 2026-04-15 · unverdicted · none · ref 14
RoleJudge is a multidimensional evaluation framework for speech-character alignment in audio LLMs, backed by the RoleChat dataset and multi-stage RL training with standard alignment to reduce reward issues.
HeadRouter: Dynamic Head-Weight Routing for Task-Adaptive Audio Token Pruning in Large Audio Language Models cs.SD · 2026-04-26 · unverdicted · none · ref 8
HeadRouter prunes audio tokens more effectively by dynamically routing based on per-head importance for semantic versus acoustic tasks, exceeding baseline performance at 70% token retention on Qwen2.5-Omni models.
SpotSound: Enhancing Large Audio-Language Models with Fine-Grained Temporal Grounding cs.SD · 2026-04-14 · unverdicted · none · ref 12
SpotSound adds a hallucination-suppressing objective and a needle-in-haystack benchmark to audio-language models, reaching state-of-the-art temporal grounding while keeping general task performance.
Step-Audio 2 Technical Report cs.CL · 2025-07-22 · unverdicted · none · ref 24
Step-Audio 2 integrates a latent audio encoder, reasoning-centric reinforcement learning, and discrete audio token generation into language modeling to deliver state-of-the-art performance on audio understanding and conversational benchmarks.
A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook cs.SD · 2026-05-18 · unverdicted · none · ref 129
A survey of Large Audio Language Models that establishes a taxonomy of trustworthiness vulnerabilities and proposes a Defense-in-Depth roadmap for audio intelligence.
OmniFysics: Towards Physical Intelligence Evolution via Omni-Modal Signal Processing and Network Optimization cs.CV · 2026-02-05 · unverdicted · none · ref 62
OmniFysics is an omni-modal network using a dynamic physical data engine and evolutive tuning to improve performance on multimodal benchmarks and physics-oriented tasks.

Audio flamingo 2: An audio-language model with long- audio understanding and expert reasoning abili- ties

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer