Audio-Visual Intelligence in Large Foundation Models
Pith reviewed 2026-05-06 04:04 UTC · model claude-opus-4-7
The pith
A unified taxonomy for audio-visual intelligence in the era of foundation models, mapping perception, generation, and interaction onto a single methodological scaffold.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors argue that audio-visual intelligence (AVI), in the era of large foundation models, has matured into a coherent field that should be organized around three pillars — perception, generation, and interaction — and three methodological families — representation-centric, generation-centric, and LLM-centric. Within that scaffold they catalog tasks (from sound localization and AVQA to video-to-audio, talking heads, joint text-to-audio-video, omni dialogue, and audio-visual VLA models), datasets, and metrics, and then argue that the next frontier is not larger backbones but causal event-source grounding, action-conditioned audio-visual world models, layered context memory, causal editing
What carries the argument
A two-axis taxonomy: tasks split into perception (pixel, content, reasoning), generation (conditional, cross-modal, joint), and interaction (conversation, embodiment); methods split into representation-centric (contrastive/SSL/MAE, tokenization), generation-centric (GAN/diffusion/AR/MAR), and LLM-centric (encoder+LLM, LLM+generator, unified omni, agentic, VLA). The two axes are crossed to place every reviewed system, dataset, and benchmark in a single coordinate.
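The two-axis coordinate described above can be sketched as a small data structure. This is a minimal illustration, not the authors' implementation: the class names (`Task`, `Method`, `Cell`, `Placement`), the `multi_cell_share` helper, and the two example systems are hypothetical. Only the pillar and family names come from the survey's taxonomy.

```python
from dataclasses import dataclass
from enum import Enum


class Task(Enum):
    PERCEPTION = "perception"
    GENERATION = "generation"
    INTERACTION = "interaction"


class Method(Enum):
    REPRESENTATION = "representation-centric"
    GENERATION = "generation-centric"
    LLM = "LLM-centric"


@dataclass(frozen=True)
class Cell:
    """One coordinate in the task x method grid."""
    task: Task
    method: Method


@dataclass
class Placement:
    """A system's primary cell plus any secondary cells it also occupies."""
    system: str
    primary: Cell
    secondary: frozenset = frozenset()

    def cells(self):
        return {self.primary} | set(self.secondary)


def multi_cell_share(placements):
    """Fraction of systems that occupy more than one cell."""
    if not placements:
        return 0.0
    multi = sum(1 for p in placements if len(p.cells()) > 1)
    return multi / len(placements)


# Hypothetical placements, for illustration only (not taken from the survey).
placements = [
    Placement("hypothetical-av-encoder",
              Cell(Task.PERCEPTION, Method.REPRESENTATION)),
    Placement("hypothetical-omni-model",
              Cell(Task.INTERACTION, Method.LLM),
              frozenset({Cell(Task.PERCEPTION, Method.LLM),
                         Cell(Task.GENERATION, Method.LLM)})),
]
share = multi_cell_share(placements)
```

Modeling secondary membership explicitly makes it possible to measure how often a single coordinate is insufficient, which is the quantity the load-bearing premise depends on.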
If this is right
- Future audio-visual systems should be evaluated not only on signal fidelity and synchronization but also on source grounding, off-screen and occluded sound, and physical/causal coherence.
- Joint audio-video generators built on shared diffusion or flow-matching backbones with cross-modal attention are positioned as the default architecture for text-to-audio-video and image-to-audio-video tasks.
- Embodied AI benefits from treating contact sound as a first-class control signal, motivating audio-augmented VLA policies and audio-aware world models rather than vision-only stacks.
- Verifier and reward models that score grounding, synchronization, and audio indispensability become reusable infrastructure for both benchmarking and preference-based post-training (DPO/GRPO).
- Long-form AVI requires hierarchical, provenance-aware memory across sensory, event, semantic, and task layers rather than longer raw context windows.
Where Pith is reading between the lines
- The clearest empirical gap surfaced by the survey is duration robustness: omni models that excel on short clips degrade sharply on multi-minute audio-video, suggesting that streaming memory architectures, not parameter scale, will drive the next generation of benchmarks.
- Because cascaded LLM+generator pipelines and unified omni models are converging on similar capabilities, the differentiator over the next cycle is likely to be controllability of local edits — object-, stem-, and identity-level interventions — rather than raw generation quality.
- The survey implicitly predicts that audio-visual reasoning will follow the trajectory of text reasoning: prompting and SFT first, then RL with grounded multimodal verifiers, with the verifier ecosystem itself becoming a research artifact.
- Treating sound as evidence about hidden sources (off-screen events, materials, geometry) rather than as a label aligned to a frame would push AVQA, V2A, and AV segmentation toward a shared event-source-graph formulation that the survey gestures at but does not formalize.
Load-bearing premise
That a single hierarchical taxonomy can cleanly partition a field whose strongest systems are increasingly unified omni models that deliberately blur the boundaries between perception, generation, and interaction.
What would settle it
Show that the proposed task/method taxonomy fails to place a meaningful share of recent omni or VLA systems without forcing them into multiple cells, or demonstrate that comparing methods along the surveyed axes (e.g., FAD/FVD/SyncNet/CLIP-style scores plus the listed benchmarks) systematically reorders model rankings relative to head-to-head human evaluation on the same tasks.
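The second test above, whether automatic metrics reorder models relative to human judgment, could be checked with a rank correlation. The sketch below assumes Kendall's tau-a as the agreement measure, which is this example's choice and not something the survey or the review specifies. The scores are made up, and lower-is-better metrics such as FAD or FVD would need to be negated first.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall tau-a between two equally long score lists.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two score lists of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical scores for four models; higher is better in both lists.
metric_scores = [0.9, 0.7, 0.5, 0.3]   # e.g. an automatic sync score
human_scores = [4.0, 2.0, 3.0, 1.0]    # e.g. mean human preference rating

tau = kendall_tau(metric_scores, human_scores)
```

A tau near 1 would mean the surveyed metrics preserve the human ranking. Values well below 1 across many tasks would be the systematic reordering that the test describes.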
Original abstract
Audio-Visual Intelligence (AVI) has emerged as a central frontier in artificial intelligence, bridging auditory and visual modalities to enable machines that can perceive, generate, and interact in the multimodal real world. In the era of large foundation models, joint modeling of audio and vision has become increasingly crucial, i.e., not only for understanding but also for controllable generation and reasoning across dynamic, temporally grounded signals. Recent advances, such as Meta MovieGen and Google Veo-3, highlight the growing industrial and academic focus on unified audio-vision architectures that learn from massive multimodal data. However, despite rapid progress, the literature remains fragmented, spanning diverse tasks, inconsistent taxonomies, and heterogeneous evaluation practices that impede systematic comparison and knowledge integration. This survey provides the first comprehensive review of AVI through the lens of large foundation models. We establish a unified taxonomy covering the broad landscape of AVI tasks, ranging from understanding (e.g., speech recognition, sound localization) to generation (e.g., audio-driven video synthesis, video-to-audio) and interaction (e.g., dialogue, embodied, or agentic interfaces). We synthesize methodological foundations, including modality tokenization, cross-modal fusion, autoregressive and diffusion-based generation, large-scale pretraining, instruction alignment, and preference optimization. Furthermore, we curate representative datasets, benchmarks, and evaluation metrics, offering a structured comparison across task families and identifying open challenges in synchronization, spatial reasoning, controllability, and safety. By consolidating this rapidly expanding field into a coherent framework, this survey aims to serve as a foundational reference for future research on large-scale AVI.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript is a broad survey of Audio-Visual Intelligence (AVI) in the era of large foundation models. It proposes a taxonomy organized along three task pillars (perception, generation, interaction) and three methodological families (representation-centric, generation-centric, LLM-centric), and surveys methods, datasets, benchmarks, and applications across roughly a decade of literature, with particular emphasis on 2024–2026 omni and joint AV systems. The authors claim three primary contributions: (i) the first comprehensive survey of AVI through the lens of foundation models, (ii) a "principled" unified taxonomy, and (iii) a curated synthesis of methods, benchmarks, and open challenges, with a public resource page and extensive tables comparing systems on AVQA, AVS, V2A, T2AV, and AV navigation benchmarks.
Significance. If the synthesis holds, the manuscript provides a useful and timely consolidation of a fast-moving, fragmented area: it spans understanding, generation, and interaction in one document, and unlike narrower prior surveys (talking heads, V2A, AVQA, AVS, multimodal LLMs, VLA), it explicitly tracks the recent convergence toward omni AV systems and joint AV generators (e.g., JavisDiT family, Ovi, MOVA, UniAVGen, MMAudio, HunyuanVideo-Foley) and toward embodied AV (Audio-VLA, Sound of Simulation). The benchmark tables (e.g., Tables 6–10, 12, 17, 21) and the staged roadmap in Section 9 are concrete and reusable. The accompanying public list (Awesome-AVI) is a credit to reproducibility of the literature mapping. The figure-driven evolutionary tree (Fig. 1) and the layered open-challenges chapter (Sec. 9) are the strongest organizing contributions and would be cited as references in their own right.
major comments (4)
- [§3 / §4.3.3 / Fig. 1: taxonomy as partition vs. overlay] Contribution 2 claims a 'principled taxonomy' that 'organizes the diverse audio-visual tasks ... while clarifying task scope, assumptions, and relationships'. However, the manuscript's own evidence indicates that the taxonomy is not a partition: GPT-4o, Qwen3-Omni, Gemini 2.5/3, and Ming-Omni are catalogued under §4.3.3 (unified models), §5.2.3 (AVQA baselines, Table 7/10), §6.3 (joint generation), and §7.1.3 (omni dialogue); VLA systems recur in §4.3.5 and §7.2; Fig. 1 itself draws 'AVI-Today' as a merged node. This does not invalidate the survey, but the wording 'principled taxonomy' overclaims. Please either (a) explicitly reframe Section 3 as an organizing overlay with documented overlaps and a cross-reference matrix listing which cells each frontier system populates, or (b) define the partition by *primary modeling objective* and acknowledge multi-cell membership. As written, a read
- [§1.1 'First Comprehensive Survey'] The novelty claim 'first comprehensive review of AVI through the lens of large foundation models' is strong given the existence of recent surveys cited in this manuscript on talking heads, V2A, AVQA, AV segmentation, multimodal LLMs, and VLA. Please add a dedicated 'Related Surveys' subsection that lists the closest prior surveys, states what they cover, and identifies what is genuinely new here (likely: the joint perception+generation+interaction scope, the 2025–2026 omni/joint-AV coverage, and the action-conditioned AV world-model framing in §9.2). Without this, the headline contribution is ambiguous.
- [§5.2.3 / Table 7 and §6.3.1 / Table 17] Several quantitative tables mix evaluation protocols. Table 7 mixes zero-shot (MUSIC-AVQA, AVSD) and in-domain (VGGSound) protocols across rows; Table 17 mixes closed and open-source systems whose scoring (e.g., DS, LS) is reported with different prompt sets across cited works. Please (i) state explicitly which numbers are taken verbatim from cited papers vs. re-evaluated, (ii) flag rows where results are not strictly comparable, and (iii) provide column-level definitions and date stamps. As a reference document, the survey will be cited for these numbers, so cross-paper comparability needs to be auditable.
- [§9 / Table 24] Section 9 introduces six future-direction axes that are well-motivated, but several are presented as if they were uncovered insights of this survey (e.g., 'event-source graphs', 'action-conditioned AV world models', 'layered AV memory'). Some of these are already partially formalized in cited prior work (NAF, AV-NeRF, AVLMaps, world-model VLA literature). Please attribute existing antecedents more explicitly and clarify which parts are this survey's framing vs. extensions of established proposals. This will sharpen the contribution and avoid the impression that the agenda is being claimed wholesale.
minor comments (10)
- [Header / arXiv ID] Please verify the arXiv identifier '2605.04045' and the date 'May 6, 2026' before camera-ready. Under the standard arXiv numbering scheme (YYMM.NNNNN), the prefix 2605 denotes May 2026, which would agree with the stated date, but the identifier may still be a placeholder.
- [Fig. 1] The evolutionary tree is dense and several leaf labels overlap (e.g., 'L.L.&Learn', 'Sound of Pixels', 'A VTS'). Consider increasing typographic spacing, splitting into two figures (pre-2023 vs. 2023–2026), or providing a high-resolution supplementary version. Also, the 'AVI-Today' node should be defined in the caption.
- [§2.1, eq. shapes] Notation for downsampled audio features uses both ⌊F/s_f⌋ × ⌊L/s_t⌋ × d and a free-form fraction; please standardize. Similarly, R^{W/s × H/s × d} in §2.2 is inconsistent with R^{W×H×3} earlier.
- [§4.2.4] MAR is described as 'mask-and-predict decoding'; please cite Li et al. (MAR, 2024) explicitly if intended, since 'masked autoregressive' is a recently disputed name in the literature.
- [§5.1.4 Table 3] Some entries (e.g., WS-AVS row) have very low J/F values that look like a different metric scale; please confirm and add a footnote on whether weakly-supervised numbers are directly comparable to fully-supervised ones.
- [§6.3.1 Table 16/17] Several benchmark suites listed (T2AV-Compass, PhyAVBench, VABench) are cited but the exact prompt counts and audio coverage vary; a small clarifying paragraph on which subsets are used for Table 17 would help.
- [§7.1.3] AVI-Bench [583] is mentioned in a single sentence at the end but not placed in any table. Given that this is a survey, even a one-row entry in a benchmark table would help readers find it.
- [Throughout] The phrasing 'recent work', 'recent advances', and 'recent systems' is heavily reused. Where possible, replace with year stamps (e.g., '2025 systems') so the survey ages more gracefully.
- [References] Several entries are clearly preprints without venue; please mark arXiv vs. peer-reviewed consistently, and check duplicate entries for VideoLLaMA / Video-LLaMA series.
- [§8 (Applications)] Section 8 is largely descriptive and re-cites systems already discussed. Consider compressing it or converting it into a single comparative table mapping applications to required AVI capabilities; otherwise it adds length without adding structure.
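For the notation comment on §2.1 and §2.2 above, one way to standardize the shapes is to use floor brackets throughout. This is a sketch of a possible convention, not the manuscript's notation. The symbols $\mathbf{A}$, $\mathbf{Z}_a$, $\mathbf{V}$, $\mathbf{Z}_v$ and the raw audio shape $F \times L$ are assumptions for illustration; $s_f$, $s_t$, $s$, and $d$ follow the shapes quoted in the comment.

```latex
\[
\mathbf{A} \in \mathbb{R}^{F \times L}
\;\longmapsto\;
\mathbf{Z}_a \in \mathbb{R}^{\lfloor F/s_f \rfloor \times \lfloor L/s_t \rfloor \times d},
\qquad
\mathbf{V} \in \mathbb{R}^{W \times H \times 3}
\;\longmapsto\;
\mathbf{Z}_v \in \mathbb{R}^{\lfloor W/s \rfloor \times \lfloor H/s \rfloor \times d}.
\]
```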
Simulated Author's Rebuttal
We thank the referee for a careful and constructive report and for recognizing the survey's scope, the evolutionary tree (Fig. 1), the staged roadmap of §9, and the public Awesome-AVI resource. We accept all four major comments and will revise accordingly. In brief: (1) We will reframe §3 as an organizing overlay rather than a partition, soften 'principled taxonomy' to 'unified organizing taxonomy', and add a new System × Cell cross-reference matrix that documents multi-cell membership for omni and VLA systems (GPT-4o, Gemini 2.5/3, Qwen2.5/3-Omni, Ming-Omni, Ovi, MOVA, UniAVGen, OpenVLA, π0, Audio-VLA, etc.). (2) We will add a dedicated 'Related Surveys and Differentiation' subsection that tabulates the closest prior surveys and states precisely what is new here (joint perception+generation+interaction scope, 2025–2026 omni/joint-AV coverage, action-conditioned AV world-model framing, verifier/reward agenda), and we will soften the 'first comprehensive survey' language to a defensible scoped claim. (3) We will repair the quantitative tables (Tables 7 and 17) by adding Protocol columns, verbatim-vs-re-evaluated provenance, comparability flags on non-comparable rows, full column-level metric definitions, and date stamps for leaderboard snapshots. (4) We will revise §9 and Table 24 to explicitly attribute antecedents (NAF, AV-NeRF, AV-GS, NeRAF, AVLMaps/MSLMaps, SoundSpaces, world-model VLA literature, DenseAV, CAV-MAE Sync) and to mark which elements are this survey's framing
Point-by-point responses
- Referee: Taxonomy as partition vs. overlay: GPT-4o, Qwen3-Omni, Gemini, Ming-Omni, and VLA systems appear in multiple sections, and Fig. 1 draws an 'AVI-Today' merged node. The wording 'principled taxonomy' overclaims. Please reframe Section 3 as an organizing overlay with documented overlaps and a cross-reference matrix, or define the partition by primary modeling objective and acknowledge multi-cell membership.
Authors: We accept this point. The taxonomy is in fact an organizing overlay, not a partition, and several frontier omni and VLA systems intentionally populate multiple cells. In the revision we will: (i) replace the phrase 'principled taxonomy' in §1.1 and the abstract with 'unified organizing taxonomy' and explicitly state that cells are defined by primary modeling objective with acknowledged multi-cell membership; (ii) add a paragraph at the start of §3 stating the overlay semantics and the rule for primary-cell assignment; (iii) introduce a new cross-reference matrix (System × Cell) listing GPT-4o, Gemini 2.5/3, Qwen2.5/3-Omni, Ming-Omni, Ovi, JavisDiT/JavisDiT++, MOVA, UniAVGen, VideoPoet, OpenVLA, π0, Audio-VLA, RynnVLA-002, etc., with their primary cell and secondary memberships pointing to §4.3.3, §5.2.3, §6.3, §7.1.3, §4.3.5, and §7.2; and (iv) annotate Fig. 1's 'AVI-Today' node as a deliberate merged node representing convergence rather than a taxonomic class. This preserves the organizing value of the taxonomy while removing the partition overclaim. revision: yes
- Referee: The 'first comprehensive survey' claim in §1.1 is strong given existing surveys on talking heads, V2A, AVQA, AVS, multimodal LLMs, and VLA. Please add a dedicated 'Related Surveys' subsection identifying what is genuinely new here.
Authors: We agree the headline claim needs explicit scoping. We will add a new subsection (proposed §1.2 'Related Surveys and Differentiation') that tabulates the closest prior surveys we already cite — covering talking-head synthesis, V2A, AVQA, AVS, multimodal LLMs, omni assistants, and VLA — listing for each its scope, time coverage, and modality span. Against this baseline we will state precisely what is new in our work: (a) a single-document scope spanning perception + generation + interaction under foundation-model paradigms; (b) coverage of 2025–2026 omni and joint AV systems (Qwen3-Omni, Ming-Omni, Gemini 2.5/3, Ovi, JavisDiT++, MOVA, UniAVGen, HunyuanVideo-Foley, MMAudio, LongCat-Flash-Omni); (c) the action-conditioned AV world-model framing in §9.2; and (d) the integrated verifier/reward and event-source grounding agenda in §9. We will soften 'first comprehensive review' to 'first comprehensive review jointly covering AV perception, generation, and interaction under the foundation-model paradigm', which is the defensible claim. revision: yes
- Referee: Tables 7 and 17 mix evaluation protocols (zero-shot vs. in-domain; closed/open-source with different prompt sets and scoring). State which numbers are verbatim vs. re-evaluated, flag rows that are not strictly comparable, and provide column-level definitions and date stamps.
Authors: This is a fair concern for a reference document. We will revise the quantitative tables as follows. (i) For Table 7 (open-ended AVQA on MUSIC-AVQA/AVSD/VGGSound), we will add a Protocol column distinguishing zero-shot from in-domain evaluation per row, and explicitly note that all numbers are taken verbatim from the cited works (primarily following the VideoLLaMA2 evaluation protocol of Cheng et al.); rows where the evaluation protocol or judge model differs from this reference protocol will be marked with a dagger and footnoted as 'not strictly comparable'. (ii) For Table 17 (T2AV on T2AV-Compass), we will add a footnote stating that all DS/LS and quality scores are taken from the T2AV-Compass leaderboard under its unified prompt set and judge configuration, with a date stamp of the leaderboard snapshot used; closed-source systems will be flagged, and any row whose underlying generation cannot be re-run under the same prompts will be marked accordingly. (iii) We will add column-level definitions (VT, VA, PQ, CU, A-V, T-A, T-V, DS, LS) in the table caption or an adjacent legend, and add a similar legend for Table 7 metrics. (iv) We confirm that we have not re-evaluated any system ourselves in these tables; this will be stated explicitly. Where comparability cannot be repaired (e.g., proprietary judges or unreleased prompts), we will say so rather than imply a clean ranking. revision: yes
- Referee: Section 9 / Table 24 presents future directions (event-source graphs, action-conditioned AV world models, layered AV memory) as if they were uncovered insights of this survey, while NAF, AV-NeRF, AVLMaps, and the world-model VLA literature already partially formalize them. Please attribute antecedents and clarify what is the survey's framing vs. an extension of established work.
Authors: We agree and will tighten attribution throughout §9. Specifically: (i) §9.1 will explicitly credit prior work on dense AV correspondence (DenseAV, CAV-MAE Sync) and event-aware V2A (Diff-Foley, FoleyCrafter, MMAudio, ThinkSound) as antecedents of the event-source grounding view; our framing contribution is the event-source-graph abstraction and the counterfactual training agenda, which we will mark as our extension. (ii) §9.2 will state up front that action-conditioned AV world models build directly on SoundSpaces/SoundSpaces 2.0, NAF, AV-NeRF, AV-GS, NeRAF, RAF, AVLMaps/MSLMaps, ManiWAV, Sound of Simulation, Audio-VLA, and RynnVLA-002, and that our contribution is the synthesis into a hybrid latent-dynamics + approximate-physics formulation rather than the world-model concept itself. (iii) §9.3 will note that hierarchical/episodic memory has antecedents in long-context VLM work and retrieval-augmented multimodal systems; our contribution is the AVI-specific layering with provenance-aware audio handles. (iv) Table 24 will be revised to add an 'Antecedents' column distinguishing established proposals from this survey's framing extensions. This sharpens the contribution and removes the impression of wholesale claiming. revision: yes
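The table repairs promised in the response on Tables 7 and 17 (Protocol column, verbatim-vs-re-evaluated provenance, dagger flags, date stamps) amount to attaching a few extra fields to every reported number. A minimal sketch follows, assuming hypothetical names (`Row`, `label`, `comparable`) and made-up rows. Only the field concepts and the MUSIC-AVQA benchmark name come from the text above.

```python
from dataclasses import dataclass
from enum import Enum


class Protocol(Enum):
    ZERO_SHOT = "zero-shot"
    IN_DOMAIN = "in-domain"


class Provenance(Enum):
    VERBATIM = "verbatim"          # copied from the cited paper
    RE_EVALUATED = "re-evaluated"  # re-run by the survey authors


@dataclass(frozen=True)
class Row:
    system: str
    benchmark: str
    protocol: Protocol
    provenance: Provenance
    matches_reference_protocol: bool  # False -> dagger, "not strictly comparable"
    snapshot_date: str                # ISO date of the source or leaderboard snapshot


def label(row):
    """System name with a dagger when the row is not strictly comparable."""
    return row.system if row.matches_reference_protocol else row.system + "\u2020"


def comparable(rows, benchmark, protocol):
    """Rows that can be ranked against each other on one benchmark."""
    return [r for r in rows
            if r.benchmark == benchmark
            and r.protocol is protocol
            and r.matches_reference_protocol]


# Hypothetical rows; systems and dates are placeholders, not survey data.
rows = [
    Row("system-A", "MUSIC-AVQA", Protocol.ZERO_SHOT, Provenance.VERBATIM, True, "2026-01-01"),
    Row("system-B", "MUSIC-AVQA", Protocol.ZERO_SHOT, Provenance.VERBATIM, False, "2026-01-01"),
    Row("system-C", "MUSIC-AVQA", Protocol.IN_DOMAIN, Provenance.VERBATIM, True, "2026-01-01"),
]
ranked_pool = comparable(rows, "MUSIC-AVQA", Protocol.ZERO_SHOT)
labels = [label(r) for r in rows]
```

Filtering before ranking keeps flagged or differently evaluated rows visible in the table without letting them enter a comparison they do not support.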
Circularity Check
Survey paper: no meaningful circularity. Self-citations are descriptive, not load-bearing for any derived prediction.
Full rationale
This is a survey/review paper, not a derivation paper. It makes no first-principles predictions, fits no parameters, and does not claim to derive any quantity X from input Y. Its central deliverables are (1) a taxonomy of AVI tasks and methods, (2) curated benchmark/method tables, and (3) a future-research agenda. None of these has a structure to which "circularity" applies: there is no equation chain where a fitted input is renamed as a prediction, and no uniqueness theorem from prior author work is invoked to forbid alternatives.
The paper does cite work by some of its own authors (e.g., JavisDiT, JavisDiT++, JavisGPT, JavisBench, OmniAVS, OISA, NExT-GPT, CoDi-2, Vitron). However, these citations function descriptively, listing the authors' own systems among many others in tables and method enumerations, rather than as load-bearing premises for any forced conclusion. The taxonomy claim ("first comprehensive", "principled taxonomy") is a rhetorical/organizational claim, not a derivation.
The skeptical reader's concern (that the taxonomy is a descriptive overlay rather than a partition because flagship omni models recur across cells) is a correctness/usefulness concern about the taxonomy contribution, not a circularity concern: it does not show that any predicted quantity equals an input by construction.
Per the rubric's hard rule 5 ("This is not standard consensus" is not circularity) and the guidance that surveys without fitted predictions should typically score 0–2, the appropriate score is low. There is mild self-citation (own benchmarks like JavisBench appear in evaluation tables alongside competitor benchmarks; own systems appear in method lists), but these citations are not used to force a uniqueness result, are externally checkable against the cited papers, and do not carry the weight of a derivation. Score: 1.