pith. machine review for the scientific record. sign in

arxiv: 2605.08723 · v1 · submitted 2026-05-09 · 💻 cs.CV · cs.MM

Recognition: 2 theorem links

· Lean Theorem

EAR: Enhancing Uni-Modal Representations for Weakly Supervised Audio-Visual Video Parsing

Authors on Pith no claims yet

Pith reviewed 2026-05-12 02:59 UTC · model grok-4.3

classification 💻 cs.CV cs.MM
keywords audio-visual video parsingweakly supervised learninguni-modal representationspseudo-label generationmulti-modal fusionevent localizationvideo analysis
0
0 comments X

The pith

Enhancing uni-modal representations improves weakly supervised audio-visual video parsing by better handling unaligned signals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that weakly supervised audio-visual video parsing struggles because audio and visual signals in videos are often unaligned, making accurate perception of events in each modality separately more important than combining them. Current approaches focus heavily on multi-modal fusion for pseudo-labels and model training but neglect to guide and preserve the semantics of individual modalities, which leads to noisy labels and weaker localization. The proposed EAR framework addresses this by adding similarity-based label migration to annotate pre-training data for better uni-modal understanding and by using soft constraints to model uni-modal features in parallel with fusion. This dual focus allows the system to attend to both separate and combined representations. Experiments demonstrate gains over prior methods in generating cleaner pseudo-labels and in overall event parsing performance.

Core claim

The authors establish that a framework enhancing uni-modal representations for both the pseudo-label generator and the AVVP model, via similarity-based label migration to annotate pre-training data and soft-constrained uni-modal feature modeling alongside multi-modal fusion, enables coordinated attention to uni-modal and cross-modal representations and thereby boosts localization performance for audio, visual, and audio-visual events.

What carries the argument

The EAR framework, which applies similarity-based label migration for uni-modal event annotation in pre-training and soft constraints to refine uni-modal features during multi-modal fusion.

If this is right

  • The pseudo-label generator gains a better understanding of uni-modal events through similarity-based annotation of pre-training data.
  • The AVVP model refines uni-modal feature modeling in parallel with multi-modal fusion via soft constraints.
  • Coordinated attention to both uni-modal and cross-modal representations improves localization of audio, visual, and joint events.
  • The overall method outperforms prior state-of-the-art approaches in both pseudo-label quality and full AVVP performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same emphasis on preserving uni-modal semantics could be applied to other weakly supervised multi-modal tasks where signals are frequently unaligned.
  • Focusing first on individual modalities before fusion may reduce reliance on large volumes of precisely aligned training data.
  • Performance gains may vary with the degree of audio-visual misalignment in test videos, suggesting targeted evaluation on alignment-stratified datasets.

Load-bearing premise

Accurate video parsing fundamentally requires precise perception of uni-modal events, which existing multi-modal strategies fail to guide and preserve adequately because they overemphasize fusion.

What would settle it

An ablation experiment on standard AVVP benchmarks that removes the similarity-based label migration and soft-constrained uni-modal components and measures whether pseudo-label accuracy and event localization drop to or below prior state-of-the-art levels.

Figures

Figures reproduced from arXiv: 2605.08723 by Huilai Li, Jianqin Yin, Xiaomeng Di, Yiming Wang, Ying Xing, Yonghao Dang.

Figure 1
Figure 1. Figure 1: (a)Illustration of the AVVP. The testing data contains segment-level, modality-specific labels, while the training data [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of EAR. In stage 1, pre-training is conducted using a large-scale, dense audio-visual event localization dataset [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Qualitative comparison of audio-visual video parsing with state-of-the-art methods. "GT" denotes the ground truth. [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Visualization of uni-modal event dependencies learned [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Visualization of multi-modal event dependencies learned [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗
read the original abstract

Weakly supervised Audio-Visual Video Parsing (AVVP) aims to recognize and temporally localize audio, visual, and audio-visual events in videos using only coarse-grained labels. Faced with the challenging task settings, existing research advances along two main paths: pre-training pseudo-label generators for fine-grained cross-modal semantic guidance, or refining AVVP model architectures to enhance audio-visual fusion. However, since audio and visual signals are typically unaligned, achieving accurate video parsing fundamentally relies on precise perception of uni-modal events. Yet these multi-modal focused strategies excessively emphasize multi-modal fusion while inadequately guiding and preserving uni-modal semantics, resulting in noisy pseudo-labels and sub-optimal video parsing performance. This paper proposes a novel framework that enhances uni-modal representations for both the pseudo-label generator and the AVVP model. Specifically, we introduce a similarity-based label migration approach to annotate pre-training data, thereby enabling the pseudo-label generator to better understand uni-modal events. We also employ a soft-constrained manner to refine modeling of uni-modal features in parallel with multi-modal fusion. These designs enable coordinated attention to both uni-modal and cross-modal representations, thus boosting the localization performance for events. Extensive experiments show that our method outperforms state-of-the-art methods in both pseudo-label and AVVP performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper proposes EAR, a framework for weakly supervised Audio-Visual Video Parsing (AVVP) that enhances uni-modal representations in both the pseudo-label generator (via similarity-based label migration for pre-training data annotation) and the AVVP model (via soft-constrained uni-modal feature modeling in parallel with multi-modal fusion). The central claim is that this coordinated uni-modal and cross-modal approach yields superior pseudo-label quality and AVVP performance compared to prior methods that over-emphasize fusion.

Significance. If the results hold, the work usefully shifts focus toward preserving uni-modal semantics in unaligned audio-visual settings, where precise event perception is foundational. It offers a practical way to mitigate noisy pseudo-labels without discarding fusion benefits, which could inform more balanced architectures in multi-modal video understanding.

major comments (1)
  1. [Experimental results] Experimental results (likely §4 and associated tables): the reported gains in aggregate AVVP F1 and pseudo-label accuracy are not accompanied by separate audio-only or visual-only localization metrics, nor by ablations that hold the fusion module fixed while varying only the uni-modal components. Without these, it remains unclear whether the headline improvements stem from the claimed uni-modal enhancements or from incidental fusion refinements, directly undermining attribution to the core premise.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their thoughtful review and constructive feedback on our manuscript. We address the major comment below and are prepared to revise the paper accordingly.

read point-by-point responses
  1. Referee: Experimental results (likely §4 and associated tables): the reported gains in aggregate AVVP F1 and pseudo-label accuracy are not accompanied by separate audio-only or visual-only localization metrics, nor by ablations that hold the fusion module fixed while varying only the uni-modal components. Without these, it remains unclear whether the headline improvements stem from the claimed uni-modal enhancements or from incidental fusion refinements, directly undermining attribution to the core premise.

    Authors: We agree that additional granularity in the results would strengthen attribution to the uni-modal components. Although the EAR framework introduces explicit uni-modal enhancements (similarity-based label migration in the pseudo-label generator and soft-constrained uni-modal feature modeling parallel to fusion), the current experiments emphasize aggregate AVVP F1 and overall pseudo-label accuracy. In the revised manuscript, we will report separate audio-only and visual-only localization metrics. We will also add ablations that hold the fusion module fixed while varying only the uni-modal components. These changes will directly demonstrate the contribution of the uni-modal enhancements and clarify that the gains are not incidental to fusion refinements. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical method proposal without self-referential derivations or fitted predictions.

full rationale

The paper proposes a framework for uni-modal enhancement in weakly supervised AVVP via similarity-based label migration and soft-constrained modeling, supported by experimental outperformance claims. No equations, parameter-fitting steps presented as predictions, self-citations as load-bearing premises, or uniqueness theorems appear in the provided content. The central claims rest on empirical results rather than any derivation chain that reduces to its own inputs by construction, making the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no mathematical formulations, parameters, or explicit assumptions; full paper needed to populate ledger.

pith-pipeline@v0.9.0 · 5541 in / 993 out tokens · 34017 ms · 2026-05-12T02:59:14.510330+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages

  1. [1]

    Event detection in field sports video using audio-visual features and a support vector machine,

    D. A. Sadlier and N. E. O’Connor, “Event detection in field sports video using audio-visual features and a support vector machine,”IEEE Trans. Circuits Syst. Video Technol., vol. 15, no. 10, pp. 1225–1233, 2005

  2. [2]

    A survey of content-aware video analysis for sports,

    H.-C. Shih, “A survey of content-aware video analysis for sports,”IEEE Trans. Circuits Syst. Video Technol., vol. 28, no. 5, pp. 1212–1231, 2017

  3. [3]

    Real-time multimodal human-avatar interaction,

    Y . Fu, R. Li, T. S. Huang, and M. Danielsen, “Real-time multimodal human-avatar interaction,”IEEE Trans. Circuits Syst. Video Technol., vol. 18, no. 4, pp. 467–477, 2008

  4. [4]

    Audio-visual automatic group affect analysis,

    G. Sharma, A. Dhall, and J. Cai, “Audio-visual automatic group affect analysis,”IEEE Trans. Affect. Comput., vol. 14, no. 2, pp. 1056–1069, 2021

  5. [5]

    Question-aware global-local video understanding network for audio-visual question answering,

    Z. Chen, L. Wang, P. Wang, and P. Gao, “Question-aware global-local video understanding network for audio-visual question answering,”IEEE Trans. Circuits Syst. Video Technol., vol. 34, no. 5, pp. 4109–4119, 2023

  6. [6]

    Language-guided joint audio-visual editing via one-shot adaptation,

    S. Liang, C. Huang, Y . Tian, A. Kumar, and C. Xu, “Language-guided joint audio-visual editing via one-shot adaptation,” inProc. Asian Conf. Comput. Vis. (ACCV), 2024, pp. 1011–1027. 11

  7. [7]

    Revisit weakly-supervised audio- visual video parsing from the language perspective,

    Y . Fan, Y . Wu, B. Du, and Y . Lin, “Revisit weakly-supervised audio- visual video parsing from the language perspective,” inProc. Adv. Neural Inform. Process. Syst. (NeurIPS), 2023, pp. 40 610–40 622

  8. [8]

    Advancing weakly-supervised audio-visual video parsing via segment-wise pseudo labeling,

    J. Zhou, D. Guo, Y . Zhong, and M. Wang, “Advancing weakly-supervised audio-visual video parsing via segment-wise pseudo labeling,”Int. J. Comput. Vis., vol. 132, no. 11, pp. 5308–5329, 2024

  9. [9]

    Modality-independent teachers meet weakly-supervised audio-visual event parser,

    Y .-H. Lai, Y .-C. Chen, and F. Wang, “Modality-independent teachers meet weakly-supervised audio-visual event parser,” inProc. Adv. Neural Inform. Process. Syst. (NeurIPS), 2023, pp. 73 633–73 651

  10. [10]

    Uwav: Uncertainty-weighted weakly-supervised audio- visual video parsing,

    Y . H. Lai, J. Ebbers, Y .-C. F. Wang, F. Germain, M. J. Jones, and M. Chatterjee, “Uwav: Uncertainty-weighted weakly-supervised audio- visual video parsing,” inProc. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), 2025, pp. 13 561–13 570

  11. [11]

    Unified multisensory perception: Weakly- supervised audio-visual video parsing,

    Y . Tian, D. Li, and C. Xu, “Unified multisensory perception: Weakly- supervised audio-visual video parsing,” inProc. Europ. Conf. Comput. Vis. (ECCV), 2020, pp. 436–454

  12. [12]

    Rethink cross-modal fusion in weakly- supervised audio-visual video parsing,

    Y . Xu, C. Hu, and G. H. Lee, “Rethink cross-modal fusion in weakly- supervised audio-visual video parsing,” inProc. IEEE/CVF Winter Conf. Appl. Comput. Vis. (WACV), 2024, pp. 5615–5624

  13. [13]

    Coleaf: A contrastive- collaborative learning framework for weakly supervised audio-visual video parsing,

    F. Sardari, A. Mustafa, P. J. Jackson, and A. Hilton, “Coleaf: A contrastive- collaborative learning framework for weakly supervised audio-visual video parsing,” inProc. Europ. Conf. Comput. Vis. (ECCV), 2024, pp. 1–17

  14. [14]

    Learning transferable visual models from natural language supervision,

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clarket al., “Learning transferable visual models from natural language supervision,” inProc. Int. Conf. Mach. Learn. (ICML), 2021, pp. 8748–8763

  15. [15]

    Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,

    Y . Wu, K. Chen, T. Zhang, Y . Hui, T. Berg-Kirkpatrick, and S. Dubnov, “Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,” inProc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2023, pp. 1–5

  16. [16]

    Dense-localizing audio-visual events in untrimmed videos: A large-scale benchmark and baseline,

    T. Geng, T. Wang, J. Duan, R. Cong, and F. Zheng, “Dense-localizing audio-visual events in untrimmed videos: A large-scale benchmark and baseline,” inProc. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), 2023, pp. 22 942–22 951

  17. [17]

    Link: Adaptive modality interaction for audio-visual video parsing,

    L. Wang, B. Zhu, Y . Chen, and J. Wang, “Link: Adaptive modality interaction for audio-visual video parsing,” inProc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2025, pp. 1–5

  18. [18]

    Segment-level event perception with semantic dictionary for weakly supervised audio- visual video parsing,

    Z. Xie, Y . Yang, Y . Yu, J. Wang, Y . Liu, and Y . Jiang, “Segment-level event perception with semantic dictionary for weakly supervised audio- visual video parsing,”Knowl.-Based Syst., vol. 310, p. 112884, 2025

  19. [19]

    Exploring heterogeneous clues for weakly- supervised audio-visual video parsing,

    Y . Wu and Y . Yang, “Exploring heterogeneous clues for weakly- supervised audio-visual video parsing,” inProc. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), 2021, pp. 1326–1335

  20. [20]

    Joint-modal label denoising for weakly-supervised audio-visual video parsing,

    H. Cheng, Z. Liu, H. Zhou, C. Qian, W. Wu, and L. Wang, “Joint-modal label denoising for weakly-supervised audio-visual video parsing,” in Proc. Europ. Conf. Comput. Vis. (ECCV), 2022, pp. 431–448

  21. [21]

    Weakly-supervised audio- visual video parsing with prototype-based pseudo-labeling,

    K. K. Rachavarapu, K. Ramakrishnanet al., “Weakly-supervised audio- visual video parsing with prototype-based pseudo-labeling,” inProc. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), 2024, pp. 18 952–18 962

  22. [22]

    Cm- pie: Cross-modal perception for interactive-enhanced audio-visual video parsing,

    Y . Chen, R. Guo, X. Liu, P. Wu, G. Li, Z. Li, and W. Wang, “Cm- pie: Cross-modal perception for interactive-enhanced audio-visual video parsing,” inProc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2024, pp. 8421–8425

  23. [23]

    Multi-level signal fusion for enhanced weakly-supervised audio-visual video parsing,

    X. Sun, X. Wang, Q. Liu, and X. Zhou, “Multi-level signal fusion for enhanced weakly-supervised audio-visual video parsing,”IEEE Signal Process. Lett., vol. 31, pp. 1149–1153, 2024

  24. [24]

    Mug: Pseudo labeling augmented audio-visual mamba network for audio-visual video parsing,

    L. Wang, B. Zhu, Y . Chen, Y . Zhang, M. Tang, and J. Wang, “Mug: Pseudo labeling augmented audio-visual mamba network for audio-visual video parsing,”arXiv preprint arXiv:2507.01384, 2025

  25. [25]

    Teacher- guided pseudo supervision and cross-modal alignment for audio-visual video parsing,

    Y . Chen, R. Guo, L. Gao, Y . Xiang, Q. Luo, Z. Li, and W. Wang, “Teacher- guided pseudo supervision and cross-modal alignment for audio-visual video parsing,”arXiv preprint arXiv:2509.14097, 2025

  26. [26]

    Ten-catg: Text-enriched audio-visual video parsing with multi-scale category-aware temporal graph,

    Y . Chen, F. Sardari, P. Zhang, R. Guo, Y . Xiang, Z. Li, and W. Wang, “Ten-catg: Text-enriched audio-visual video parsing with multi-scale category-aware temporal graph,”arXiv preprint arXiv:2509.04086, 2025

  27. [27]

    Exploring event misalignment bias and segment focus bias for weakly-supervised audio-visual video parsing,

    M. Li, S. Han, and X. Yuan, “Exploring event misalignment bias and segment focus bias for weakly-supervised audio-visual video parsing,” inProc. Int. Conf. Big-data Serv. Intell. Comput., 2024, pp. 48–56

  28. [28]

    Text-infused audio-visual video parsing with semantic-aware multimodal contrastive learning,

    P. Zhao, Y . Chen, D. Guo, and Y . Yao, “Text-infused audio-visual video parsing with semantic-aware multimodal contrastive learning,” inProc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2025, pp. 1–5

  29. [29]

    Temtg: Text-enhanced multi-hop temporal graph modeling for audio-visual video parsing,

    Y . Chen, P. Zhang, F. Li, F. Sardari, R. Guo, Z. Li, and W. Wang, “Temtg: Text-enhanced multi-hop temporal graph modeling for audio-visual video parsing,” inProc. 2025 Int. Conf. Multimed. Retr. (ICMR), 2025, pp. 1978–1982

  30. [30]

    Multimodal class-aware semantic enhancement network for audio-visual video parsing,

    P. Zhao, J. Zhou, Y . Zhao, D. Guo, and Y . Chen, “Multimodal class-aware semantic enhancement network for audio-visual video parsing,” inProc. AAAI Conf. Artif. Intell. (AAAI), 2025, pp. 10 448–10 456

  31. [31]

    Collecting cross-modal presence-absence evidence for weakly-supervised audio-visual event perception,

    J. Gao, M. Chen, and C. Xu, “Collecting cross-modal presence-absence evidence for weakly-supervised audio-visual event perception,” inProc. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), 2023, pp. 18 827– 18 836

  32. [32]

    Audio-visual event localization in unconstrained videos,

    Y . Tian, J. Shi, B. Li, Z. Duan, and C. Xu, “Audio-visual event localization in unconstrained videos,” inProc. Europ. Conf. Comput. Vis. (ECCV), 2018, pp. 247–263

  33. [33]

    Cross-modal relation- aware networks for audio-visual event localization,

    H. Xu, R. Zeng, Q. Wu, M. Tan, and C. Gan, “Cross-modal relation- aware networks for audio-visual event localization,” inProc. ACM Int. Conf. Multimedia (ACM MM), 2020, pp. 3893–3901

  34. [34]

    Dual perspective network for audio-visual event localization,

    V . Rao, M. I. Khalil, H. Li, P. Dai, and J. Lu, “Dual perspective network for audio-visual event localization,” inProc. Europ. Conf. Comput. Vis. (ECCV), 2022, pp. 689–704

  35. [35]

    Cace-net: Co-guidance attention and contrastive enhancement for effective audio-visual event localization,

    X. He, X. Liu, Y . Li, D. Zhao, G. Shen, Q. Kong, X. Yang, and Y . Zeng, “Cace-net: Co-guidance attention and contrastive enhancement for effective audio-visual event localization,” inProc. ACM Int. Conf. Multimedia (ACM MM), 2024, pp. 985–993

  36. [36]

    Audio-visual semantic graph network for audio-visual event localization,

    L. Liu, S. Li, and Y . Zhu, “Audio-visual semantic graph network for audio-visual event localization,” inProc. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), 2025, pp. 23 957–23 966

  37. [37]

    Dual attention matching for audio-visual event localization,

    Y . Wu, L. Zhu, Y . Yan, and Y . Yang, “Dual attention matching for audio-visual event localization,” inProc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2019, pp. 6292–6300

  38. [38]

    Leveraging the video-level semantic consistency of event for audio-visual event localization,

    Y . Jiang, J. Yin, and Y . Dang, “Leveraging the video-level semantic consistency of event for audio-visual event localization,”IEEE Trans. Multimedia, vol. 26, pp. 4617–4627, 2023

  39. [39]

    Listen with seeing: Cross-modal contrastive learning for audio-visual event localization,

    C. Sun, M. Chen, C. Zhu, S. Zhang, P. Lu, and J. Chen, “Listen with seeing: Cross-modal contrastive learning for audio-visual event localization,”IEEE Trans. Multimedia, vol. 27, pp. 2650–2665, 2025

  40. [40]

    Locality-aware cross- modal correspondence learning for dense audio-visual events localization,

    L. Xing, H. Qu, R. Yan, X. Shu, and J. Tang, “Locality-aware cross- modal correspondence learning for dense audio-visual events localization,” arXiv preprint arXiv:2409.07967, 2024

  41. [41]

    Fasten: Video event localization based on audio-visual feature alignment and multi-scale temporal enhancement,

    Y . Liu, Q. Wu, M. Zeng, Y . Liu, and Y . Pan, “Fasten: Video event localization based on audio-visual feature alignment and multi-scale temporal enhancement,”IEEE Signal Process. Lett., vol. 32, pp. 2010– 2014, 2025

  42. [42]

    Del: Dense event localization for multi-modal audio-visual understanding,

    M. Ahmadian, A. Shirian, F. Guerin, and A. Gilbert, “Del: Dense event localization for multi-modal audio-visual understanding,”arXiv preprint arXiv:2506.23196, 2025

  43. [43]

    Temporal-aware multimodal event network for dense audio-visual event localization,

    Y . Han, M. Yang, S. Zhang, and X. Li, “Temporal-aware multimodal event network for dense audio-visual event localization,” inProc. Int. Conf. Comput. Vis. Robot. Autom. Eng. (CRAE), 2025, pp. 253–259

  44. [44]

    Esg-net: Event-aware semantic guided network for dense audio-visual event localization,

    H. Li, Y . Dang, Y . Xing, Y . Wang, and J. Yin, “Esg-net: Event-aware semantic guided network for dense audio-visual event localization,”arXiv preprint arXiv:2507.09945, 2025

  45. [45]

    Clasp: Cross- modal salient anchor-based semantic propagation for weakly-supervised dense audio-visual event localization,

    J. Zhou, Z. Zhou, Y . Zhou, Y . Mao, Z. Duan, and D. Guo, “Clasp: Cross- modal salient anchor-based semantic propagation for weakly-supervised dense audio-visual event localization,”arXiv preprint arXiv:2508.04566, 2025

  46. [46]

    Towards open-vocabulary audio-visual event localization,

    J. Zhou, D. Guo, R. Guo, Y . Mao, J. Hu, Y . Zhong, X. Chang, and M. Wang, “Towards open-vocabulary audio-visual event localization,” inProc. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), 2025, pp. 8362–8371

  47. [47]

    Longvale: Vision-audio-language-event benchmark towards time-aware omni-modal perception of long videos,

    T. Geng, J. Zhang, Q. Wang, T. Wang, J. Duan, and F. Zheng, “Longvale: Vision-audio-language-event benchmark towards time-aware omni-modal perception of long videos,” inProc. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), 2025, pp. 18 959–18 969

  48. [48]

    Avicuna: Audio-visual llm with interleaver and context-boundary alignment for temporal referential dialogue,

    Y . Tang, D. Shimada, J. Bi, and C. Xu, “Avicuna: Audio-visual llm with interleaver and context-boundary alignment for temporal referential dialogue,”arXiv preprint arXiv:2403.16276, 2024

  49. [49]

    Video captioning with attention-based lstm and semantic consistency,

    L. Gao, Z. Guo, H. Zhang, X. Xu, and H. T. Shen, “Video captioning with attention-based lstm and semantic consistency,”IEEE Trans. Multimedia, vol. 19, no. 9, pp. 2045–2055, 2017

  50. [50]

    Action-driven semantic representation and aggregation for video captioning,

    T. Han, Y . Xu, J. Yu, Z. Yu, and S. Zhao, “Action-driven semantic representation and aggregation for video captioning,”IEEE Trans. Circuits Syst. Video Technol., vol. 35, no. 4, pp. 3383–3395, 2024

  51. [51]

    Dense- captioning events in videos,

    R. Krishna, K. Hata, F. Ren, L. Fei-Fei, and J. Carlos Niebles, “Dense- captioning events in videos,” inProc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2017, pp. 706–715

  52. [52]

    End-to-end dense video captioning with masked transformer,

    L. Zhou, Y . Zhou, J. J. Corso, R. Socher, and C. Xiong, “End-to-end dense video captioning with masked transformer,” inProc. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), 2018, pp. 8739–8748. 12

  53. [53]

    End-to-end dense video captioning with parallel decoding,

    T. Wang, R. Zhang, Z. Lu, F. Zheng, R. Cheng, and P. Luo, “End-to-end dense video captioning with parallel decoding,” inProc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2021, pp. 6847–6857

  54. [54]

    Streaming dense video captioning,

    X. Zhou, A. Arnab, S. Buch, S. Yan, A. Myers, X. Xiong, A. Nagrani, and C. Schmid, “Streaming dense video captioning,” inProc. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), 2024, pp. 18 243–18 252

  55. [55]

    Vid2seq: Large-scale pretraining of a visual language model for dense video captioning,

    A. Yang, A. Nagrani, P. H. Seo, A. Miech, J. Pont-Tuset, I. Laptev, J. Sivic, and C. Schmid, “Vid2seq: Large-scale pretraining of a visual language model for dense video captioning,” inProc. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), 2023, pp. 10 714–10 726

  56. [56]

    Multi-modal dense video captioning,

    V . Iashin and E. Rahtu, “Multi-modal dense video captioning,” inProc. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR) workshops, 2020, pp. 958–959

  57. [57]

    Multimodal representation fusion method for dense video captioning,

    H. Fang, Y . Li, and Y . Li, “Multimodal representation fusion method for dense video captioning,”Knowl.-Based Syst., vol. 324, p. 113856, 2025

  58. [58]

    Deep residual learning for image recognition,

    K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” inProc. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), 2016, pp. 770–778

  59. [59]

    A closer look at spatiotemporal convolutions for action recognition,

    D. Tran, H. Wang, L. Torresani, J. Ray, Y . LeCun, and M. Paluri, “A closer look at spatiotemporal convolutions for action recognition,” inProc. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), 2018, pp. 6450–6459

  60. [60]

    Cnn architectures for large-scale audio classification,

    S. Hershey, S. Chaudhuri, D. P. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seyboldet al., “Cnn architectures for large-scale audio classification,” inProc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2017, pp. 131–135

  61. [61]

    Dual-modality seq2seq network for audio-visual event localization,

    Y .-B. Lin, Y .-J. Li, and Y .-C. F. Wang, “Dual-modality seq2seq network for audio-visual event localization,” inProc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2019, pp. 2002–2006

  62. [62]

    Cross-modal background suppression for audio- visual event localization,

    Y . Xia and Z. Zhao, “Cross-modal background suppression for audio- visual event localization,” inProc. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), 2022, pp. 19 989–19 998

  63. [63]

    Multi-modal grouping network for weakly-supervised audio-visual video parsing,

    S. Mo and Y . Tian, “Multi-modal grouping network for weakly-supervised audio-visual video parsing,” inProc. Adv. Neural Inform. Process. Syst. (NeurIPS), 2022, pp. 34 722–34 733

  64. [64]

    Mm-pyramid: Multimodal pyramid attentional network for audio-visual event local- ization and video parsing,

    J. Yu, Y . Cheng, R.-W. Zhao, R. Feng, and Y . Zhang, “Mm-pyramid: Multimodal pyramid attentional network for audio-visual event local- ization and video parsing,” inProc. ACM Int. Conf. Multimedia (ACM MM), 2022, pp. 6241–6249

  65. [65]

    Exploring cross-video and cross-modality signals for weakly-supervised audio-visual video parsing,

    Y .-B. Lin, H.-Y . Tseng, H.-Y . Lee, Y .-Y . Lin, and M.-H. Yang, “Exploring cross-video and cross-modality signals for weakly-supervised audio-visual video parsing,” inProc. Adv. Neural Inform. Process. Syst. (NeurIPS), 2021, pp. 11 449–11 461

  66. [66]

    Dhhn: Dual hierarchical hybrid network for weakly-supervised audio-visual video parsing,

    X. Jiang, X. Xu, Z. Chen, J. Zhang, J. Song, F. Shen, H. Lu, and H. T. Shen, “Dhhn: Dual hierarchical hybrid network for weakly-supervised audio-visual video parsing,” inProc. ACM Int. Conf. Multimedia (ACM MM), 2022, pp. 719–727

  67. [67]

    Boosting positive segments for weakly- supervised audio-visual video parsing,

    K. K. Rachavarapuet al., “Boosting positive segments for weakly- supervised audio-visual video parsing,” inProc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2023, pp. 10 192–10 202

  68. [68]

    Multi-modal and multi-scale temporal fusion architecture search for audio-visual video parsing,

    J. Zhang and W. Li, “Multi-modal and multi-scale temporal fusion architecture search for audio-visual video parsing,” inProc. ACM Int. Conf. Multimedia (ACM MM), 2023, pp. 3328–3336

  69. [69]

    Label- anticipated event disentanglement for audio-visual video parsing,

    J. Zhou, D. Guo, Y . Mao, Y . Zhong, X. Chang, and M. Wang, “Label- anticipated event disentanglement for audio-visual video parsing,” in Proc. Europ. Conf. Comput. Vis. (ECCV), 2024, pp. 35–51

  70. [70]

    Resisting noise in pseudo labels: Audible video event parsing with evidential learning,

    X. Jiang, X. Xu, L. Zhu, Z. Sun, A. Cichocki, and H. T. Shen, “Resisting noise in pseudo labels: Audible video event parsing with evidential learning,”IEEE Trans. Neural Netw. Learn. Syst., vol. 36, no. 6, pp. 10 874–10 888, 2024. Huilai Lireceived the B.Eng. degree in 2021 from the Zhengzhou University, Zhengzhou, China. He is currently working toward th...

  71. [71]

    Her research interests include software testing and vulnerability detection

    She is currently an associate professor with the School of Intelligent Engineering and Automation, Beijing University of Posts and Telecommunications. Her research interests include software testing and vulnerability detection. Yonghao Dangreceived Bachelor degree in computer science and technology from the University of Jinan, Jinan, China, in 2018, and ...

  72. [72]

    Yiming Wanggraduated from Hebei GEO University with a Bachelor degree, Shijiazhuang, China

    His research interests include computer vision, image processing, and deep learning. Yiming Wanggraduated from Hebei GEO University with a Bachelor degree, Shijiazhuang, China. He is currently working toward the Ph.D. degree with the School of Intelligent Engineering and Automation, Beijing University of Posts and Telecommunications, Beijing, China. His r...

  73. [73]

    AMDF" denotes the Asymmetric Audio/Visual- Driven Fusion module, and

    She is currently a Professor with the School of Intelligent Engineering and Automation, Beijing University of Posts and Telecommunications, Beijing, China. Her research interests include service robots, pattern recognition, machine learning, and image processing. 13 SUPPLEMENTARYMATERIAL In the supplementary material, we provide detailed informa- tion abo...