arxiv: 2605.08723 · v1 · submitted 2026-05-09 · 💻 cs.CV · cs.MM

Recognition: 2 theorem links

· Lean Theorem

EAR: Enhancing Uni-Modal Representations for Weakly Supervised Audio-Visual Video Parsing

Huilai Li , Xiaomeng Di , Ying Xing , Yonghao Dang , Yiming Wang , Jianqin Yin

Authors on Pith no claims yet

Pith reviewed 2026-05-12 02:59 UTC · model grok-4.3

classification 💻 cs.CV cs.MM

keywords audio-visual video parsingweakly supervised learninguni-modal representationspseudo-label generationmulti-modal fusionevent localizationvideo analysis

0 comments

The pith

Enhancing uni-modal representations improves weakly supervised audio-visual video parsing by better handling unaligned signals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that weakly supervised audio-visual video parsing struggles because audio and visual signals in videos are often unaligned, making accurate perception of events in each modality separately more important than combining them. Current approaches focus heavily on multi-modal fusion for pseudo-labels and model training but neglect to guide and preserve the semantics of individual modalities, which leads to noisy labels and weaker localization. The proposed EAR framework addresses this by adding similarity-based label migration to annotate pre-training data for better uni-modal understanding and by using soft constraints to model uni-modal features in parallel with fusion. This dual focus allows the system to attend to both separate and combined representations. Experiments demonstrate gains over prior methods in generating cleaner pseudo-labels and in overall event parsing performance.

Core claim

The authors establish that a framework enhancing uni-modal representations for both the pseudo-label generator and the AVVP model, via similarity-based label migration to annotate pre-training data and soft-constrained uni-modal feature modeling alongside multi-modal fusion, enables coordinated attention to uni-modal and cross-modal representations and thereby boosts localization performance for audio, visual, and audio-visual events.

What carries the argument

The EAR framework, which applies similarity-based label migration for uni-modal event annotation in pre-training and soft constraints to refine uni-modal features during multi-modal fusion.

If this is right

The pseudo-label generator gains a better understanding of uni-modal events through similarity-based annotation of pre-training data.
The AVVP model refines uni-modal feature modeling in parallel with multi-modal fusion via soft constraints.
Coordinated attention to both uni-modal and cross-modal representations improves localization of audio, visual, and joint events.
The overall method outperforms prior state-of-the-art approaches in both pseudo-label quality and full AVVP performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same emphasis on preserving uni-modal semantics could be applied to other weakly supervised multi-modal tasks where signals are frequently unaligned.
Focusing first on individual modalities before fusion may reduce reliance on large volumes of precisely aligned training data.
Performance gains may vary with the degree of audio-visual misalignment in test videos, suggesting targeted evaluation on alignment-stratified datasets.

Load-bearing premise

Accurate video parsing fundamentally requires precise perception of uni-modal events, which existing multi-modal strategies fail to guide and preserve adequately because they overemphasize fusion.

What would settle it

An ablation experiment on standard AVVP benchmarks that removes the similarity-based label migration and soft-constrained uni-modal components and measures whether pseudo-label accuracy and event localization drop to or below prior state-of-the-art levels.

Figures

Figures reproduced from arXiv: 2605.08723 by Huilai Li, Jianqin Yin, Xiaomeng Di, Yiming Wang, Ying Xing, Yonghao Dang.

**Figure 1.** Figure 1: (a)Illustration of the AVVP. The testing data contains segment-level, modality-specific labels, while the training data [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗

**Figure 2.** Figure 2: Overview of EAR. In stage 1, pre-training is conducted using a large-scale, dense audio-visual event localization dataset [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Qualitative comparison of audio-visual video parsing with state-of-the-art methods. "GT" denotes the ground truth. [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗

**Figure 4.** Figure 4: Visualization of uni-modal event dependencies learned [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

**Figure 5.** Figure 5: Visualization of multi-modal event dependencies learned [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗

read the original abstract

Weakly supervised Audio-Visual Video Parsing (AVVP) aims to recognize and temporally localize audio, visual, and audio-visual events in videos using only coarse-grained labels. Faced with the challenging task settings, existing research advances along two main paths: pre-training pseudo-label generators for fine-grained cross-modal semantic guidance, or refining AVVP model architectures to enhance audio-visual fusion. However, since audio and visual signals are typically unaligned, achieving accurate video parsing fundamentally relies on precise perception of uni-modal events. Yet these multi-modal focused strategies excessively emphasize multi-modal fusion while inadequately guiding and preserving uni-modal semantics, resulting in noisy pseudo-labels and sub-optimal video parsing performance. This paper proposes a novel framework that enhances uni-modal representations for both the pseudo-label generator and the AVVP model. Specifically, we introduce a similarity-based label migration approach to annotate pre-training data, thereby enabling the pseudo-label generator to better understand uni-modal events. We also employ a soft-constrained manner to refine modeling of uni-modal features in parallel with multi-modal fusion. These designs enable coordinated attention to both uni-modal and cross-modal representations, thus boosting the localization performance for events. Extensive experiments show that our method outperforms state-of-the-art methods in both pseudo-label and AVVP performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

EAR adds label migration and soft uni-modal constraints to AVVP but the results do not clearly tie gains to better uni-modal perception rather than fusion tweaks.

read the letter

The paper's main move is to push uni-modal representations harder in weakly supervised audio-visual video parsing. They argue that current paths either pre-train pseudo-label generators or refine fusion, but both let uni-modal semantics get noisy because audio and visual signals are unaligned. Their response is a similarity-based label migration step that annotates pre-training data to help the generator learn uni-modal events better, plus a soft constraint that refines uni-modal features in parallel with the multi-modal path. These are concrete additions that try to balance the two sides instead of letting fusion dominate.

Referee Report

1 major / 0 minor

Summary. The paper proposes EAR, a framework for weakly supervised Audio-Visual Video Parsing (AVVP) that enhances uni-modal representations in both the pseudo-label generator (via similarity-based label migration for pre-training data annotation) and the AVVP model (via soft-constrained uni-modal feature modeling in parallel with multi-modal fusion). The central claim is that this coordinated uni-modal and cross-modal approach yields superior pseudo-label quality and AVVP performance compared to prior methods that over-emphasize fusion.

Significance. If the results hold, the work usefully shifts focus toward preserving uni-modal semantics in unaligned audio-visual settings, where precise event perception is foundational. It offers a practical way to mitigate noisy pseudo-labels without discarding fusion benefits, which could inform more balanced architectures in multi-modal video understanding.

major comments (1)

[Experimental results] Experimental results (likely §4 and associated tables): the reported gains in aggregate AVVP F1 and pseudo-label accuracy are not accompanied by separate audio-only or visual-only localization metrics, nor by ablations that hold the fusion module fixed while varying only the uni-modal components. Without these, it remains unclear whether the headline improvements stem from the claimed uni-modal enhancements or from incidental fusion refinements, directly undermining attribution to the core premise.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their thoughtful review and constructive feedback on our manuscript. We address the major comment below and are prepared to revise the paper accordingly.

read point-by-point responses

Referee: Experimental results (likely §4 and associated tables): the reported gains in aggregate AVVP F1 and pseudo-label accuracy are not accompanied by separate audio-only or visual-only localization metrics, nor by ablations that hold the fusion module fixed while varying only the uni-modal components. Without these, it remains unclear whether the headline improvements stem from the claimed uni-modal enhancements or from incidental fusion refinements, directly undermining attribution to the core premise.

Authors: We agree that additional granularity in the results would strengthen attribution to the uni-modal components. Although the EAR framework introduces explicit uni-modal enhancements (similarity-based label migration in the pseudo-label generator and soft-constrained uni-modal feature modeling parallel to fusion), the current experiments emphasize aggregate AVVP F1 and overall pseudo-label accuracy. In the revised manuscript, we will report separate audio-only and visual-only localization metrics. We will also add ablations that hold the fusion module fixed while varying only the uni-modal components. These changes will directly demonstrate the contribution of the uni-modal enhancements and clarify that the gains are not incidental to fusion refinements. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical method proposal without self-referential derivations or fitted predictions.

full rationale

The paper proposes a framework for uni-modal enhancement in weakly supervised AVVP via similarity-based label migration and soft-constrained modeling, supported by experimental outperformance claims. No equations, parameter-fitting steps presented as predictions, self-citations as load-bearing premises, or uniqueness theorems appear in the provided content. The central claims rest on empirical results rather than any derivation chain that reduces to its own inputs by construction, making the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no mathematical formulations, parameters, or explicit assumptions; full paper needed to populate ledger.

pith-pipeline@v0.9.0 · 5541 in / 993 out tokens · 34017 ms · 2026-05-12T02:59:14.510330+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

similarity-based label migration... cosine similarity matrix... asymmetric audio/visual-driven fusion... multi-event relationship modeling
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean LogicNat recovery unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

soft-constrained uni-modal modeling... BCE loss... MMIL pooling

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages

[1]

Event detection in field sports video using audio-visual features and a support vector machine,

D. A. Sadlier and N. E. O’Connor, “Event detection in field sports video using audio-visual features and a support vector machine,”IEEE Trans. Circuits Syst. Video Technol., vol. 15, no. 10, pp. 1225–1233, 2005

work page 2005
[2]

A survey of content-aware video analysis for sports,

H.-C. Shih, “A survey of content-aware video analysis for sports,”IEEE Trans. Circuits Syst. Video Technol., vol. 28, no. 5, pp. 1212–1231, 2017

work page 2017
[3]

Real-time multimodal human-avatar interaction,

Y . Fu, R. Li, T. S. Huang, and M. Danielsen, “Real-time multimodal human-avatar interaction,”IEEE Trans. Circuits Syst. Video Technol., vol. 18, no. 4, pp. 467–477, 2008

work page 2008
[4]

Audio-visual automatic group affect analysis,

G. Sharma, A. Dhall, and J. Cai, “Audio-visual automatic group affect analysis,”IEEE Trans. Affect. Comput., vol. 14, no. 2, pp. 1056–1069, 2021

work page 2021
[5]

Question-aware global-local video understanding network for audio-visual question answering,

Z. Chen, L. Wang, P. Wang, and P. Gao, “Question-aware global-local video understanding network for audio-visual question answering,”IEEE Trans. Circuits Syst. Video Technol., vol. 34, no. 5, pp. 4109–4119, 2023

work page 2023
[6]

Language-guided joint audio-visual editing via one-shot adaptation,

S. Liang, C. Huang, Y . Tian, A. Kumar, and C. Xu, “Language-guided joint audio-visual editing via one-shot adaptation,” inProc. Asian Conf. Comput. Vis. (ACCV), 2024, pp. 1011–1027. 11

work page 2024
[7]

Revisit weakly-supervised audio- visual video parsing from the language perspective,

Y . Fan, Y . Wu, B. Du, and Y . Lin, “Revisit weakly-supervised audio- visual video parsing from the language perspective,” inProc. Adv. Neural Inform. Process. Syst. (NeurIPS), 2023, pp. 40 610–40 622

work page 2023
[8]

Advancing weakly-supervised audio-visual video parsing via segment-wise pseudo labeling,

J. Zhou, D. Guo, Y . Zhong, and M. Wang, “Advancing weakly-supervised audio-visual video parsing via segment-wise pseudo labeling,”Int. J. Comput. Vis., vol. 132, no. 11, pp. 5308–5329, 2024

work page 2024
[9]

Modality-independent teachers meet weakly-supervised audio-visual event parser,

Y .-H. Lai, Y .-C. Chen, and F. Wang, “Modality-independent teachers meet weakly-supervised audio-visual event parser,” inProc. Adv. Neural Inform. Process. Syst. (NeurIPS), 2023, pp. 73 633–73 651

work page 2023
[10]

Uwav: Uncertainty-weighted weakly-supervised audio- visual video parsing,

Y . H. Lai, J. Ebbers, Y .-C. F. Wang, F. Germain, M. J. Jones, and M. Chatterjee, “Uwav: Uncertainty-weighted weakly-supervised audio- visual video parsing,” inProc. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), 2025, pp. 13 561–13 570

work page 2025
[11]

Unified multisensory perception: Weakly- supervised audio-visual video parsing,

Y . Tian, D. Li, and C. Xu, “Unified multisensory perception: Weakly- supervised audio-visual video parsing,” inProc. Europ. Conf. Comput. Vis. (ECCV), 2020, pp. 436–454

work page 2020
[12]

Rethink cross-modal fusion in weakly- supervised audio-visual video parsing,

Y . Xu, C. Hu, and G. H. Lee, “Rethink cross-modal fusion in weakly- supervised audio-visual video parsing,” inProc. IEEE/CVF Winter Conf. Appl. Comput. Vis. (WACV), 2024, pp. 5615–5624

work page 2024
[13]

Coleaf: A contrastive- collaborative learning framework for weakly supervised audio-visual video parsing,

F. Sardari, A. Mustafa, P. J. Jackson, and A. Hilton, “Coleaf: A contrastive- collaborative learning framework for weakly supervised audio-visual video parsing,” inProc. Europ. Conf. Comput. Vis. (ECCV), 2024, pp. 1–17

work page 2024
[14]

Learning transferable visual models from natural language supervision,

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clarket al., “Learning transferable visual models from natural language supervision,” inProc. Int. Conf. Mach. Learn. (ICML), 2021, pp. 8748–8763

work page 2021
[15]

Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,

Y . Wu, K. Chen, T. Zhang, Y . Hui, T. Berg-Kirkpatrick, and S. Dubnov, “Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,” inProc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2023, pp. 1–5

work page 2023
[16]

Dense-localizing audio-visual events in untrimmed videos: A large-scale benchmark and baseline,

T. Geng, T. Wang, J. Duan, R. Cong, and F. Zheng, “Dense-localizing audio-visual events in untrimmed videos: A large-scale benchmark and baseline,” inProc. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), 2023, pp. 22 942–22 951

work page 2023
[17]

Link: Adaptive modality interaction for audio-visual video parsing,

L. Wang, B. Zhu, Y . Chen, and J. Wang, “Link: Adaptive modality interaction for audio-visual video parsing,” inProc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2025, pp. 1–5

work page 2025
[18]

Segment-level event perception with semantic dictionary for weakly supervised audio- visual video parsing,

Z. Xie, Y . Yang, Y . Yu, J. Wang, Y . Liu, and Y . Jiang, “Segment-level event perception with semantic dictionary for weakly supervised audio- visual video parsing,”Knowl.-Based Syst., vol. 310, p. 112884, 2025

work page 2025
[19]

Exploring heterogeneous clues for weakly- supervised audio-visual video parsing,

Y . Wu and Y . Yang, “Exploring heterogeneous clues for weakly- supervised audio-visual video parsing,” inProc. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), 2021, pp. 1326–1335

work page 2021
[20]

Joint-modal label denoising for weakly-supervised audio-visual video parsing,

H. Cheng, Z. Liu, H. Zhou, C. Qian, W. Wu, and L. Wang, “Joint-modal label denoising for weakly-supervised audio-visual video parsing,” in Proc. Europ. Conf. Comput. Vis. (ECCV), 2022, pp. 431–448

work page 2022
[21]

Weakly-supervised audio- visual video parsing with prototype-based pseudo-labeling,

K. K. Rachavarapu, K. Ramakrishnanet al., “Weakly-supervised audio- visual video parsing with prototype-based pseudo-labeling,” inProc. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), 2024, pp. 18 952–18 962

work page 2024
[22]

Cm- pie: Cross-modal perception for interactive-enhanced audio-visual video parsing,

Y . Chen, R. Guo, X. Liu, P. Wu, G. Li, Z. Li, and W. Wang, “Cm- pie: Cross-modal perception for interactive-enhanced audio-visual video parsing,” inProc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2024, pp. 8421–8425

work page 2024
[23]

Multi-level signal fusion for enhanced weakly-supervised audio-visual video parsing,

X. Sun, X. Wang, Q. Liu, and X. Zhou, “Multi-level signal fusion for enhanced weakly-supervised audio-visual video parsing,”IEEE Signal Process. Lett., vol. 31, pp. 1149–1153, 2024

work page 2024
[24]

Mug: Pseudo labeling augmented audio-visual mamba network for audio-visual video parsing,

L. Wang, B. Zhu, Y . Chen, Y . Zhang, M. Tang, and J. Wang, “Mug: Pseudo labeling augmented audio-visual mamba network for audio-visual video parsing,”arXiv preprint arXiv:2507.01384, 2025

work page arXiv 2025
[25]

Teacher- guided pseudo supervision and cross-modal alignment for audio-visual video parsing,

Y . Chen, R. Guo, L. Gao, Y . Xiang, Q. Luo, Z. Li, and W. Wang, “Teacher- guided pseudo supervision and cross-modal alignment for audio-visual video parsing,”arXiv preprint arXiv:2509.14097, 2025

work page arXiv 2025
[26]

Ten-catg: Text-enriched audio-visual video parsing with multi-scale category-aware temporal graph,

Y . Chen, F. Sardari, P. Zhang, R. Guo, Y . Xiang, Z. Li, and W. Wang, “Ten-catg: Text-enriched audio-visual video parsing with multi-scale category-aware temporal graph,”arXiv preprint arXiv:2509.04086, 2025

work page arXiv 2025
[27]

Exploring event misalignment bias and segment focus bias for weakly-supervised audio-visual video parsing,

M. Li, S. Han, and X. Yuan, “Exploring event misalignment bias and segment focus bias for weakly-supervised audio-visual video parsing,” inProc. Int. Conf. Big-data Serv. Intell. Comput., 2024, pp. 48–56

work page 2024
[28]

Text-infused audio-visual video parsing with semantic-aware multimodal contrastive learning,

P. Zhao, Y . Chen, D. Guo, and Y . Yao, “Text-infused audio-visual video parsing with semantic-aware multimodal contrastive learning,” inProc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2025, pp. 1–5

work page 2025
[29]

Temtg: Text-enhanced multi-hop temporal graph modeling for audio-visual video parsing,

Y . Chen, P. Zhang, F. Li, F. Sardari, R. Guo, Z. Li, and W. Wang, “Temtg: Text-enhanced multi-hop temporal graph modeling for audio-visual video parsing,” inProc. 2025 Int. Conf. Multimed. Retr. (ICMR), 2025, pp. 1978–1982

work page 2025
[30]

Multimodal class-aware semantic enhancement network for audio-visual video parsing,

P. Zhao, J. Zhou, Y . Zhao, D. Guo, and Y . Chen, “Multimodal class-aware semantic enhancement network for audio-visual video parsing,” inProc. AAAI Conf. Artif. Intell. (AAAI), 2025, pp. 10 448–10 456

work page 2025
[31]

Collecting cross-modal presence-absence evidence for weakly-supervised audio-visual event perception,

J. Gao, M. Chen, and C. Xu, “Collecting cross-modal presence-absence evidence for weakly-supervised audio-visual event perception,” inProc. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), 2023, pp. 18 827– 18 836

work page 2023
[32]

Audio-visual event localization in unconstrained videos,

Y . Tian, J. Shi, B. Li, Z. Duan, and C. Xu, “Audio-visual event localization in unconstrained videos,” inProc. Europ. Conf. Comput. Vis. (ECCV), 2018, pp. 247–263

work page 2018
[33]

Cross-modal relation- aware networks for audio-visual event localization,

H. Xu, R. Zeng, Q. Wu, M. Tan, and C. Gan, “Cross-modal relation- aware networks for audio-visual event localization,” inProc. ACM Int. Conf. Multimedia (ACM MM), 2020, pp. 3893–3901

work page 2020
[34]

Dual perspective network for audio-visual event localization,

V . Rao, M. I. Khalil, H. Li, P. Dai, and J. Lu, “Dual perspective network for audio-visual event localization,” inProc. Europ. Conf. Comput. Vis. (ECCV), 2022, pp. 689–704

work page 2022
[35]

Cace-net: Co-guidance attention and contrastive enhancement for effective audio-visual event localization,

X. He, X. Liu, Y . Li, D. Zhao, G. Shen, Q. Kong, X. Yang, and Y . Zeng, “Cace-net: Co-guidance attention and contrastive enhancement for effective audio-visual event localization,” inProc. ACM Int. Conf. Multimedia (ACM MM), 2024, pp. 985–993

work page 2024
[36]

Audio-visual semantic graph network for audio-visual event localization,

L. Liu, S. Li, and Y . Zhu, “Audio-visual semantic graph network for audio-visual event localization,” inProc. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), 2025, pp. 23 957–23 966

work page 2025
[37]

Dual attention matching for audio-visual event localization,

Y . Wu, L. Zhu, Y . Yan, and Y . Yang, “Dual attention matching for audio-visual event localization,” inProc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2019, pp. 6292–6300

work page 2019
[38]

Leveraging the video-level semantic consistency of event for audio-visual event localization,

Y . Jiang, J. Yin, and Y . Dang, “Leveraging the video-level semantic consistency of event for audio-visual event localization,”IEEE Trans. Multimedia, vol. 26, pp. 4617–4627, 2023

work page 2023
[39]

Listen with seeing: Cross-modal contrastive learning for audio-visual event localization,

C. Sun, M. Chen, C. Zhu, S. Zhang, P. Lu, and J. Chen, “Listen with seeing: Cross-modal contrastive learning for audio-visual event localization,”IEEE Trans. Multimedia, vol. 27, pp. 2650–2665, 2025

work page 2025
[40]

Locality-aware cross- modal correspondence learning for dense audio-visual events localization,

L. Xing, H. Qu, R. Yan, X. Shu, and J. Tang, “Locality-aware cross- modal correspondence learning for dense audio-visual events localization,” arXiv preprint arXiv:2409.07967, 2024

work page arXiv 2024
[41]

Fasten: Video event localization based on audio-visual feature alignment and multi-scale temporal enhancement,

Y . Liu, Q. Wu, M. Zeng, Y . Liu, and Y . Pan, “Fasten: Video event localization based on audio-visual feature alignment and multi-scale temporal enhancement,”IEEE Signal Process. Lett., vol. 32, pp. 2010– 2014, 2025

work page 2010
[42]

Del: Dense event localization for multi-modal audio-visual understanding,

M. Ahmadian, A. Shirian, F. Guerin, and A. Gilbert, “Del: Dense event localization for multi-modal audio-visual understanding,”arXiv preprint arXiv:2506.23196, 2025

work page arXiv 2025
[43]

Temporal-aware multimodal event network for dense audio-visual event localization,

Y . Han, M. Yang, S. Zhang, and X. Li, “Temporal-aware multimodal event network for dense audio-visual event localization,” inProc. Int. Conf. Comput. Vis. Robot. Autom. Eng. (CRAE), 2025, pp. 253–259

work page 2025
[44]

Esg-net: Event-aware semantic guided network for dense audio-visual event localization,

H. Li, Y . Dang, Y . Xing, Y . Wang, and J. Yin, “Esg-net: Event-aware semantic guided network for dense audio-visual event localization,”arXiv preprint arXiv:2507.09945, 2025

work page arXiv 2025
[45]

Clasp: Cross- modal salient anchor-based semantic propagation for weakly-supervised dense audio-visual event localization,

J. Zhou, Z. Zhou, Y . Zhou, Y . Mao, Z. Duan, and D. Guo, “Clasp: Cross- modal salient anchor-based semantic propagation for weakly-supervised dense audio-visual event localization,”arXiv preprint arXiv:2508.04566, 2025

work page arXiv 2025
[46]

Towards open-vocabulary audio-visual event localization,

J. Zhou, D. Guo, R. Guo, Y . Mao, J. Hu, Y . Zhong, X. Chang, and M. Wang, “Towards open-vocabulary audio-visual event localization,” inProc. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), 2025, pp. 8362–8371

work page 2025
[47]

Longvale: Vision-audio-language-event benchmark towards time-aware omni-modal perception of long videos,

T. Geng, J. Zhang, Q. Wang, T. Wang, J. Duan, and F. Zheng, “Longvale: Vision-audio-language-event benchmark towards time-aware omni-modal perception of long videos,” inProc. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), 2025, pp. 18 959–18 969

work page 2025
[48]

Avicuna: Audio-visual llm with interleaver and context-boundary alignment for temporal referential dialogue,

Y . Tang, D. Shimada, J. Bi, and C. Xu, “Avicuna: Audio-visual llm with interleaver and context-boundary alignment for temporal referential dialogue,”arXiv preprint arXiv:2403.16276, 2024

work page arXiv 2024
[49]

Video captioning with attention-based lstm and semantic consistency,

L. Gao, Z. Guo, H. Zhang, X. Xu, and H. T. Shen, “Video captioning with attention-based lstm and semantic consistency,”IEEE Trans. Multimedia, vol. 19, no. 9, pp. 2045–2055, 2017

work page 2045
[50]

Action-driven semantic representation and aggregation for video captioning,

T. Han, Y . Xu, J. Yu, Z. Yu, and S. Zhao, “Action-driven semantic representation and aggregation for video captioning,”IEEE Trans. Circuits Syst. Video Technol., vol. 35, no. 4, pp. 3383–3395, 2024

work page 2024
[51]

Dense- captioning events in videos,

R. Krishna, K. Hata, F. Ren, L. Fei-Fei, and J. Carlos Niebles, “Dense- captioning events in videos,” inProc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2017, pp. 706–715

work page 2017
[52]

End-to-end dense video captioning with masked transformer,

L. Zhou, Y . Zhou, J. J. Corso, R. Socher, and C. Xiong, “End-to-end dense video captioning with masked transformer,” inProc. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), 2018, pp. 8739–8748. 12

work page 2018
[53]

End-to-end dense video captioning with parallel decoding,

T. Wang, R. Zhang, Z. Lu, F. Zheng, R. Cheng, and P. Luo, “End-to-end dense video captioning with parallel decoding,” inProc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2021, pp. 6847–6857

work page 2021
[54]

Streaming dense video captioning,

X. Zhou, A. Arnab, S. Buch, S. Yan, A. Myers, X. Xiong, A. Nagrani, and C. Schmid, “Streaming dense video captioning,” inProc. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), 2024, pp. 18 243–18 252

work page 2024
[55]

Vid2seq: Large-scale pretraining of a visual language model for dense video captioning,

A. Yang, A. Nagrani, P. H. Seo, A. Miech, J. Pont-Tuset, I. Laptev, J. Sivic, and C. Schmid, “Vid2seq: Large-scale pretraining of a visual language model for dense video captioning,” inProc. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), 2023, pp. 10 714–10 726

work page 2023
[56]

Multi-modal dense video captioning,

V . Iashin and E. Rahtu, “Multi-modal dense video captioning,” inProc. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR) workshops, 2020, pp. 958–959

work page 2020
[57]

Multimodal representation fusion method for dense video captioning,

H. Fang, Y . Li, and Y . Li, “Multimodal representation fusion method for dense video captioning,”Knowl.-Based Syst., vol. 324, p. 113856, 2025

work page 2025
[58]

Deep residual learning for image recognition,

K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” inProc. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), 2016, pp. 770–778

work page 2016
[59]

A closer look at spatiotemporal convolutions for action recognition,

D. Tran, H. Wang, L. Torresani, J. Ray, Y . LeCun, and M. Paluri, “A closer look at spatiotemporal convolutions for action recognition,” inProc. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), 2018, pp. 6450–6459

work page 2018
[60]

Cnn architectures for large-scale audio classification,

S. Hershey, S. Chaudhuri, D. P. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seyboldet al., “Cnn architectures for large-scale audio classification,” inProc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2017, pp. 131–135

work page 2017
[61]

Dual-modality seq2seq network for audio-visual event localization,

Y .-B. Lin, Y .-J. Li, and Y .-C. F. Wang, “Dual-modality seq2seq network for audio-visual event localization,” inProc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2019, pp. 2002–2006

work page 2019
[62]

Cross-modal background suppression for audio- visual event localization,

Y . Xia and Z. Zhao, “Cross-modal background suppression for audio- visual event localization,” inProc. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), 2022, pp. 19 989–19 998

work page 2022
[63]

Multi-modal grouping network for weakly-supervised audio-visual video parsing,

S. Mo and Y . Tian, “Multi-modal grouping network for weakly-supervised audio-visual video parsing,” inProc. Adv. Neural Inform. Process. Syst. (NeurIPS), 2022, pp. 34 722–34 733

work page 2022
[64]

Mm-pyramid: Multimodal pyramid attentional network for audio-visual event local- ization and video parsing,

J. Yu, Y . Cheng, R.-W. Zhao, R. Feng, and Y . Zhang, “Mm-pyramid: Multimodal pyramid attentional network for audio-visual event local- ization and video parsing,” inProc. ACM Int. Conf. Multimedia (ACM MM), 2022, pp. 6241–6249

work page 2022
[65]

Exploring cross-video and cross-modality signals for weakly-supervised audio-visual video parsing,

Y .-B. Lin, H.-Y . Tseng, H.-Y . Lee, Y .-Y . Lin, and M.-H. Yang, “Exploring cross-video and cross-modality signals for weakly-supervised audio-visual video parsing,” inProc. Adv. Neural Inform. Process. Syst. (NeurIPS), 2021, pp. 11 449–11 461

work page 2021
[66]

Dhhn: Dual hierarchical hybrid network for weakly-supervised audio-visual video parsing,

X. Jiang, X. Xu, Z. Chen, J. Zhang, J. Song, F. Shen, H. Lu, and H. T. Shen, “Dhhn: Dual hierarchical hybrid network for weakly-supervised audio-visual video parsing,” inProc. ACM Int. Conf. Multimedia (ACM MM), 2022, pp. 719–727

work page 2022
[67]

Boosting positive segments for weakly- supervised audio-visual video parsing,

K. K. Rachavarapuet al., “Boosting positive segments for weakly- supervised audio-visual video parsing,” inProc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2023, pp. 10 192–10 202

work page 2023
[68]

Multi-modal and multi-scale temporal fusion architecture search for audio-visual video parsing,

J. Zhang and W. Li, “Multi-modal and multi-scale temporal fusion architecture search for audio-visual video parsing,” inProc. ACM Int. Conf. Multimedia (ACM MM), 2023, pp. 3328–3336

work page 2023
[69]

Label- anticipated event disentanglement for audio-visual video parsing,

J. Zhou, D. Guo, Y . Mao, Y . Zhong, X. Chang, and M. Wang, “Label- anticipated event disentanglement for audio-visual video parsing,” in Proc. Europ. Conf. Comput. Vis. (ECCV), 2024, pp. 35–51

work page 2024
[70]

Resisting noise in pseudo labels: Audible video event parsing with evidential learning,

X. Jiang, X. Xu, L. Zhu, Z. Sun, A. Cichocki, and H. T. Shen, “Resisting noise in pseudo labels: Audible video event parsing with evidential learning,”IEEE Trans. Neural Netw. Learn. Syst., vol. 36, no. 6, pp. 10 874–10 888, 2024. Huilai Lireceived the B.Eng. degree in 2021 from the Zhengzhou University, Zhengzhou, China. He is currently working toward th...

work page 2024
[71]

Her research interests include software testing and vulnerability detection

She is currently an associate professor with the School of Intelligent Engineering and Automation, Beijing University of Posts and Telecommunications. Her research interests include software testing and vulnerability detection. Yonghao Dangreceived Bachelor degree in computer science and technology from the University of Jinan, Jinan, China, in 2018, and ...

work page 2018
[72]

Yiming Wanggraduated from Hebei GEO University with a Bachelor degree, Shijiazhuang, China

His research interests include computer vision, image processing, and deep learning. Yiming Wanggraduated from Hebei GEO University with a Bachelor degree, Shijiazhuang, China. He is currently working toward the Ph.D. degree with the School of Intelligent Engineering and Automation, Beijing University of Posts and Telecommunications, Beijing, China. His r...

work page
[73]

AMDF" denotes the Asymmetric Audio/Visual- Driven Fusion module, and

She is currently a Professor with the School of Intelligent Engineering and Automation, Beijing University of Posts and Telecommunications, Beijing, China. Her research interests include service robots, pattern recognition, machine learning, and image processing. 13 SUPPLEMENTARYMATERIAL In the supplementary material, we provide detailed informa- tion abo...

work page