Recognition: 2 theorem links
EAR: Enhancing Uni-Modal Representations for Weakly Supervised Audio-Visual Video Parsing
Pith reviewed 2026-05-12 02:59 UTC · model grok-4.3
The pith
Enhancing uni-modal representations improves weakly supervised audio-visual video parsing by better handling unaligned signals.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish that enhancing uni-modal representations for both the pseudo-label generator and the AVVP model, via similarity-based label migration to annotate pre-training data and soft-constrained uni-modal feature modeling alongside multi-modal fusion, enables coordinated attention to uni-modal and cross-modal representations and thereby improves localization of audio, visual, and audio-visual events.
What carries the argument
The EAR framework, which applies similarity-based label migration for uni-modal event annotation in pre-training and soft constraints to refine uni-modal features during multi-modal fusion.
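The label-migration idea can be made concrete with a small sketch. This is not the paper's implementation: the function name, the fixed similarity threshold, and the masking by video-level labels are assumptions used for illustration; the paper only states that migration is similarity-based.

```python
import numpy as np

def migrate_labels(seg_feats, class_embeds, video_labels, thresh=0.5):
    """Hypothetical sketch of similarity-based label migration: a segment
    receives a class as a pseudo-label when its feature is cosine-similar
    to that class embedding. `thresh` is an assumed hyperparameter."""
    # L2-normalize rows so the dot product equals cosine similarity.
    f = seg_feats / np.linalg.norm(seg_feats, axis=1, keepdims=True)
    c = class_embeds / np.linalg.norm(class_embeds, axis=1, keepdims=True)
    sim = f @ c.T                          # (num_segments, num_classes)
    pseudo = (sim >= thresh).astype(float)
    # Only classes present in the weak video-level label may be migrated.
    return pseudo * video_labels[None, :]

# Toy example: 2 segments, 3 classes; the video-level label contains
# classes 0 and 2, so class 1 can never be migrated to a segment.
feats = np.array([[1.0, 0.0], [0.0, 1.0]])
classes = np.array([[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]])
labels = np.array([1.0, 0.0, 1.0])
p = migrate_labels(feats, classes, labels)
```

The masking step is the key design choice: segment-level pseudo-labels stay consistent with the weak supervision instead of introducing classes the video was never tagged with.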
If this is right
- The pseudo-label generator gains a better understanding of uni-modal events through similarity-based annotation of pre-training data.
- The AVVP model refines uni-modal feature modeling in parallel with multi-modal fusion via soft constraints.
- Coordinated attention to both uni-modal and cross-modal representations improves localization of audio, visual, and joint events.
- The overall method outperforms prior state-of-the-art approaches in both pseudo-label quality and full AVVP performance.
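The soft-constraint point above can be sketched as a training objective: the fused prediction carries the main weak supervision, while the audio and visual branches are softly pulled toward the same video-level labels. The loss shape, the weight `lam`, and the plain BCE form are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def bce(pred, target, eps=1e-7):
    """Binary cross-entropy over class probabilities."""
    p = np.clip(pred, eps, 1 - eps)
    return float(-np.mean(target * np.log(p) + (1 - target) * np.log(1 - p)))

def soft_constrained_loss(fused_pred, audio_pred, visual_pred, labels, lam=0.3):
    """Hypothetical sketch: the fused branch is supervised directly, and the
    uni-modal branches are regularized toward the same weak labels with a
    soft weight `lam` (an assumed hyperparameter)."""
    main = bce(fused_pred, labels)
    soft = bce(audio_pred, labels) + bce(visual_pred, labels)
    return main + lam * soft

# Toy usage: near-correct predictions on a 3-class video label.
labels = np.array([1.0, 0.0, 1.0])
good = np.array([0.95, 0.05, 0.9])
base = soft_constrained_loss(good, good, good, labels, lam=0.0)
full = soft_constrained_loss(good, good, good, labels, lam=0.3)
```

Because the constraint is soft, the uni-modal branches are shaped by the labels without being forced to match the fused prediction exactly, which is the claimed mechanism for preserving uni-modal semantics during fusion.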
Where Pith is reading between the lines
- The same emphasis on preserving uni-modal semantics could be applied to other weakly supervised multi-modal tasks where signals are frequently unaligned.
- Focusing first on individual modalities before fusion may reduce reliance on large volumes of precisely aligned training data.
- Performance gains may vary with the degree of audio-visual misalignment in test videos, suggesting targeted evaluation on alignment-stratified datasets.
Load-bearing premise
Accurate video parsing fundamentally requires precise perception of uni-modal events, whose semantics existing multi-modal strategies fail to guide and preserve because they overemphasize fusion.
What would settle it
An ablation experiment on standard AVVP benchmarks that removes the similarity-based label migration and soft-constrained uni-modal components and measures whether pseudo-label accuracy and event localization drop to or below prior state-of-the-art levels.
Original abstract
Weakly supervised Audio-Visual Video Parsing (AVVP) aims to recognize and temporally localize audio, visual, and audio-visual events in videos using only coarse-grained labels. Faced with the challenging task settings, existing research advances along two main paths: pre-training pseudo-label generators for fine-grained cross-modal semantic guidance, or refining AVVP model architectures to enhance audio-visual fusion. However, since audio and visual signals are typically unaligned, achieving accurate video parsing fundamentally relies on precise perception of uni-modal events. Yet these multi-modal focused strategies excessively emphasize multi-modal fusion while inadequately guiding and preserving uni-modal semantics, resulting in noisy pseudo-labels and sub-optimal video parsing performance. This paper proposes a novel framework that enhances uni-modal representations for both the pseudo-label generator and the AVVP model. Specifically, we introduce a similarity-based label migration approach to annotate pre-training data, thereby enabling the pseudo-label generator to better understand uni-modal events. We also employ a soft-constrained manner to refine modeling of uni-modal features in parallel with multi-modal fusion. These designs enable coordinated attention to both uni-modal and cross-modal representations, thus boosting the localization performance for events. Extensive experiments show that our method outperforms state-of-the-art methods in both pseudo-label and AVVP performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes EAR, a framework for weakly supervised Audio-Visual Video Parsing (AVVP) that enhances uni-modal representations in both the pseudo-label generator (via similarity-based label migration for pre-training data annotation) and the AVVP model (via soft-constrained uni-modal feature modeling in parallel with multi-modal fusion). The central claim is that this coordinated uni-modal and cross-modal approach yields superior pseudo-label quality and AVVP performance compared to prior methods that over-emphasize fusion.
Significance. If the results hold, the work usefully shifts focus toward preserving uni-modal semantics in unaligned audio-visual settings, where precise event perception is foundational. It offers a practical way to mitigate noisy pseudo-labels without discarding fusion benefits, which could inform more balanced architectures in multi-modal video understanding.
Major comments (1)
- [Experimental results] (likely §4 and associated tables): the reported gains in aggregate AVVP F1 and pseudo-label accuracy are not accompanied by separate audio-only or visual-only localization metrics, nor by ablations that hold the fusion module fixed while varying only the uni-modal components. Without these, it remains unclear whether the headline improvements stem from the claimed uni-modal enhancements or from incidental fusion refinements, directly undermining attribution to the core premise.
Simulated Author's Rebuttal
We thank the referee for their thoughtful review and constructive feedback on our manuscript. We address the major comment below and are prepared to revise the paper accordingly.
Point-by-point responses
Referee: Experimental results (likely §4 and associated tables): the reported gains in aggregate AVVP F1 and pseudo-label accuracy are not accompanied by separate audio-only or visual-only localization metrics, nor by ablations that hold the fusion module fixed while varying only the uni-modal components. Without these, it remains unclear whether the headline improvements stem from the claimed uni-modal enhancements or from incidental fusion refinements, directly undermining attribution to the core premise.
Authors: We agree that additional granularity in the results would strengthen attribution to the uni-modal components. Although the EAR framework introduces explicit uni-modal enhancements (similarity-based label migration in the pseudo-label generator and soft-constrained uni-modal feature modeling parallel to fusion), the current experiments emphasize aggregate AVVP F1 and overall pseudo-label accuracy. In the revised manuscript, we will report separate audio-only and visual-only localization metrics, and we will add ablations that hold the fusion module fixed while varying only the uni-modal components. These changes will directly demonstrate the contribution of the uni-modal enhancements and clarify that the gains are not incidental to fusion refinements.
Revision: yes
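The per-modality metrics the authors commit to could take roughly this form: a segment-level F1 computed independently for the audio and visual streams. This is a hedged sketch of the kind of metric involved; the official AVVP evaluation protocol has additional detail (event-level scoring, tolerance rules) that is not reproduced here.

```python
import numpy as np

def segment_f1(pred, gt):
    """Segment-level F1 for one modality. `pred` and `gt` are binary
    (num_segments, num_classes) matrices; each cell marks whether a class
    is active in that segment. A simplified stand-in for the AVVP metric."""
    tp = float(np.sum(pred * gt))
    fp = float(np.sum(pred * (1 - gt)))
    fn = float(np.sum((1 - pred) * gt))
    if tp == 0.0:
        return 0.0
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

# Toy usage: perfect predictions vs. fully wrong predictions.
pred = np.array([[1.0, 0.0], [0.0, 1.0]])
perfect = segment_f1(pred, pred)
worst = segment_f1(pred, 1 - pred)
```

Reporting this quantity separately on audio-only and visual-only ground truth, with the fusion module held fixed, is exactly the attribution experiment the referee asks for.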
Circularity Check
No significant circularity; empirical method proposal without self-referential derivations or fitted predictions.
Full rationale
The paper proposes a framework for uni-modal enhancement in weakly supervised AVVP via similarity-based label migration and soft-constrained modeling, supported by experimental outperformance claims. No equations, parameter-fitting steps presented as predictions, self-citations as load-bearing premises, or uniqueness theorems appear in the provided content. The central claims rest on empirical results rather than any derivation chain that reduces to its own inputs by construction, making the work self-contained against external benchmarks.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
  The relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "similarity-based label migration... cosine similarity matrix... asymmetric audio/visual-driven fusion... multi-event relationship modeling"
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat recovery · tag: unclear
  The relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "soft-constrained uni-modal modeling... BCE loss... MMIL pooling"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.