pith. sign in

arxiv: 2606.07033 · v1 · pith:I5TRUSWQnew · submitted 2026-06-05 · 💻 cs.AI · cs.CV

Hierarchical Semantic-Constrained Heterogeneous Graph for Audio-Visual Event Localization

Pith reviewed 2026-06-27 22:19 UTC · model grok-4.3

classification 💻 cs.AI cs.CV
keywords open-vocabulary audio-visual event localizationheterogeneous graphhyperbolic spacebidirectional semantic constraintshierarchical entailment losstemporal localizationcross-modal fusion
0
0 comments X

The pith

A heterogeneous graph with bidirectional semantic constraints and hyperbolic mapping maintains consistency for audio-visual events unseen in training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a framework to address open-vocabulary audio-visual event localization, where the model must recognize and localize events from categories absent during training. It builds a hierarchical graph connecting audio and visual segment nodes to their video-level counterparts, using multi-directional temporal edges within modalities and a gated fusion step that adds cross-modal information only under high alignment confidence. Bidirectional semantic constraints link segment- and video-level representations, after which the representations are mapped into hyperbolic space and regularized by a hierarchical entailment loss. A sympathetic reader would care because existing Euclidean-space methods lose consistency across temporal scales and semantic levels precisely when supervision for the target category is missing.

Core claim

The authors claim that a hierarchical semantic-constrained heterogeneous graph constructed in Euclidean space, equipped with bidirectional constraints between segment- and video-level nodes and a subsequent mapping to hyperbolic space regularized by a hierarchical entailment loss, produces the semantic consistency across levels and modalities required for open-vocabulary audio-visual event localization, yielding higher performance than prior methods on the OV-AVEL benchmark.

What carries the argument

The hierarchical semantic-constrained heterogeneous graph (HSCHG), which links audio-visual segment nodes to video-level nodes through temporal and cross-modal edges while enforcing bidirectional semantic constraints and hyperbolic hierarchical relationships via an entailment loss.

If this is right

  • The model integrates cross-modal cues selectively via dual-threshold gated fusion rather than unconditionally.
  • Semantic consistency is enforced both within each modality across time and between segment and video levels.
  • Hierarchical relationships between video and segment representations are captured explicitly in hyperbolic space.
  • The resulting representations support localization of events from categories absent from the training set.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same combination of Euclidean graph construction and hyperbolic hierarchical regularization could be tested on other multi-modal temporal tasks such as audio-visual captioning or action segmentation with novel classes.
  • The dual-threshold filtering step may require domain-specific retuning of the thresholds when moving from the OV-AVEL benchmark to new data distributions.
  • If the entailment loss proves robust, it could serve as a drop-in regularizer for other open-vocabulary multi-modal problems that already use hyperbolic embeddings.

Load-bearing premise

That bidirectional semantic constraints together with the hierarchical entailment loss in hyperbolic space will enforce consistency even when the test events belong to categories that supplied no training examples.

What would settle it

An ablation on the OV-AVEL benchmark in which removing the bidirectional constraints or the hyperbolic entailment loss causes localization accuracy on unseen categories to fall to the level of existing Euclidean baselines.

Figures

Figures reproduced from arXiv: 2606.07033 by Hengyu Man, Hongtao Chen, Ruyi Zhang, Wangmeng Zuo, Wenrui Li, Xiaopeng Fan, Zhe Yang.

Figure 1
Figure 1. Figure 1: An illustration of the heterogeneous hierarchical graph in HSCHG. It [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: An illustration of OV-AVEL. The model jointly leverages segment-level audio and visual cues to determine whether a target event that is both [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Architecture of HSCHG. We first extract features from audio, visual, and text inputs using a frozen pre-trained model. Subsequently, a heterogeneous [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Hyperparameter analysis on OV-AVEBench dataset. [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative results of our model. The red regions stand for the answer we predict. [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Audio-visual cross-modal cosine similarity heatmap: The bright main [PITH_FULL_IMAGE:figures/full_fig_p010_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: T-SNE visualization of audio and visual feature distributions at different stages of HSCHG. [PITH_FULL_IMAGE:figures/full_fig_p011_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: The HSCHG embedding is used to represent the OV-AVEBench dataset on the Poincar [PITH_FULL_IMAGE:figures/full_fig_p011_8.png] view at source ↗
read the original abstract

Open-vocabulary audio-visual event localization (OV-AVEL) jointly models audio-visual cues to recognize and temporally localize events, including categories unseen during training. Existing methods primarily learn joint audio-visual representations in Euclidean space, but still face two significant challenges. First, the lack of supervision signals for unseen categories makes it difficult to maintain audio-visual consistency across multiple temporal scales. Second, the lack of hierarchical constraints between segment- and video-level semantics prevents the model from establishing semantic consistency across different levels. To address these challenges, we propose a hierarchical semantic constrained heterogeneous graph (HSCHG) for audio-visual event localization framework. We first construct a heterogeneous hierarchical graph in Euclidean space, which includes audio and visual segment nodes and their corresponding video-level nodes. We use multi-directional temporal edges to capture complete temporal information within each modality. Simultaneously, we employ a dual-threshold filtering gated fusion strategy, introducing cross-modal information only when the alignment confidence is high. Furthermore, we introduce bidirectional semantic constraints between segment- and video-level representations to achieve semantic consistency across different levels. Based on this, we map the multi-level audio-visual representations and text prototypes uniformly into hyperbolic space. We use a hierarchical entailment regularization loss to characterize the hierarchical relationships between videos and segments. Extensive experimental results show that our method outperforms existing methods on the OV-AVEL benchmark. Ablation studies further validate the effectiveness of our method.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes the Hierarchical Semantic-Constrained Heterogeneous Graph (HSCHG) framework for open-vocabulary audio-visual event localization (OV-AVEL). It constructs a heterogeneous hierarchical graph in Euclidean space with audio/visual segment and video-level nodes, multi-directional temporal edges, dual-threshold gated fusion for cross-modal information, bidirectional semantic constraints between levels, then maps representations and text prototypes to hyperbolic space with a hierarchical entailment regularization loss. The central claim is outperformance over existing methods on the OV-AVEL benchmark, with ablations validating component effectiveness.

Significance. If the results hold, the combination of heterogeneous graph structure with hyperbolic-space hierarchical entailment offers a structured way to enforce cross-level and cross-modal semantic consistency for unseen categories. The explicit treatment of both temporal multi-directionality and level-wise constraints is a positive step beyond standard Euclidean joint embeddings.

major comments (2)
  1. [Experiments] Experiments section: the outperformance claim on OV-AVEL for unseen categories rests on the full model but provides no ablation isolating the hierarchical entailment regularization loss (and hyperbolic mapping) from the Euclidean heterogeneous graph plus bidirectional constraints alone. This is load-bearing for the open-vocabulary result, as the graph structure may suffice without the hyperbolic component.
  2. [Method] Method section (proposed framework): the mechanism for generating text prototypes for unseen categories is not specified in sufficient detail to rule out supervision leakage, which directly affects the validity of the open-vocabulary consistency claim.
minor comments (1)
  1. Abstract: no quantitative results, error bars, dataset statistics, or baseline numbers are supplied, which weakens the reader's ability to gauge the practical magnitude of the reported gains.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and positive assessment of the framework's potential significance. We address each major comment below.

read point-by-point responses
  1. Referee: [Experiments] Experiments section: the outperformance claim on OV-AVEL for unseen categories rests on the full model but provides no ablation isolating the hierarchical entailment regularization loss (and hyperbolic mapping) from the Euclidean heterogeneous graph plus bidirectional constraints alone. This is load-bearing for the open-vocabulary result, as the graph structure may suffice without the hyperbolic component.

    Authors: We agree that an explicit ablation isolating the hierarchical entailment regularization loss and hyperbolic mapping is necessary to substantiate the contribution to open-vocabulary performance. Our current ablations evaluate the heterogeneous graph construction, multi-directional edges, dual-threshold fusion, and bidirectional semantic constraints, but do not include a direct Euclidean-only variant without the hyperbolic component. We will add this ablation study to the revised Experiments section, reporting performance on the OV-AVEL unseen categories for the full model versus the Euclidean graph plus bidirectional constraints alone. revision: yes

  2. Referee: [Method] Method section (proposed framework): the mechanism for generating text prototypes for unseen categories is not specified in sufficient detail to rule out supervision leakage, which directly affects the validity of the open-vocabulary consistency claim.

    Authors: We acknowledge that additional detail is required. Text prototypes for unseen categories are obtained by passing the raw category name strings through a frozen pre-trained CLIP text encoder with no fine-tuning or exposure to the OV-AVEL training data or labels. This design precludes supervision leakage. We will expand the Method section with a new subsection explicitly describing the text prototype generation pipeline, the encoder choice, its frozen status, and the absence of any target-dataset supervision. revision: yes

Circularity Check

0 steps flagged

No circularity: method description and experimental claims are self-contained without reductions to fitted inputs or self-referential definitions.

full rationale

The paper proposes a heterogeneous graph construction in Euclidean space followed by hyperbolic mapping and a hierarchical entailment loss, then reports outperformance on the OV-AVEL benchmark via experiments and ablations. No equations, parameter-fitting steps, or derivations are exhibited that reduce a claimed prediction or result to an input quantity by construction. The bidirectional constraints and entailment regularization are presented as architectural choices whose contribution is assessed externally via ablation, not defined in terms of the target consistency metric. Self-citations are absent from the provided text, and the open-vocabulary claim rests on benchmark evaluation rather than any self-definitional loop. This is the normal case of an empirical ML architecture paper whose central claims remain falsifiable outside its own fitted values.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, mathematical axioms, or invented entities; the described components (graphs, thresholds, loss) are treated as modeling choices rather than new postulates with independent evidence.

pith-pipeline@v0.9.1-grok · 5796 in / 1243 out tokens · 21781 ms · 2026-06-27T22:19:39.422858+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

43 extracted references · 1 linked inside Pith

  1. [1]

    Multi-layer probabilistic association reasoning network for image-text retrieval.IEEE Transac- tions on Circuits and Systems for Video Technology, 34(10):9706–9717, 2024

    Wenrui Li, Ruiqin Xiong, and Xiaopeng Fan. Multi-layer probabilistic association reasoning network for image-text retrieval.IEEE Transac- tions on Circuits and Systems for Video Technology, 34(10):9706–9717, 2024

  2. [2]

    Fedpam: Federated personalized augmentation model for text- to-image retrieval

    Yueying Feng, Fan Ma, Wang Lin, Chang Yao, Jingyuan Chen, and Yi Yang. Fedpam: Federated personalized augmentation model for text- to-image retrieval. InProceedings of the 2024 International Conference on Multimedia Retrieval, pages 1185–1189, 2024

  3. [3]

    Neuron-based spiking transmission and reasoning network for robust image-text retrieval.IEEE Transactions on Circuits and Systems for Video Technology, 33(7):3516–3528, 2023

    Wenrui Li, Zhengyu Ma, Liang-Jian Deng, Xiaopeng Fan, and Yonghong Tian. Neuron-based spiking transmission and reasoning network for robust image-text retrieval.IEEE Transactions on Circuits and Systems for Video Technology, 33(7):3516–3528, 2023

  4. [4]

    Reservoir computing transformer for image- text retrieval

    Wenrui Li, Zhengyu Ma, Liang-Jian Deng, Penghong Wang, Jinqiao Shi, and Xiaopeng Fan. Reservoir computing transformer for image- text retrieval. InProceedings of the 31st ACM International Conference on Multimedia, pages 5605–5613, 2023. JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 10, JULY 2024 12

  5. [5]

    Text-video retrieval with global-local semantic consistent learning.IEEE Transactions on Image Processing, 34:3463–3474, 2025

    Haonan Zhang, Pengpeng Zeng, Lianli Gao, Jingkuan Song, Yihang Duan, Xinyu Lyu, and Heng Tao Shen. Text-video retrieval with global-local semantic consistent learning.IEEE Transactions on Image Processing, 34:3463–3474, 2025

  6. [6]

    Ump: Unified modality-aware prompt tuning for text- video retrieval.IEEE Transactions on Circuits and Systems for Video Technology, 34(11):11954–11964, 2024

    Haonan Zhang, Pengpeng Zeng, Lianli Gao, Jingkuan Song, and Heng Tao Shen. Ump: Unified modality-aware prompt tuning for text- video retrieval.IEEE Transactions on Circuits and Systems for Video Technology, 34(11):11954–11964, 2024

  7. [7]

    Exploiting unlabeled videos for video-text retrieval via pseudo-supervised learning.IEEE Transactions on Image Processing, 33:6748–6760, 2024

    Yu Lu, Ruijie Quan, Linchao Zhu, and Yi Yang. Exploiting unlabeled videos for video-text retrieval via pseudo-supervised learning.IEEE Transactions on Image Processing, 33:6748–6760, 2024

  8. [8]

    Zero-shot video ground- ing with pseudo query lookup and verification.IEEE Transactions on Image Processing, 33:1643–1654, 2024

    Yu Lu, Ruijie Quan, Linchao Zhu, and Yi Yang. Zero-shot video ground- ing with pseudo query lookup and verification.IEEE Transactions on Image Processing, 33:1643–1654, 2024

  9. [9]

    Aicl: Action in-context learning for text-to-video generation

    Jianzhi Liu, Junchen Zhu, Pengpeng Zeng, Lianli Gao, Heng Tao Shen, and Jingkuan Song. Aicl: Action in-context learning for text-to-video generation. InProceedings of the 33rd ACM International Conference on Multimedia, pages 9414–9423, 2025

  10. [10]

    Spiking tucker fusion transformer for audio-visual zero-shot learning.IEEE Transactions on Image Processing, 33:4840–4852, 2024

    Wenrui Li, Penghong Wang, Ruiqin Xiong, and Xiaopeng Fan. Spiking tucker fusion transformer for audio-visual zero-shot learning.IEEE Transactions on Image Processing, 33:4840–4852, 2024

  11. [11]

    Multi-timescale motion-decoupled spiking transformer for audio-visual zero-shot learning.IEEE Transactions on Circuits and Systems for Video Technology, pages 1–1, 2025

    Wenrui Li, Penghong Wang, Xingtao Wang, Wangmeng Zuo, Xiaopeng Fan, and Yonghong Tian. Multi-timescale motion-decoupled spiking transformer for audio-visual zero-shot learning.IEEE Transactions on Circuits and Systems for Video Technology, pages 1–1, 2025

  12. [12]

    Motion-decoupled spiking transformer for audio- visual zero-shot learning

    Wenrui Li, Xi-Le Zhao, Zhengyu Ma, Xingtao Wang, Xiaopeng Fan, and Yonghong Tian. Motion-decoupled spiking transformer for audio- visual zero-shot learning. InProceedings of the 31st ACM International Conference on Multimedia, pages 3994–4002, 2023

  13. [13]

    Multi-modal spiking tensor regression network for audio-visual zero-shot learning

    Zhe Yang, Wenrui Li, Jinxiu Hou, and Guanghui Cheng. Multi-modal spiking tensor regression network for audio-visual zero-shot learning. Neurocomputing, 629:129636, 2025

  14. [14]

    Spiking variational graph representation inference for video summarization.IEEE Transactions on Image Processing, 34:5697–5709, 2025

    Wenrui Li, Wei Han, Liang-Jian Deng, Ruiqin Xiong, and Xiaopeng Fan. Spiking variational graph representation inference for video summarization.IEEE Transactions on Image Processing, 34:5697–5709, 2025

  15. [16]

    Dis- criminative cross-modality attention network for temporal inconsistent audio-visual event localization.IEEE Transactions on Image Processing, 30:7878–7888, 2021

    Hanyu Xuan, Lei Luo, Zhenyu Zhang, Jian Yang, and Yan Yan. Dis- criminative cross-modality attention network for temporal inconsistent audio-visual event localization.IEEE Transactions on Image Processing, 30:7878–7888, 2021

  16. [17]

    Cross-modal background suppression for audio-visual event localization

    Yan Xia and Zhou Zhao. Cross-modal background suppression for audio-visual event localization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19989– 19998, June 2022

  17. [18]

    Dense modality interaction network for audio-visual event localization.IEEE Transactions on Multimedia, 25:2734–2748, 2023

    Shuo Liu, Weize Quan, Chaoqun Wang, Yuan Liu, Bin Liu, and Dong- Ming Yan. Dense modality interaction network for audio-visual event localization.IEEE Transactions on Multimedia, 25:2734–2748, 2023

  18. [19]

    Learning event-specific localization preferences for audio- visual event localization

    Shiping Ge, Zhiwei Jiang, Yafeng Yin, Cong Wang, Zifeng Cheng, and Qing Gu. Learning event-specific localization preferences for audio- visual event localization. InProceedings of the 31st ACM International Conference on Multimedia, pages 3446–3454, 2023

  19. [20]

    Towards open-vocabulary audio-visual event localization

    Jinxing Zhou, Dan Guo, Ruohao Guo, Yuxin Mao, Jingjing Hu, Yiran Zhong, Xiaojun Chang, and Meng Wang. Towards open-vocabulary audio-visual event localization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8362– 8371, June 2025

  20. [21]

    Ov-davel: Towards open-vocabulary dense audio-visual event localization in untrimmed videos

    Jiale Yu, Baopeng Zhang, Zhu Teng, and Jianping Fan. Ov-davel: Towards open-vocabulary dense audio-visual event localization in untrimmed videos. InProceedings of the 33rd ACM International Conference on Multimedia, pages 553–562, 2025

  21. [22]

    Poincar ´e embeddings for learn- ing hierarchical representations

    Maximillian Nickel and Douwe Kiela. Poincar ´e embeddings for learn- ing hierarchical representations. InAdvances in Neural Information Processing Systems, volume 30, 2017

  22. [23]

    Hyperbolic audio-visual zero-shot learning

    Jie Hong, Zeeshan Hayder, Junlin Han, Pengfei Fang, Mehrtash Harandi, and Lars Petersson. Hyperbolic audio-visual zero-shot learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7873–7883, October 2023

  23. [24]

    Shmamba: Structured hyperbolic state space model for audio-visual question answering.IEEE Transactions on Audio, Speech and Language Processing, 33:3582– 3593, 2025

    Zhe Yang, Wenrui Li, and Guanghui Cheng. Shmamba: Structured hyperbolic state space model for audio-visual question answering.IEEE Transactions on Audio, Speech and Language Processing, 33:3582– 3593, 2025

  24. [25]

    Css-net: A consistent segment selection network for audio-visual event localization

    Fan Feng, Yue Ming, Nannan Hu, Hui Yu, and Yuanan Liu. Css-net: A consistent segment selection network for audio-visual event localization. IEEE Transactions on Multimedia, 26:701–713, 2024

  25. [26]

    Listen with seeing: Cross-modal contrastive learning for audio- visual event localization.IEEE Transactions on Multimedia, 27:2650– 2665, 2025

    Chao Sun, Min Chen, Chuanbo Zhu, Sheng Zhang, Ping Lu, and Jincai Chen. Listen with seeing: Cross-modal contrastive learning for audio- visual event localization.IEEE Transactions on Multimedia, 27:2650– 2665, 2025

  26. [27]

    Dense audio-visual event localization under cross-modal consistency and multi-temporal granularity collaboration

    Ziheng Zhou, Jinxing Zhou, Wei Qian, Shengeng Tang, Xiaojun Chang, and Dan Guo. Dense audio-visual event localization under cross-modal consistency and multi-temporal granularity collaboration. InProceed- ings of the AAAI Conference on Artificial Intelligence, volume 39, pages 10905–10913, April 2025

  27. [28]

    Fine-grained audio–visual event localization.IEEE Transactions on Neural Networks and Learning Systems, pages 1–15, 2025

    Baoyu Fan, Lu Liu, Xiaochuan Li, Runze Zhang, Liang Jin, and Jin Zhang. Fine-grained audio–visual event localization.IEEE Transactions on Neural Networks and Learning Systems, pages 1–15, 2025

  28. [29]

    Hierarchical attention networks for document classification

    Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Ed- uard Hovy. Hierarchical attention networks for document classification. InProceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489, 2016

  29. [30]

    Hierarchical self-attention network for action localization in videos

    Rizard Renanda Adhi Pramono, Yie-Tarng Chen, and Wen-Hsien Fang. Hierarchical self-attention network for action localization in videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, October 2019

  30. [31]

    Learning from untrimmed videos: Self-supervised video representation learning with hierarchical consistency

    Zhiwu Qing, Shiwei Zhang, Ziyuan Huang, Yi Xu, Xiang Wang, Mingqian Tang, Changxin Gao, Rong Jin, and Nong Sang. Learning from untrimmed videos: Self-supervised video representation learning with hierarchical consistency. InProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition, pages 13821–13831, June 2022

  31. [32]

    Hyperbolic entailment cones for learning hierarchical embeddings

    Octavian Ganea, Gary Becigneul, and Thomas Hofmann. Hyperbolic entailment cones for learning hierarchical embeddings. InProceedings of the 35th International Conference on Machine Learning, pages 1646– 1655, July 2018

  32. [33]

    Compositional entailment learning for hyperbolic vision-language models

    Avik Pal, Max van Spengler, Guido Maria D’Amely di Melendugno, Alessandro Flaborea, Fabio Galasso, and Pascal Mettes. Compositional entailment learning for hyperbolic vision-language models. InThe Thirteenth International Conference on Learning Representations, 2025

  33. [34]

    Language-guided graph representation learning for video summarization.IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–17, 2025

    Wenrui Li, Wei Han, Hengyu Man, Wangmeng Zuo, Xiaopeng Fan, and Yonghong Tian. Language-guided graph representation learning for video summarization.IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–17, 2025

  34. [35]

    Hyperbolic geometry.Flavors of Geometry, 31:59–115, 1997

    James W Cannon, William J Floyd, Richard Kenyon, Walter R Parry, et al. Hyperbolic geometry.Flavors of Geometry, 31:59–115, 1997

  35. [36]

    Inferring concept hierarchies from text corpora via hyperbolic embeddings

    Matthew Le, Stephen Roller, Laetitia Papaxanthos, Douwe Kiela, and Maximilian Nickel. Inferring concept hierarchies from text corpora via hyperbolic embeddings. InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3231–3241, July 2019

  36. [37]

    Cross-modal relation-aware networks for audio-visual event localization

    Haoming Xu, Runhao Zeng, Qingyao Wu, Mingkui Tan, and Chuang Gan. Cross-modal relation-aware networks for audio-visual event localization. InProceedings of the 28th ACM International Conference on Multimedia, pages 3893–3901, 2020

  37. [38]

    Audio-visual event localization in unconstrained videos

    Yapeng Tian, Jing Shi, Bochen Li, Zhiyao Duan, and Chenliang Xu. Audio-visual event localization in unconstrained videos. InProceedings of the European Conference on Computer Vision, September 2018

  38. [39]

    Positive sample propagation along the audio-visual event line

    Jinxing Zhou, Liang Zheng, Yiran Zhong, Shijie Hao, and Meng Wang. Positive sample propagation along the audio-visual event line. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8436–8444, June 2021

  39. [40]

    Mm-pyramid: Multimodal pyramid attentional network for audio-visual event localization and video parsing

    Jiashuo Yu, Ying Cheng, Rui-Wei Zhao, Rui Feng, and Yuejie Zhang. Mm-pyramid: Multimodal pyramid attentional network for audio-visual event localization and video parsing. InProceedings of the 30th ACM International Conference on Multimedia, pages 6241–6249, 2022

  40. [41]

    Imagebind: One embedding space to bind them all

    Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. Imagebind: One embedding space to bind them all. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15180–15190, June 2023

  41. [42]

    Visualizing data using t-sne.Journal of Machine Learning Research, 9(Nov):2579–2605, 2008

    Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne.Journal of Machine Learning Research, 9(Nov):2579–2605, 2008

  42. [43]

    Hyperbolic-constraint point cloud reconstruction from single rgb-d images

    Wenrui Li, Zhe Yang, Wei Han, Hengyu Man, Xingtao Wang, and Xiaopeng Fan. Hyperbolic-constraint point cloud reconstruction from single rgb-d images. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 4959–4967, April 2025

  43. [44]

    Umap: Uniform manifold approximation and projection for dimension reduction.arXiv preprint arXiv:1802.03426, 2020

    Leland McInnes, John Healy, and James Melville. Umap: Uniform manifold approximation and projection for dimension reduction.arXiv preprint arXiv:1802.03426, 2020