Hierarchical Semantic-Constrained Heterogeneous Graph for Audio-Visual Event Localization

Hengyu Man; Hongtao Chen; Ruyi Zhang; Wangmeng Zuo; Wenrui Li; Xiaopeng Fan; Zhe Yang

arxiv: 2606.07033 · v1 · pith:I5TRUSWQnew · submitted 2026-06-05 · 💻 cs.AI · cs.CV

Hierarchical Semantic-Constrained Heterogeneous Graph for Audio-Visual Event Localization

Zhe Yang , Ruyi Zhang , Hongtao Chen , Wenrui Li , Hengyu Man , Wangmeng Zuo , Xiaopeng Fan This is my paper

Pith reviewed 2026-06-27 22:19 UTC · model grok-4.3

classification 💻 cs.AI cs.CV

keywords open-vocabulary audio-visual event localizationheterogeneous graphhyperbolic spacebidirectional semantic constraintshierarchical entailment losstemporal localizationcross-modal fusion

0 comments

The pith

A heterogeneous graph with bidirectional semantic constraints and hyperbolic mapping maintains consistency for audio-visual events unseen in training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a framework to address open-vocabulary audio-visual event localization, where the model must recognize and localize events from categories absent during training. It builds a hierarchical graph connecting audio and visual segment nodes to their video-level counterparts, using multi-directional temporal edges within modalities and a gated fusion step that adds cross-modal information only under high alignment confidence. Bidirectional semantic constraints link segment- and video-level representations, after which the representations are mapped into hyperbolic space and regularized by a hierarchical entailment loss. A sympathetic reader would care because existing Euclidean-space methods lose consistency across temporal scales and semantic levels precisely when supervision for the target category is missing.

Core claim

The authors claim that a hierarchical semantic-constrained heterogeneous graph constructed in Euclidean space, equipped with bidirectional constraints between segment- and video-level nodes and a subsequent mapping to hyperbolic space regularized by a hierarchical entailment loss, produces the semantic consistency across levels and modalities required for open-vocabulary audio-visual event localization, yielding higher performance than prior methods on the OV-AVEL benchmark.

What carries the argument

The hierarchical semantic-constrained heterogeneous graph (HSCHG), which links audio-visual segment nodes to video-level nodes through temporal and cross-modal edges while enforcing bidirectional semantic constraints and hyperbolic hierarchical relationships via an entailment loss.

If this is right

The model integrates cross-modal cues selectively via dual-threshold gated fusion rather than unconditionally.
Semantic consistency is enforced both within each modality across time and between segment and video levels.
Hierarchical relationships between video and segment representations are captured explicitly in hyperbolic space.
The resulting representations support localization of events from categories absent from the training set.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same combination of Euclidean graph construction and hyperbolic hierarchical regularization could be tested on other multi-modal temporal tasks such as audio-visual captioning or action segmentation with novel classes.
The dual-threshold filtering step may require domain-specific retuning of the thresholds when moving from the OV-AVEL benchmark to new data distributions.
If the entailment loss proves robust, it could serve as a drop-in regularizer for other open-vocabulary multi-modal problems that already use hyperbolic embeddings.

Load-bearing premise

That bidirectional semantic constraints together with the hierarchical entailment loss in hyperbolic space will enforce consistency even when the test events belong to categories that supplied no training examples.

What would settle it

An ablation on the OV-AVEL benchmark in which removing the bidirectional constraints or the hyperbolic entailment loss causes localization accuracy on unseen categories to fall to the level of existing Euclidean baselines.

Figures

Figures reproduced from arXiv: 2606.07033 by Hengyu Man, Hongtao Chen, Ruyi Zhang, Wangmeng Zuo, Wenrui Li, Xiaopeng Fan, Zhe Yang.

**Figure 2.** Figure 2: An illustration of OV-AVEL. The model jointly leverages segment-level audio and visual cues to determine whether a target event that is both [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗

**Figure 3.** Figure 3: Architecture of HSCHG. We first extract features from audio, visual, and text inputs using a frozen pre-trained model. Subsequently, a heterogeneous [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Hyperparameter analysis on OV-AVEBench dataset. [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

**Figure 5.** Figure 5: Qualitative results of our model. The red regions stand for the answer we predict. [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗

**Figure 6.** Figure 6: Audio-visual cross-modal cosine similarity heatmap: The bright main [PITH_FULL_IMAGE:figures/full_fig_p010_6.png] view at source ↗

**Figure 7.** Figure 7: T-SNE visualization of audio and visual feature distributions at different stages of HSCHG. [PITH_FULL_IMAGE:figures/full_fig_p011_7.png] view at source ↗

**Figure 8.** Figure 8: The HSCHG embedding is used to represent the OV-AVEBench dataset on the Poincar [PITH_FULL_IMAGE:figures/full_fig_p011_8.png] view at source ↗

read the original abstract

Open-vocabulary audio-visual event localization (OV-AVEL) jointly models audio-visual cues to recognize and temporally localize events, including categories unseen during training. Existing methods primarily learn joint audio-visual representations in Euclidean space, but still face two significant challenges. First, the lack of supervision signals for unseen categories makes it difficult to maintain audio-visual consistency across multiple temporal scales. Second, the lack of hierarchical constraints between segment- and video-level semantics prevents the model from establishing semantic consistency across different levels. To address these challenges, we propose a hierarchical semantic constrained heterogeneous graph (HSCHG) for audio-visual event localization framework. We first construct a heterogeneous hierarchical graph in Euclidean space, which includes audio and visual segment nodes and their corresponding video-level nodes. We use multi-directional temporal edges to capture complete temporal information within each modality. Simultaneously, we employ a dual-threshold filtering gated fusion strategy, introducing cross-modal information only when the alignment confidence is high. Furthermore, we introduce bidirectional semantic constraints between segment- and video-level representations to achieve semantic consistency across different levels. Based on this, we map the multi-level audio-visual representations and text prototypes uniformly into hyperbolic space. We use a hierarchical entailment regularization loss to characterize the hierarchical relationships between videos and segments. Extensive experimental results show that our method outperforms existing methods on the OV-AVEL benchmark. Ablation studies further validate the effectiveness of our method.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

HSCHG adds a heterogeneous hierarchical graph with gated cross-modal fusion and hyperbolic entailment regularization to tackle consistency issues in open-vocabulary audio-visual event localization.

read the letter

The paper's main contribution is a concrete architecture for OV-AVEL that builds a heterogeneous graph over audio and visual nodes at both segment and video levels, adds multi-directional temporal edges, applies dual-threshold gated fusion to control cross-modal flow, imposes bidirectional semantic constraints between levels, and then maps everything into hyperbolic space for a hierarchical entailment loss.

This setup directly targets the two problems stated in the abstract: maintaining audio-visual consistency without supervision on unseen classes and enforcing consistency across temporal scales. The graph construction and fusion rule look like a reasonable way to limit noisy cross-modal mixing, and the hyperbolic step is a plausible way to encode the video-segment hierarchy.

The experiments claim outperformance on the OV-AVEL benchmark plus component ablations. That is the right kind of evidence to supply.

The soft spot is exactly the one in the stress-test note. The abstract and description do not isolate whether the hyperbolic entailment term is what drives gains on unseen categories or whether the Euclidean graph plus constraints would have been enough. Without that breakdown, it is hard to judge how load-bearing the new regularization actually is. The paper would be stronger with an explicit ablation that removes only the hyperbolic mapping and loss while keeping everything else fixed.

This work is aimed at researchers already working on audio-visual localization and open-vocabulary multimodal tasks. It is a focused engineering contribution rather than a broad theoretical advance.

I would send it to peer review. The core idea is clear enough and the problem is real; referees can check the missing ablation and the precise handling of unseen text prototypes.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes the Hierarchical Semantic-Constrained Heterogeneous Graph (HSCHG) framework for open-vocabulary audio-visual event localization (OV-AVEL). It constructs a heterogeneous hierarchical graph in Euclidean space with audio/visual segment and video-level nodes, multi-directional temporal edges, dual-threshold gated fusion for cross-modal information, bidirectional semantic constraints between levels, then maps representations and text prototypes to hyperbolic space with a hierarchical entailment regularization loss. The central claim is outperformance over existing methods on the OV-AVEL benchmark, with ablations validating component effectiveness.

Significance. If the results hold, the combination of heterogeneous graph structure with hyperbolic-space hierarchical entailment offers a structured way to enforce cross-level and cross-modal semantic consistency for unseen categories. The explicit treatment of both temporal multi-directionality and level-wise constraints is a positive step beyond standard Euclidean joint embeddings.

major comments (2)

[Experiments] Experiments section: the outperformance claim on OV-AVEL for unseen categories rests on the full model but provides no ablation isolating the hierarchical entailment regularization loss (and hyperbolic mapping) from the Euclidean heterogeneous graph plus bidirectional constraints alone. This is load-bearing for the open-vocabulary result, as the graph structure may suffice without the hyperbolic component.
[Method] Method section (proposed framework): the mechanism for generating text prototypes for unseen categories is not specified in sufficient detail to rule out supervision leakage, which directly affects the validity of the open-vocabulary consistency claim.

minor comments (1)

Abstract: no quantitative results, error bars, dataset statistics, or baseline numbers are supplied, which weakens the reader's ability to gauge the practical magnitude of the reported gains.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and positive assessment of the framework's potential significance. We address each major comment below.

read point-by-point responses

Referee: [Experiments] Experiments section: the outperformance claim on OV-AVEL for unseen categories rests on the full model but provides no ablation isolating the hierarchical entailment regularization loss (and hyperbolic mapping) from the Euclidean heterogeneous graph plus bidirectional constraints alone. This is load-bearing for the open-vocabulary result, as the graph structure may suffice without the hyperbolic component.

Authors: We agree that an explicit ablation isolating the hierarchical entailment regularization loss and hyperbolic mapping is necessary to substantiate the contribution to open-vocabulary performance. Our current ablations evaluate the heterogeneous graph construction, multi-directional edges, dual-threshold fusion, and bidirectional semantic constraints, but do not include a direct Euclidean-only variant without the hyperbolic component. We will add this ablation study to the revised Experiments section, reporting performance on the OV-AVEL unseen categories for the full model versus the Euclidean graph plus bidirectional constraints alone. revision: yes
Referee: [Method] Method section (proposed framework): the mechanism for generating text prototypes for unseen categories is not specified in sufficient detail to rule out supervision leakage, which directly affects the validity of the open-vocabulary consistency claim.

Authors: We acknowledge that additional detail is required. Text prototypes for unseen categories are obtained by passing the raw category name strings through a frozen pre-trained CLIP text encoder with no fine-tuning or exposure to the OV-AVEL training data or labels. This design precludes supervision leakage. We will expand the Method section with a new subsection explicitly describing the text prototype generation pipeline, the encoder choice, its frozen status, and the absence of any target-dataset supervision. revision: yes

Circularity Check

0 steps flagged

No circularity: method description and experimental claims are self-contained without reductions to fitted inputs or self-referential definitions.

full rationale

The paper proposes a heterogeneous graph construction in Euclidean space followed by hyperbolic mapping and a hierarchical entailment loss, then reports outperformance on the OV-AVEL benchmark via experiments and ablations. No equations, parameter-fitting steps, or derivations are exhibited that reduce a claimed prediction or result to an input quantity by construction. The bidirectional constraints and entailment regularization are presented as architectural choices whose contribution is assessed externally via ablation, not defined in terms of the target consistency metric. Self-citations are absent from the provided text, and the open-vocabulary claim rests on benchmark evaluation rather than any self-definitional loop. This is the normal case of an empirical ML architecture paper whose central claims remain falsifiable outside its own fitted values.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, mathematical axioms, or invented entities; the described components (graphs, thresholds, loss) are treated as modeling choices rather than new postulates with independent evidence.

pith-pipeline@v0.9.1-grok · 5796 in / 1243 out tokens · 21781 ms · 2026-06-27T22:19:39.422858+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

43 extracted references · 1 linked inside Pith

[1]

Multi-layer probabilistic association reasoning network for image-text retrieval.IEEE Transac- tions on Circuits and Systems for Video Technology, 34(10):9706–9717, 2024

Wenrui Li, Ruiqin Xiong, and Xiaopeng Fan. Multi-layer probabilistic association reasoning network for image-text retrieval.IEEE Transac- tions on Circuits and Systems for Video Technology, 34(10):9706–9717, 2024

2024
[2]

Fedpam: Federated personalized augmentation model for text- to-image retrieval

Yueying Feng, Fan Ma, Wang Lin, Chang Yao, Jingyuan Chen, and Yi Yang. Fedpam: Federated personalized augmentation model for text- to-image retrieval. InProceedings of the 2024 International Conference on Multimedia Retrieval, pages 1185–1189, 2024

2024
[3]

Neuron-based spiking transmission and reasoning network for robust image-text retrieval.IEEE Transactions on Circuits and Systems for Video Technology, 33(7):3516–3528, 2023

Wenrui Li, Zhengyu Ma, Liang-Jian Deng, Xiaopeng Fan, and Yonghong Tian. Neuron-based spiking transmission and reasoning network for robust image-text retrieval.IEEE Transactions on Circuits and Systems for Video Technology, 33(7):3516–3528, 2023

2023
[4]

Reservoir computing transformer for image- text retrieval

Wenrui Li, Zhengyu Ma, Liang-Jian Deng, Penghong Wang, Jinqiao Shi, and Xiaopeng Fan. Reservoir computing transformer for image- text retrieval. InProceedings of the 31st ACM International Conference on Multimedia, pages 5605–5613, 2023. JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 10, JULY 2024 12

2023
[5]

Text-video retrieval with global-local semantic consistent learning.IEEE Transactions on Image Processing, 34:3463–3474, 2025

Haonan Zhang, Pengpeng Zeng, Lianli Gao, Jingkuan Song, Yihang Duan, Xinyu Lyu, and Heng Tao Shen. Text-video retrieval with global-local semantic consistent learning.IEEE Transactions on Image Processing, 34:3463–3474, 2025

2025
[6]

Ump: Unified modality-aware prompt tuning for text- video retrieval.IEEE Transactions on Circuits and Systems for Video Technology, 34(11):11954–11964, 2024

Haonan Zhang, Pengpeng Zeng, Lianli Gao, Jingkuan Song, and Heng Tao Shen. Ump: Unified modality-aware prompt tuning for text- video retrieval.IEEE Transactions on Circuits and Systems for Video Technology, 34(11):11954–11964, 2024

2024
[7]

Exploiting unlabeled videos for video-text retrieval via pseudo-supervised learning.IEEE Transactions on Image Processing, 33:6748–6760, 2024

Yu Lu, Ruijie Quan, Linchao Zhu, and Yi Yang. Exploiting unlabeled videos for video-text retrieval via pseudo-supervised learning.IEEE Transactions on Image Processing, 33:6748–6760, 2024

2024
[8]

Zero-shot video ground- ing with pseudo query lookup and verification.IEEE Transactions on Image Processing, 33:1643–1654, 2024

Yu Lu, Ruijie Quan, Linchao Zhu, and Yi Yang. Zero-shot video ground- ing with pseudo query lookup and verification.IEEE Transactions on Image Processing, 33:1643–1654, 2024

2024
[9]

Aicl: Action in-context learning for text-to-video generation

Jianzhi Liu, Junchen Zhu, Pengpeng Zeng, Lianli Gao, Heng Tao Shen, and Jingkuan Song. Aicl: Action in-context learning for text-to-video generation. InProceedings of the 33rd ACM International Conference on Multimedia, pages 9414–9423, 2025

2025
[10]

Spiking tucker fusion transformer for audio-visual zero-shot learning.IEEE Transactions on Image Processing, 33:4840–4852, 2024

Wenrui Li, Penghong Wang, Ruiqin Xiong, and Xiaopeng Fan. Spiking tucker fusion transformer for audio-visual zero-shot learning.IEEE Transactions on Image Processing, 33:4840–4852, 2024

2024
[11]

Multi-timescale motion-decoupled spiking transformer for audio-visual zero-shot learning.IEEE Transactions on Circuits and Systems for Video Technology, pages 1–1, 2025

Wenrui Li, Penghong Wang, Xingtao Wang, Wangmeng Zuo, Xiaopeng Fan, and Yonghong Tian. Multi-timescale motion-decoupled spiking transformer for audio-visual zero-shot learning.IEEE Transactions on Circuits and Systems for Video Technology, pages 1–1, 2025

2025
[12]

Motion-decoupled spiking transformer for audio- visual zero-shot learning

Wenrui Li, Xi-Le Zhao, Zhengyu Ma, Xingtao Wang, Xiaopeng Fan, and Yonghong Tian. Motion-decoupled spiking transformer for audio- visual zero-shot learning. InProceedings of the 31st ACM International Conference on Multimedia, pages 3994–4002, 2023

2023
[13]

Multi-modal spiking tensor regression network for audio-visual zero-shot learning

Zhe Yang, Wenrui Li, Jinxiu Hou, and Guanghui Cheng. Multi-modal spiking tensor regression network for audio-visual zero-shot learning. Neurocomputing, 629:129636, 2025

2025
[14]

Spiking variational graph representation inference for video summarization.IEEE Transactions on Image Processing, 34:5697–5709, 2025

Wenrui Li, Wei Han, Liang-Jian Deng, Ruiqin Xiong, and Xiaopeng Fan. Spiking variational graph representation inference for video summarization.IEEE Transactions on Image Processing, 34:5697–5709, 2025

2025
[16]

Dis- criminative cross-modality attention network for temporal inconsistent audio-visual event localization.IEEE Transactions on Image Processing, 30:7878–7888, 2021

Hanyu Xuan, Lei Luo, Zhenyu Zhang, Jian Yang, and Yan Yan. Dis- criminative cross-modality attention network for temporal inconsistent audio-visual event localization.IEEE Transactions on Image Processing, 30:7878–7888, 2021

2021
[17]

Cross-modal background suppression for audio-visual event localization

Yan Xia and Zhou Zhao. Cross-modal background suppression for audio-visual event localization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19989– 19998, June 2022

2022
[18]

Dense modality interaction network for audio-visual event localization.IEEE Transactions on Multimedia, 25:2734–2748, 2023

Shuo Liu, Weize Quan, Chaoqun Wang, Yuan Liu, Bin Liu, and Dong- Ming Yan. Dense modality interaction network for audio-visual event localization.IEEE Transactions on Multimedia, 25:2734–2748, 2023

2023
[19]

Learning event-specific localization preferences for audio- visual event localization

Shiping Ge, Zhiwei Jiang, Yafeng Yin, Cong Wang, Zifeng Cheng, and Qing Gu. Learning event-specific localization preferences for audio- visual event localization. InProceedings of the 31st ACM International Conference on Multimedia, pages 3446–3454, 2023

2023
[20]

Towards open-vocabulary audio-visual event localization

Jinxing Zhou, Dan Guo, Ruohao Guo, Yuxin Mao, Jingjing Hu, Yiran Zhong, Xiaojun Chang, and Meng Wang. Towards open-vocabulary audio-visual event localization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8362– 8371, June 2025

2025
[21]

Ov-davel: Towards open-vocabulary dense audio-visual event localization in untrimmed videos

Jiale Yu, Baopeng Zhang, Zhu Teng, and Jianping Fan. Ov-davel: Towards open-vocabulary dense audio-visual event localization in untrimmed videos. InProceedings of the 33rd ACM International Conference on Multimedia, pages 553–562, 2025

2025
[22]

Poincar ´e embeddings for learn- ing hierarchical representations

Maximillian Nickel and Douwe Kiela. Poincar ´e embeddings for learn- ing hierarchical representations. InAdvances in Neural Information Processing Systems, volume 30, 2017

2017
[23]

Hyperbolic audio-visual zero-shot learning

Jie Hong, Zeeshan Hayder, Junlin Han, Pengfei Fang, Mehrtash Harandi, and Lars Petersson. Hyperbolic audio-visual zero-shot learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7873–7883, October 2023

2023
[24]

Shmamba: Structured hyperbolic state space model for audio-visual question answering.IEEE Transactions on Audio, Speech and Language Processing, 33:3582– 3593, 2025

Zhe Yang, Wenrui Li, and Guanghui Cheng. Shmamba: Structured hyperbolic state space model for audio-visual question answering.IEEE Transactions on Audio, Speech and Language Processing, 33:3582– 3593, 2025

2025
[25]

Css-net: A consistent segment selection network for audio-visual event localization

Fan Feng, Yue Ming, Nannan Hu, Hui Yu, and Yuanan Liu. Css-net: A consistent segment selection network for audio-visual event localization. IEEE Transactions on Multimedia, 26:701–713, 2024

2024
[26]

Listen with seeing: Cross-modal contrastive learning for audio- visual event localization.IEEE Transactions on Multimedia, 27:2650– 2665, 2025

Chao Sun, Min Chen, Chuanbo Zhu, Sheng Zhang, Ping Lu, and Jincai Chen. Listen with seeing: Cross-modal contrastive learning for audio- visual event localization.IEEE Transactions on Multimedia, 27:2650– 2665, 2025

2025
[27]

Dense audio-visual event localization under cross-modal consistency and multi-temporal granularity collaboration

Ziheng Zhou, Jinxing Zhou, Wei Qian, Shengeng Tang, Xiaojun Chang, and Dan Guo. Dense audio-visual event localization under cross-modal consistency and multi-temporal granularity collaboration. InProceed- ings of the AAAI Conference on Artificial Intelligence, volume 39, pages 10905–10913, April 2025

2025
[28]

Fine-grained audio–visual event localization.IEEE Transactions on Neural Networks and Learning Systems, pages 1–15, 2025

Baoyu Fan, Lu Liu, Xiaochuan Li, Runze Zhang, Liang Jin, and Jin Zhang. Fine-grained audio–visual event localization.IEEE Transactions on Neural Networks and Learning Systems, pages 1–15, 2025

2025
[29]

Hierarchical attention networks for document classification

Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Ed- uard Hovy. Hierarchical attention networks for document classification. InProceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489, 2016

2016
[30]

Hierarchical self-attention network for action localization in videos

Rizard Renanda Adhi Pramono, Yie-Tarng Chen, and Wen-Hsien Fang. Hierarchical self-attention network for action localization in videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, October 2019

2019
[31]

Learning from untrimmed videos: Self-supervised video representation learning with hierarchical consistency

Zhiwu Qing, Shiwei Zhang, Ziyuan Huang, Yi Xu, Xiang Wang, Mingqian Tang, Changxin Gao, Rong Jin, and Nong Sang. Learning from untrimmed videos: Self-supervised video representation learning with hierarchical consistency. InProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition, pages 13821–13831, June 2022

2022
[32]

Hyperbolic entailment cones for learning hierarchical embeddings

Octavian Ganea, Gary Becigneul, and Thomas Hofmann. Hyperbolic entailment cones for learning hierarchical embeddings. InProceedings of the 35th International Conference on Machine Learning, pages 1646– 1655, July 2018

2018
[33]

Compositional entailment learning for hyperbolic vision-language models

Avik Pal, Max van Spengler, Guido Maria D’Amely di Melendugno, Alessandro Flaborea, Fabio Galasso, and Pascal Mettes. Compositional entailment learning for hyperbolic vision-language models. InThe Thirteenth International Conference on Learning Representations, 2025

2025
[34]

Language-guided graph representation learning for video summarization.IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–17, 2025

Wenrui Li, Wei Han, Hengyu Man, Wangmeng Zuo, Xiaopeng Fan, and Yonghong Tian. Language-guided graph representation learning for video summarization.IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–17, 2025

2025
[35]

Hyperbolic geometry.Flavors of Geometry, 31:59–115, 1997

James W Cannon, William J Floyd, Richard Kenyon, Walter R Parry, et al. Hyperbolic geometry.Flavors of Geometry, 31:59–115, 1997

1997
[36]

Inferring concept hierarchies from text corpora via hyperbolic embeddings

Matthew Le, Stephen Roller, Laetitia Papaxanthos, Douwe Kiela, and Maximilian Nickel. Inferring concept hierarchies from text corpora via hyperbolic embeddings. InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3231–3241, July 2019

2019
[37]

Cross-modal relation-aware networks for audio-visual event localization

Haoming Xu, Runhao Zeng, Qingyao Wu, Mingkui Tan, and Chuang Gan. Cross-modal relation-aware networks for audio-visual event localization. InProceedings of the 28th ACM International Conference on Multimedia, pages 3893–3901, 2020

2020
[38]

Audio-visual event localization in unconstrained videos

Yapeng Tian, Jing Shi, Bochen Li, Zhiyao Duan, and Chenliang Xu. Audio-visual event localization in unconstrained videos. InProceedings of the European Conference on Computer Vision, September 2018

2018
[39]

Positive sample propagation along the audio-visual event line

Jinxing Zhou, Liang Zheng, Yiran Zhong, Shijie Hao, and Meng Wang. Positive sample propagation along the audio-visual event line. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8436–8444, June 2021

2021
[40]

Mm-pyramid: Multimodal pyramid attentional network for audio-visual event localization and video parsing

Jiashuo Yu, Ying Cheng, Rui-Wei Zhao, Rui Feng, and Yuejie Zhang. Mm-pyramid: Multimodal pyramid attentional network for audio-visual event localization and video parsing. InProceedings of the 30th ACM International Conference on Multimedia, pages 6241–6249, 2022

2022
[41]

Imagebind: One embedding space to bind them all

Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. Imagebind: One embedding space to bind them all. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15180–15190, June 2023

2023
[42]

Visualizing data using t-sne.Journal of Machine Learning Research, 9(Nov):2579–2605, 2008

Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne.Journal of Machine Learning Research, 9(Nov):2579–2605, 2008

2008
[43]

Hyperbolic-constraint point cloud reconstruction from single rgb-d images

Wenrui Li, Zhe Yang, Wei Han, Hengyu Man, Xingtao Wang, and Xiaopeng Fan. Hyperbolic-constraint point cloud reconstruction from single rgb-d images. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 4959–4967, April 2025

2025
[44]

Umap: Uniform manifold approximation and projection for dimension reduction.arXiv preprint arXiv:1802.03426, 2020

Leland McInnes, John Healy, and James Melville. Umap: Uniform manifold approximation and projection for dimension reduction.arXiv preprint arXiv:1802.03426, 2020

Pith/arXiv arXiv 2020

[1] [1]

Multi-layer probabilistic association reasoning network for image-text retrieval.IEEE Transac- tions on Circuits and Systems for Video Technology, 34(10):9706–9717, 2024

Wenrui Li, Ruiqin Xiong, and Xiaopeng Fan. Multi-layer probabilistic association reasoning network for image-text retrieval.IEEE Transac- tions on Circuits and Systems for Video Technology, 34(10):9706–9717, 2024

2024

[2] [2]

Fedpam: Federated personalized augmentation model for text- to-image retrieval

Yueying Feng, Fan Ma, Wang Lin, Chang Yao, Jingyuan Chen, and Yi Yang. Fedpam: Federated personalized augmentation model for text- to-image retrieval. InProceedings of the 2024 International Conference on Multimedia Retrieval, pages 1185–1189, 2024

2024

[3] [3]

Neuron-based spiking transmission and reasoning network for robust image-text retrieval.IEEE Transactions on Circuits and Systems for Video Technology, 33(7):3516–3528, 2023

Wenrui Li, Zhengyu Ma, Liang-Jian Deng, Xiaopeng Fan, and Yonghong Tian. Neuron-based spiking transmission and reasoning network for robust image-text retrieval.IEEE Transactions on Circuits and Systems for Video Technology, 33(7):3516–3528, 2023

2023

[4] [4]

Reservoir computing transformer for image- text retrieval

Wenrui Li, Zhengyu Ma, Liang-Jian Deng, Penghong Wang, Jinqiao Shi, and Xiaopeng Fan. Reservoir computing transformer for image- text retrieval. InProceedings of the 31st ACM International Conference on Multimedia, pages 5605–5613, 2023. JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 10, JULY 2024 12

2023

[5] [5]

Text-video retrieval with global-local semantic consistent learning.IEEE Transactions on Image Processing, 34:3463–3474, 2025

Haonan Zhang, Pengpeng Zeng, Lianli Gao, Jingkuan Song, Yihang Duan, Xinyu Lyu, and Heng Tao Shen. Text-video retrieval with global-local semantic consistent learning.IEEE Transactions on Image Processing, 34:3463–3474, 2025

2025

[6] [6]

Ump: Unified modality-aware prompt tuning for text- video retrieval.IEEE Transactions on Circuits and Systems for Video Technology, 34(11):11954–11964, 2024

Haonan Zhang, Pengpeng Zeng, Lianli Gao, Jingkuan Song, and Heng Tao Shen. Ump: Unified modality-aware prompt tuning for text- video retrieval.IEEE Transactions on Circuits and Systems for Video Technology, 34(11):11954–11964, 2024

2024

[7] [7]

Exploiting unlabeled videos for video-text retrieval via pseudo-supervised learning.IEEE Transactions on Image Processing, 33:6748–6760, 2024

Yu Lu, Ruijie Quan, Linchao Zhu, and Yi Yang. Exploiting unlabeled videos for video-text retrieval via pseudo-supervised learning.IEEE Transactions on Image Processing, 33:6748–6760, 2024

2024

[8] [8]

Zero-shot video ground- ing with pseudo query lookup and verification.IEEE Transactions on Image Processing, 33:1643–1654, 2024

Yu Lu, Ruijie Quan, Linchao Zhu, and Yi Yang. Zero-shot video ground- ing with pseudo query lookup and verification.IEEE Transactions on Image Processing, 33:1643–1654, 2024

2024

[9] [9]

Aicl: Action in-context learning for text-to-video generation

Jianzhi Liu, Junchen Zhu, Pengpeng Zeng, Lianli Gao, Heng Tao Shen, and Jingkuan Song. Aicl: Action in-context learning for text-to-video generation. InProceedings of the 33rd ACM International Conference on Multimedia, pages 9414–9423, 2025

2025

[10] [10]

Spiking tucker fusion transformer for audio-visual zero-shot learning.IEEE Transactions on Image Processing, 33:4840–4852, 2024

Wenrui Li, Penghong Wang, Ruiqin Xiong, and Xiaopeng Fan. Spiking tucker fusion transformer for audio-visual zero-shot learning.IEEE Transactions on Image Processing, 33:4840–4852, 2024

2024

[11] [11]

Multi-timescale motion-decoupled spiking transformer for audio-visual zero-shot learning.IEEE Transactions on Circuits and Systems for Video Technology, pages 1–1, 2025

Wenrui Li, Penghong Wang, Xingtao Wang, Wangmeng Zuo, Xiaopeng Fan, and Yonghong Tian. Multi-timescale motion-decoupled spiking transformer for audio-visual zero-shot learning.IEEE Transactions on Circuits and Systems for Video Technology, pages 1–1, 2025

2025

[12] [12]

Motion-decoupled spiking transformer for audio- visual zero-shot learning

Wenrui Li, Xi-Le Zhao, Zhengyu Ma, Xingtao Wang, Xiaopeng Fan, and Yonghong Tian. Motion-decoupled spiking transformer for audio- visual zero-shot learning. InProceedings of the 31st ACM International Conference on Multimedia, pages 3994–4002, 2023

2023

[13] [13]

Multi-modal spiking tensor regression network for audio-visual zero-shot learning

Zhe Yang, Wenrui Li, Jinxiu Hou, and Guanghui Cheng. Multi-modal spiking tensor regression network for audio-visual zero-shot learning. Neurocomputing, 629:129636, 2025

2025

[14] [14]

Spiking variational graph representation inference for video summarization.IEEE Transactions on Image Processing, 34:5697–5709, 2025

Wenrui Li, Wei Han, Liang-Jian Deng, Ruiqin Xiong, and Xiaopeng Fan. Spiking variational graph representation inference for video summarization.IEEE Transactions on Image Processing, 34:5697–5709, 2025

2025

[15] [16]

Dis- criminative cross-modality attention network for temporal inconsistent audio-visual event localization.IEEE Transactions on Image Processing, 30:7878–7888, 2021

Hanyu Xuan, Lei Luo, Zhenyu Zhang, Jian Yang, and Yan Yan. Dis- criminative cross-modality attention network for temporal inconsistent audio-visual event localization.IEEE Transactions on Image Processing, 30:7878–7888, 2021

2021

[16] [17]

Cross-modal background suppression for audio-visual event localization

Yan Xia and Zhou Zhao. Cross-modal background suppression for audio-visual event localization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19989– 19998, June 2022

2022

[17] [18]

Dense modality interaction network for audio-visual event localization.IEEE Transactions on Multimedia, 25:2734–2748, 2023

Shuo Liu, Weize Quan, Chaoqun Wang, Yuan Liu, Bin Liu, and Dong- Ming Yan. Dense modality interaction network for audio-visual event localization.IEEE Transactions on Multimedia, 25:2734–2748, 2023

2023

[18] [19]

Learning event-specific localization preferences for audio- visual event localization

Shiping Ge, Zhiwei Jiang, Yafeng Yin, Cong Wang, Zifeng Cheng, and Qing Gu. Learning event-specific localization preferences for audio- visual event localization. InProceedings of the 31st ACM International Conference on Multimedia, pages 3446–3454, 2023

2023

[19] [20]

Towards open-vocabulary audio-visual event localization

Jinxing Zhou, Dan Guo, Ruohao Guo, Yuxin Mao, Jingjing Hu, Yiran Zhong, Xiaojun Chang, and Meng Wang. Towards open-vocabulary audio-visual event localization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8362– 8371, June 2025

2025

[20] [21]

Ov-davel: Towards open-vocabulary dense audio-visual event localization in untrimmed videos

Jiale Yu, Baopeng Zhang, Zhu Teng, and Jianping Fan. Ov-davel: Towards open-vocabulary dense audio-visual event localization in untrimmed videos. InProceedings of the 33rd ACM International Conference on Multimedia, pages 553–562, 2025

2025

[21] [22]

Poincar ´e embeddings for learn- ing hierarchical representations

Maximillian Nickel and Douwe Kiela. Poincar ´e embeddings for learn- ing hierarchical representations. InAdvances in Neural Information Processing Systems, volume 30, 2017

2017

[22] [23]

Hyperbolic audio-visual zero-shot learning

Jie Hong, Zeeshan Hayder, Junlin Han, Pengfei Fang, Mehrtash Harandi, and Lars Petersson. Hyperbolic audio-visual zero-shot learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7873–7883, October 2023

2023

[23] [24]

Shmamba: Structured hyperbolic state space model for audio-visual question answering.IEEE Transactions on Audio, Speech and Language Processing, 33:3582– 3593, 2025

Zhe Yang, Wenrui Li, and Guanghui Cheng. Shmamba: Structured hyperbolic state space model for audio-visual question answering.IEEE Transactions on Audio, Speech and Language Processing, 33:3582– 3593, 2025

2025

[24] [25]

Css-net: A consistent segment selection network for audio-visual event localization

Fan Feng, Yue Ming, Nannan Hu, Hui Yu, and Yuanan Liu. Css-net: A consistent segment selection network for audio-visual event localization. IEEE Transactions on Multimedia, 26:701–713, 2024

2024

[25] [26]

Listen with seeing: Cross-modal contrastive learning for audio- visual event localization.IEEE Transactions on Multimedia, 27:2650– 2665, 2025

Chao Sun, Min Chen, Chuanbo Zhu, Sheng Zhang, Ping Lu, and Jincai Chen. Listen with seeing: Cross-modal contrastive learning for audio- visual event localization.IEEE Transactions on Multimedia, 27:2650– 2665, 2025

2025

[26] [27]

Dense audio-visual event localization under cross-modal consistency and multi-temporal granularity collaboration

Ziheng Zhou, Jinxing Zhou, Wei Qian, Shengeng Tang, Xiaojun Chang, and Dan Guo. Dense audio-visual event localization under cross-modal consistency and multi-temporal granularity collaboration. InProceed- ings of the AAAI Conference on Artificial Intelligence, volume 39, pages 10905–10913, April 2025

2025

[27] [28]

Fine-grained audio–visual event localization.IEEE Transactions on Neural Networks and Learning Systems, pages 1–15, 2025

Baoyu Fan, Lu Liu, Xiaochuan Li, Runze Zhang, Liang Jin, and Jin Zhang. Fine-grained audio–visual event localization.IEEE Transactions on Neural Networks and Learning Systems, pages 1–15, 2025

2025

[28] [29]

Hierarchical attention networks for document classification

Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Ed- uard Hovy. Hierarchical attention networks for document classification. InProceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489, 2016

2016

[29] [30]

Hierarchical self-attention network for action localization in videos

Rizard Renanda Adhi Pramono, Yie-Tarng Chen, and Wen-Hsien Fang. Hierarchical self-attention network for action localization in videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, October 2019

2019

[30] [31]

Learning from untrimmed videos: Self-supervised video representation learning with hierarchical consistency

Zhiwu Qing, Shiwei Zhang, Ziyuan Huang, Yi Xu, Xiang Wang, Mingqian Tang, Changxin Gao, Rong Jin, and Nong Sang. Learning from untrimmed videos: Self-supervised video representation learning with hierarchical consistency. InProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition, pages 13821–13831, June 2022

2022

[31] [32]

Hyperbolic entailment cones for learning hierarchical embeddings

Octavian Ganea, Gary Becigneul, and Thomas Hofmann. Hyperbolic entailment cones for learning hierarchical embeddings. InProceedings of the 35th International Conference on Machine Learning, pages 1646– 1655, July 2018

2018

[32] [33]

Compositional entailment learning for hyperbolic vision-language models

Avik Pal, Max van Spengler, Guido Maria D’Amely di Melendugno, Alessandro Flaborea, Fabio Galasso, and Pascal Mettes. Compositional entailment learning for hyperbolic vision-language models. InThe Thirteenth International Conference on Learning Representations, 2025

2025

[33] [34]

Language-guided graph representation learning for video summarization.IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–17, 2025

Wenrui Li, Wei Han, Hengyu Man, Wangmeng Zuo, Xiaopeng Fan, and Yonghong Tian. Language-guided graph representation learning for video summarization.IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–17, 2025

2025

[34] [35]

Hyperbolic geometry.Flavors of Geometry, 31:59–115, 1997

James W Cannon, William J Floyd, Richard Kenyon, Walter R Parry, et al. Hyperbolic geometry.Flavors of Geometry, 31:59–115, 1997

1997

[35] [36]

Inferring concept hierarchies from text corpora via hyperbolic embeddings

Matthew Le, Stephen Roller, Laetitia Papaxanthos, Douwe Kiela, and Maximilian Nickel. Inferring concept hierarchies from text corpora via hyperbolic embeddings. InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3231–3241, July 2019

2019

[36] [37]

Cross-modal relation-aware networks for audio-visual event localization

Haoming Xu, Runhao Zeng, Qingyao Wu, Mingkui Tan, and Chuang Gan. Cross-modal relation-aware networks for audio-visual event localization. InProceedings of the 28th ACM International Conference on Multimedia, pages 3893–3901, 2020

2020

[37] [38]

Audio-visual event localization in unconstrained videos

Yapeng Tian, Jing Shi, Bochen Li, Zhiyao Duan, and Chenliang Xu. Audio-visual event localization in unconstrained videos. InProceedings of the European Conference on Computer Vision, September 2018

2018

[38] [39]

Positive sample propagation along the audio-visual event line

Jinxing Zhou, Liang Zheng, Yiran Zhong, Shijie Hao, and Meng Wang. Positive sample propagation along the audio-visual event line. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8436–8444, June 2021

2021

[39] [40]

Mm-pyramid: Multimodal pyramid attentional network for audio-visual event localization and video parsing

Jiashuo Yu, Ying Cheng, Rui-Wei Zhao, Rui Feng, and Yuejie Zhang. Mm-pyramid: Multimodal pyramid attentional network for audio-visual event localization and video parsing. InProceedings of the 30th ACM International Conference on Multimedia, pages 6241–6249, 2022

2022

[40] [41]

Imagebind: One embedding space to bind them all

Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. Imagebind: One embedding space to bind them all. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15180–15190, June 2023

2023

[41] [42]

Visualizing data using t-sne.Journal of Machine Learning Research, 9(Nov):2579–2605, 2008

Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne.Journal of Machine Learning Research, 9(Nov):2579–2605, 2008

2008

[42] [43]

Hyperbolic-constraint point cloud reconstruction from single rgb-d images

Wenrui Li, Zhe Yang, Wei Han, Hengyu Man, Xingtao Wang, and Xiaopeng Fan. Hyperbolic-constraint point cloud reconstruction from single rgb-d images. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 4959–4967, April 2025

2025

[43] [44]

Umap: Uniform manifold approximation and projection for dimension reduction.arXiv preprint arXiv:1802.03426, 2020

Leland McInnes, John Healy, and James Melville. Umap: Uniform manifold approximation and projection for dimension reduction.arXiv preprint arXiv:1802.03426, 2020

Pith/arXiv arXiv 2020