
arxiv: 2604.07859 · v1 · submitted 2026-04-09 · 📡 eess.SP

Object-Attribute-Relation Model Driven Adaptive Hierarchical Transmission for Multimodal Semantic Communication

Pith reviewed 2026-05-10 18:07 UTC · model grok-4.3

classification 📡 eess.SP
keywords multimodal semantic communication · object-attribute-relation model · adaptive hierarchical transmission · low-bandwidth efficiency · cliff effect elimination · scene graph accuracy · cross-modal fusion · embodied agents

The pith

An Object-Attribute-Relation graph fuses multimodal data for transmission that cuts bandwidth by over 90 percent at 1-3 kbps while eliminating cliff effects in fading channels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes that building a decision-oriented topological graph from fused visual, text, and audio streams allows machines to receive only the semantic elements needed for tasks, bypassing pixel-level video reconstruction. The framework allocates limited bandwidth according to semantic priority and draws on cross-modal priors to sustain performance when one modality degrades. A sympathetic reader would care because conventional codecs optimized for human eyes create excessive latency and sudden total failures precisely when wireless links are most constrained, blocking real-time operation for agents that must act on partial data. If the approach holds, embodied systems could continue functioning reliably where current digital schemes collapse.

Core claim

The framework builds an adaptive Object-Attribute-Relation hierarchy that fuses the modalities into a topological graph, transmits only priority-ranked semantic elements, and compensates for visual loss with text and audio priors. At 1-3 kbps it reports over 90 percent bandwidth reduction relative to state-of-the-art digital codecs, with superior scene-graph accuracy. At SNR of 4 dB or below it removes the cliff effect by strictly preserving object anchors, and it reports an 89 percent reduction in end-to-end latency relative to those codecs.

What carries the argument

The Object-Attribute-Relation (O-A-R) topological graph, which fuses multimodal inputs into a hierarchical structure that encodes semantic priorities for adaptive bandwidth allocation and cross-modal compensation.
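
As a rough illustration of what such a graph might look like in code, the sketch below stores objects, relations, and attributes separately and ranks them Object > Relation > Attribute, the ordering given in the Figure 1 caption. The names `OARGraph`, `Element`, and `PRIORITY`, and the numeric priority values, are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field

# Priority order from the Figure 1 caption: Object > Relation > Attribute.
# Lower number = more important. The numeric values are an assumption.
PRIORITY = {"object": 0, "relation": 1, "attribute": 2}


@dataclass(frozen=True)
class Element:
    kind: str       # "object", "relation", or "attribute"
    payload: tuple  # e.g. ("cup",), ("cup", "on", "table"), ("cup", "red")


@dataclass
class OARGraph:
    objects: list = field(default_factory=list)
    relations: list = field(default_factory=list)   # (subject, predicate, object)
    attributes: list = field(default_factory=list)  # (object, attribute)

    def add_object(self, name):
        if name not in self.objects:
            self.objects.append(name)

    def add_relation(self, subj, pred, obj):
        # Relations hang off objects, so make sure both endpoints exist.
        self.add_object(subj)
        self.add_object(obj)
        self.relations.append((subj, pred, obj))

    def add_attribute(self, obj, attr):
        self.add_object(obj)
        self.attributes.append((obj, attr))

    def ranked_elements(self):
        """Return all elements sorted by semantic priority (stable within a level)."""
        elems = [Element("object", (o,)) for o in self.objects]
        elems += [Element("relation", r) for r in self.relations]
        elems += [Element("attribute", a) for a in self.attributes]
        return sorted(elems, key=lambda e: PRIORITY[e.kind])


g = OARGraph()
g.add_relation("cup", "on", "table")
g.add_attribute("cup", "red")
kinds = [e.kind for e in g.ranked_elements()]
print(kinds)  # ['object', 'object', 'relation', 'attribute']
```

Keeping the three element types in separate lists makes it straightforward to treat objects as anchors that relations and attributes refer to, which is the role the summary assigns to them.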

If this is right

  • Machines receive decision-critical data at roughly one-tenth the bandwidth of conventional video streams while retaining higher scene-graph accuracy.
  • Performance degrades gradually rather than collapsing when channel conditions worsen, because object anchors remain protected.
  • End-to-end latency falls by 89 percent, enabling real-time responses for embodied agents that current codecs cannot support.
  • Text and audio streams directly substitute for missing visual detail, removing the need for separate high-rate visual codecs.
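
The gradual-degradation behaviour described above can be pictured with a simple priority-first selection under a bit budget. This is a minimal sketch, not the paper's policy controller or transmission mask: the function `select_for_budget`, the greedy rule, and every per-element cost in bits are assumptions made for illustration.

```python
# Hypothetical stand-in for priority-based allocation: objects are considered
# first, then relations, then attributes. Costs in bits are made up.
def select_for_budget(elements, budget_bits):
    """Greedy priority-first selection.

    elements: list of (kind, name, cost_bits) tuples.
    Returns (chosen, used_bits). Elements that no longer fit are skipped.
    """
    order = {"object": 0, "relation": 1, "attribute": 2}
    chosen, used = [], 0
    for kind, name, cost in sorted(elements, key=lambda e: order[e[0]]):
        if used + cost <= budget_bits:
            chosen.append((kind, name))
            used += cost
    return chosen, used


scene = [
    ("attribute", "cup:red", 40),
    ("relation", "cup-on-table", 60),
    ("object", "cup", 50),
    ("object", "table", 50),
]

for budget in (250, 160, 100):
    sent, used = select_for_budget(scene, budget)
    print(budget, used, [name for _, name in sent])
# 250 200 ['cup', 'table', 'cup-on-table', 'cup:red']
# 160 160 ['cup', 'table', 'cup-on-table']
# 100 100 ['cup', 'table']
```

As the budget shrinks, attributes drop out first and relations next, while the object entries survive, which is the qualitative pattern the bullets describe.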

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same graph structure could support other machine-perception pipelines such as action recognition or navigation planning if the completeness assumption extends beyond scene graphs.
  • Communication protocols for future AI networks might standardize on semantic graphs rather than compressed pixels for machine-to-machine links.
  • Deployment on physical robots would reveal whether the cross-modal compensation introduces domain-specific biases when real sensor noise differs from the tested conditions.
  • Integration with existing knowledge-base systems could let the transmitted graph feed directly into higher-level reasoning modules without intermediate decoding.

Load-bearing premise

The Object-Attribute-Relation graph constructed from fused modalities contains every piece of information required for accurate downstream machine decisions, and cross-modal text and audio priors can fill visual gaps without introducing errors that matter to those decisions.

What would settle it

Run a side-by-side test at 2 kbps bandwidth in a 3 dB SNR fading channel: if the proposed system sustains scene-graph accuracy above 80 percent and task completion for an embodied agent while VVC or HEVC drops to zero success, the claim is supported; equal or worse accuracy would falsify it.
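
One way to score such a test is sketched below, using a Recall@K measure of the kind named in the figure captions and a small function that applies the decision rule from the paragraph above. The helper names, the toy triplets, and the "inconclusive" outcome for cases the paragraph does not cover are assumptions of this sketch, and the rule is simplified to scene-graph accuracy only (task completion is not modelled).

```python
def recall_at_k(ranked_predictions, ground_truth, k):
    """Fraction of distinct ground-truth items found in the top-k predictions."""
    truth = set(ground_truth)
    if not truth:
        raise ValueError("ground_truth must not be empty")
    top_k = set(ranked_predictions[:k])
    return sum(1 for item in truth if item in top_k) / len(truth)


def verdict(proposed_acc, baseline_acc, threshold=0.80):
    """Simplified decision rule: accuracies are fractions in [0, 1]."""
    if proposed_acc <= baseline_acc:
        return "falsified"      # equal or worse accuracy than the codec baseline
    if proposed_acc > threshold and baseline_acc == 0.0:
        return "supported"      # above 80 percent while the baseline is at zero
    return "inconclusive"       # assumption: case not covered by the paragraph


# Toy data, purely illustrative.
gt = [
    ("cup", "on", "table"),
    ("man", "holding", "cup"),
    ("lamp", "near", "table"),
    ("dog", "under", "table"),
    ("book", "on", "table"),
]
preds = [gt[0], ("cup", "near", "table"), gt[1], gt[2], gt[4], ("dog", "on", "table")]

print(recall_at_k(preds, gt, k=5))  # 0.8
print(verdict(0.85, 0.0))           # supported
print(verdict(0.30, 0.30))          # falsified
```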

Figures

Figures reproduced from arXiv: 2604.07859 by Chenxing Li, Han Jiao, Mingquan Lu, Weiyao Lin, Xiaoming Tao, Yiping Duan.

Figure 1: Illustration of the hierarchical multimodal semantic communication framework. (a) Traditional pixel-oriented coding suffers from global degradation under low channel capacities, causing downstream task failures. (b) Embodied intelligence exhibits hierarchical semantic demands (Object > Relation > Attribute) lacking prioritized protection in standard codecs. (c) Our framework directly decodes a structured O…
Figure 2: Block diagram of the proposed multimodal hierarchical semantic communication system. Multimodal inputs are fused and explicitly decoupled into prioritized object, relation, and attribute streams. An adaptive gating mechanism selectively compresses and transmits these streams based on instantaneous CSI. Finally, a cascaded receiver directly reconstructs the structured semantic graph for downstream embodied …
Figure 3: Block diagram of the bandwidth-adaptive hierarchical transmission. Governed by a policy controller π(β), the system dynamically computes an optimal transmission mask M based on channel bandwidth. This multiplexes prioritized semantic streams (Object, Relation, Attribute) into a variable-length sequence x, ensuring robust preservation of core survival semantics under deep fading.
Figure 4: Comprehensive rate-semantic performance under varying SNR conditions. Relation Recall@50 (top row) and Object Recall@10 (bottom row) versus total and image-specific data rates (kbps). Our proposed framework maintains high semantic fidelity in the extreme low-bandwidth regime (1–3 kbps), significantly outperforming digital baselines and DeepJSCC, which suffer severe performance drops.
Figure 5: System robustness and graceful degradation under varying channel conditions. Relation Recall@20 (top row) and Object Recall@10 (bottom row) versus SNR under varying CBR constraints. While traditional baselines suffer severe “cliff effects” below 8 dB, our adaptive mechanism ensures graceful degradation, robustly preserving foundational object semantics even in deep fading (SNR ≤ 4 dB).
Figure 6: Qualitative comparison of multimodal O-A-R Graph generation under varying SNRs. Relying on intermediate pixel reconstruction, baselines suffer severe “cliff effects,” yielding fragmented graphs with missing entities (red nodes). By directly decoding semantics, our framework exhibits graceful degradation, robustly preserving core object anchors (green nodes) even under deep fading channels (5–6 dB) where tr…
Figure 7: Zero-shot VQA accuracy across different cognitive question types
Original abstract

Traditional video coding (VVC, HEVC) prioritizes human visual perception, transmitting substantial texture redundancy that severely hinders machine decision-making under constrained bandwidths. In dynamic channels, this redundancy causes severe “cliff effects” and prohibitive latency. To address this, we propose a robust multimodal semantic communication framework based on an adaptive Object-Attribute-Relation (O-A-R) hierarchy. Bypassing pixel-level reconstruction entirely, our framework directly fuses visual, textual, and audio streams to construct a decision-oriented topological graph. A bandwidth-adaptive strategy dynamically allocates resources by semantic priority, while a cross-modal mechanism leverages text and audio priors to compensate for severe visual degradation. Experimental results demonstrate that under extreme low bandwidths (1-3 kbps), our method achieves over a 90% bandwidth saving (an approximately 10-fold reduction) compared to state-of-the-art digital schemes, maintaining superior scene-graph accuracy. In deep fading channels (SNR ≤ 4 dB), it completely eliminates the cliff effect, ensuring graceful degradation by strictly preserving foundational object anchors even when traditional codecs suffer 100% decoding failure. Coupled with an 89% reduction in end-to-end latency, our framework comprehensively fulfills the real-time survival requirements of embodied agents.
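
The abstract treats "over a 90% bandwidth saving" and "an approximately 10-fold reduction" as the same statement. The arithmetic behind that equivalence is shown below as a small sketch; the function name and the 20 kbps baseline rate are hypothetical values chosen only to make the numbers round, not figures from the paper.

```python
def bandwidth_saving(baseline_kbps, proposed_kbps):
    """Return (fractional saving, reduction factor) of proposed vs. baseline rate."""
    if baseline_kbps <= 0 or proposed_kbps <= 0:
        raise ValueError("rates must be positive")
    saving = 1.0 - proposed_kbps / baseline_kbps  # share of the baseline rate avoided
    factor = baseline_kbps / proposed_kbps        # how many times smaller the rate is
    return saving, factor


# Hypothetical rates: a 20 kbps baseline against a 2 kbps semantic stream.
saving, factor = bandwidth_saving(baseline_kbps=20.0, proposed_kbps=2.0)
print(f"{saving:.0%} saving, {factor:.0f}x reduction")  # 90% saving, 10x reduction
```

By the same relation, a saving of a fraction s corresponds to a reduction factor of 1 / (1 - s), so any saving above 90% implies a reduction of more than 10-fold.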

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes an Object-Attribute-Relation (O-A-R) model-driven adaptive hierarchical transmission framework for multimodal semantic communication. It fuses visual, textual, and audio streams into a decision-oriented topological graph, bypassing pixel-level reconstruction, and employs bandwidth-adaptive priority allocation plus cross-modal compensation for low-bandwidth and deep-fading channels. The central claims are >90% bandwidth savings (10-fold reduction) at 1-3 kbps with superior scene-graph accuracy, complete elimination of cliff effects at SNR ≤4 dB via preservation of object anchors, and 89% end-to-end latency reduction relative to traditional codecs such as VVC/HEVC.

Significance. If the O-A-R graph sufficiency and cross-modal compensation claims hold under rigorous testing, the work could meaningfully advance semantic communication for embodied agents by prioritizing task-relevant structure over perceptual texture. No machine-checked proofs, reproducible code releases, or parameter-free derivations are described.

major comments (3)
  1. [Abstract] Abstract: the quantitative claims (90% bandwidth saving, 10-fold reduction, 89% latency reduction, complete cliff-effect elimination) are presented without any reference to datasets, baselines, channel models, or error bars; this renders the central performance assertions unevaluable from the provided text.
  2. [Proposed Method] Proposed framework (O-A-R construction and priority allocation): the load-bearing assumption that the fused topological graph plus text/audio priors retain all information necessary for downstream machine decision-making is stated but not supported by ablation studies on information loss, relational fidelity, or task failure rates when fine-grained dynamics are discarded.
  3. [Experiments] Experimental validation: no concrete baselines (specific digital schemes), metrics beyond scene-graph accuracy, or SNR/bandwidth operating points are detailed; without these the comparison to 'state-of-the-art digital schemes' and the '100% decoding failure' claim cannot be assessed.
minor comments (1)
  1. [Abstract] Abstract: 'cliff effect' and 'foundational object anchors' are used without explicit definition in the context of the O-A-R graph.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments that highlight opportunities to strengthen the clarity and evaluability of our work. We address each major comment point by point below and will revise the manuscript to incorporate the suggested improvements.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the quantitative claims (90% bandwidth saving, 10-fold reduction, 89% latency reduction, complete cliff-effect elimination) are presented without any reference to datasets, baselines, channel models, or error bars; this renders the central performance assertions unevaluable from the provided text.

    Authors: We agree that the abstract would benefit from grounding the claims in the evaluation context. In the revised manuscript, we will add concise references to the datasets employed for multimodal scene-graph construction, the specific digital baselines (VVC and HEVC), the channel models (including Rayleigh fading for deep-fading cases), and a note that reported figures are averages over repeated trials with error bars. These additions will be kept brief to respect abstract length constraints while directing readers to the full experimental details.
    revision: yes

  2. Referee: [Proposed Method] Proposed framework (O-A-R construction and priority allocation): the load-bearing assumption that the fused topological graph plus text/audio priors retain all information necessary for downstream machine decision-making is stated but not supported by ablation studies on information loss, relational fidelity, or task failure rates when fine-grained dynamics are discarded.

    Authors: The O-A-R hierarchy is motivated by the observation that embodied-agent decision tasks depend primarily on semantic structure rather than pixel-level texture. We acknowledge that explicit empirical validation of this assumption strengthens the paper. In the revision, we will incorporate ablation studies that quantify information loss (via task-specific decision accuracy), relational fidelity (relation prediction precision), and failure rates when fine-grained dynamics are omitted, with and without cross-modal text/audio compensation.
    revision: yes

  3. Referee: [Experiments] Experimental validation: no concrete baselines (specific digital schemes), metrics beyond scene-graph accuracy, or SNR/bandwidth operating points are detailed; without these the comparison to 'state-of-the-art digital schemes' and the '100% decoding failure' claim cannot be assessed.

    Authors: We will expand the experiments section to list concrete baselines (VVC and HEVC with their standard encoding parameters), include additional metrics such as downstream task success rate and latency, and explicitly tabulate the SNR/bandwidth operating points (1-3 kbps, at SNR ≤ 4 dB and above). The '100% decoding failure' statement refers to cases in which traditional codecs produce no decodable output under deep fading, yielding zero machine-task accuracy; we will support this with the corresponding simulation results and error bars.
    revision: yes

Circularity Check

0 steps flagged

No circularity: O-A-R framework is a constructive proposal validated by experiments

full rationale

The paper proposes a novel multimodal semantic communication system that constructs a decision-oriented O-A-R topological graph from fused visual/textual/audio inputs, then applies priority-based adaptive transmission and cross-modal compensation. All central claims (90% bandwidth reduction at 1-3 kbps, elimination of cliff effect at SNR ≤ 4 dB, 89% latency cut) are presented as outcomes of this new construction and are supported by direct experimental comparisons to VVC/HEVC baselines. No step reduces a claimed prediction or uniqueness result to a fitted parameter, self-citation, or definitional renaming; the graph sufficiency for downstream tasks is an explicit modeling assumption rather than a derived tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the unverified premise that an O-A-R graph suffices for decision-making and that cross-modal compensation works under extreme degradation; no free parameters or invented physical entities are identifiable from the abstract.

axioms (2)
  • domain assumption: An Object-Attribute-Relation topological graph extracted from multimodal streams contains sufficient information for downstream machine decision-making without pixel-level reconstruction.
    This assumption underpins the decision to bypass traditional video coding entirely.
  • domain assumption: Text and audio priors can compensate for severe visual signal degradation in a way that preserves foundational object anchors.
    Invoked to explain graceful degradation in deep fading channels.

pith-pipeline@v0.9.0 · 5534 in / 1298 out tokens · 49625 ms · 2026-05-10T18:07:01.617549+00:00 · methodology

