Object-Attribute-Relation Model Driven Adaptive Hierarchical Transmission for Multimodal Semantic Communication
Pith reviewed 2026-05-10 18:07 UTC · model grok-4.3
The pith
An Object-Attribute-Relation graph fuses multimodal data for transmission that cuts bandwidth by over 90 percent at 1-3 kbps while eliminating cliff effects in fading channels.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The framework builds an adaptive Object-Attribute-Relation hierarchy that fuses modalities into a topological graph, transmits only priority-ranked semantic elements, and compensates for visual loss with text and audio priors. The paper reports over 90 percent bandwidth reduction at 1-3 kbps with superior scene-graph accuracy, removal of the cliff effect at SNR of 4 dB or below through strict preservation of object anchors, and an 89 percent latency reduction compared with state-of-the-art digital codecs.
What carries the argument
The Object-Attribute-Relation (O-A-R) topological graph, which fuses multimodal inputs into a hierarchical structure that encodes semantic priorities for adaptive bandwidth allocation and cross-modal compensation.
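A minimal sketch of what priority-ranked selection over such a graph could look like, assuming a simple greedy fill of a bit budget. The `Element` type, the `TIER` ordering, the bit sizes, and the function name `select_for_budget` are all hypothetical; the paper's text only states that object anchors are strictly preserved, so the relative order of relations and attributes here is an assumption.

```python
from dataclasses import dataclass

# Hypothetical priority tiers: lower number is sent first.
# Only "objects first" is grounded in the source; the rest is assumed.
TIER = {"object": 0, "relation": 1, "attribute": 2}


@dataclass(frozen=True)
class Element:
    kind: str   # "object", "relation", or "attribute"
    label: str  # human-readable name of the semantic element
    bits: int   # illustrative encoded size, not a measured value


def select_for_budget(elements, budget_bits):
    """Greedily fill the budget tier by tier, objects first."""
    chosen, used = [], 0
    for el in sorted(elements, key=lambda e: (TIER[e.kind], e.bits)):
        if used + el.bits <= budget_bits:
            chosen.append(el)
            used += el.bits
    return chosen, used


scene = [
    Element("object", "cup", 40),
    Element("object", "table", 40),
    Element("relation", "cup-on-table", 30),
    Element("attribute", "cup.color=red", 25),
    Element("attribute", "table.material=wood", 35),
]

for budget in (80, 120, 200):
    chosen, used = select_for_budget(scene, budget)
    print(budget, used, [e.label for e in chosen])
```

With the smallest budget only the two object anchors survive, and relations and attributes are added as the budget grows. The sketch does not check that a relation's endpoints were themselves selected, which a real scheme would need to enforce.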
If this is right
- Machines receive decision-critical data at roughly one-tenth the bandwidth of conventional video streams while retaining higher scene-graph accuracy.
- Performance degrades gradually rather than collapsing when channel conditions worsen, because object anchors remain protected.
- End-to-end latency falls by 89 percent, enabling real-time responses for embodied agents that current codecs cannot support.
- Text and audio streams directly substitute for missing visual detail, removing the need for separate high-rate visual codecs.
Where Pith is reading between the lines
- The same graph structure could support other machine-perception pipelines such as action recognition or navigation planning if the completeness assumption extends beyond scene graphs.
- Communication protocols for future AI networks might standardize on semantic graphs rather than compressed pixels for machine-to-machine links.
- Deployment on physical robots would reveal whether the cross-modal compensation introduces domain-specific biases when real sensor noise differs from the tested conditions.
- Integration with existing knowledge-base systems could let the transmitted graph feed directly into higher-level reasoning modules without intermediate decoding.
Load-bearing premise
The Object-Attribute-Relation graph constructed from fused modalities contains every piece of information required for accurate downstream machine decisions, and cross-modal text and audio priors can fill visual gaps without introducing errors that matter to those decisions.
What would settle it
Run a side-by-side test at 2 kbps bandwidth in a 3 dB SNR fading channel: if the proposed system sustains scene-graph accuracy above 80 percent and task completion for an embodied agent while VVC or HEVC drops to zero success, the claim is supported; equal or worse accuracy would falsify it.
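The decision rule in this test can be written out explicitly. The sketch below is one reading of it, not a procedure from the paper: the function name `verdict`, its parameters, and the "inconclusive" outcome for runs that meet neither condition are assumptions added for illustration, and the input numbers are made up.

```python
def verdict(proposed_acc, proposed_completes_task,
            codec_acc, codec_success_rate, acc_floor=0.80):
    """Classify one side-by-side run at a fixed operating point
    (for example 2 kbps and 3 dB SNR)."""
    # Equal or worse scene-graph accuracy than the codec falsifies the claim.
    if proposed_acc <= codec_acc:
        return "falsified"
    # Support requires accuracy above the floor, task completion,
    # and zero task success for the conventional codec.
    if (proposed_acc > acc_floor and proposed_completes_task
            and codec_success_rate == 0.0):
        return "supported"
    # Assumed third outcome: better than the codec but short of the bar.
    return "inconclusive"


# Illustrative inputs only; these are not results from the paper.
print(verdict(0.86, True, 0.00, 0.0))
print(verdict(0.40, True, 0.55, 0.2))
print(verdict(0.70, True, 0.10, 0.0))
```

Making the middle case explicit matters because a run can beat the codec on accuracy and still fall short of the 80 percent floor, which neither supports nor falsifies the claim as stated.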
Original abstract
Traditional video coding (VVC, HEVC) prioritizes human visual perception, transmitting substantial texture redundancy that severely hinders machine decision-making under constrained bandwidths. In dynamic channels, this redundancy causes severe "cliff effects" and prohibitive latency. To address this, we propose a robust multimodal semantic communication framework based on an adaptive Object-Attribute-Relation (O-A-R) hierarchy. Bypassing pixel-level reconstruction entirely, our framework directly fuses visual, textual, and audio streams to construct a decision-oriented topological graph. A bandwidth-adaptive strategy dynamically allocates resources by semantic priority, while a cross-modal mechanism leverages text and audio priors to compensate for severe visual degradation. Experimental results demonstrate that under extreme low bandwidths (1-3 kbps), our method achieves over a 90% bandwidth saving (an approximately 10-fold reduction) compared to state-of-the-art digital schemes, maintaining superior scene-graph accuracy. In deep fading channels (SNR ≤ 4 dB), it completely eliminates the cliff effect, ensuring graceful degradation by strictly preserving foundational object anchors even when traditional codecs suffer 100% decoding failure. Coupled with an 89% reduction in end-to-end latency, our framework comprehensively fulfills the real-time survival requirements of embodied agents.
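The abstract pairs "over a 90% bandwidth saving" with "an approximately 10-fold reduction". The short derivation below shows why the two figures agree. The symbols $R_{\mathrm{ref}}$ (baseline bitrate), $R_{\mathrm{sem}}$ (semantic-scheme bitrate), $s$ (saving), and $k$ (reduction factor) are introduced here for illustration and do not appear in the abstract.

```latex
\begin{align}
s &= 1 - \frac{R_{\mathrm{sem}}}{R_{\mathrm{ref}}}, \\
k &= \frac{R_{\mathrm{ref}}}{R_{\mathrm{sem}}} = \frac{1}{1 - s}, \\
s = 0.90 &\;\Rightarrow\; k = 10, \qquad s > 0.90 \;\Rightarrow\; k > 10 .
\end{align}
```

If the 1-3 kbps range is read as the proposed system's own rate, a saving above 90% would imply a matched baseline above 10-30 kbps. The abstract does not say which rate the range refers to, so this reading is an assumption.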
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes an Object-Attribute-Relation (O-A-R) model-driven adaptive hierarchical transmission framework for multimodal semantic communication. It fuses visual, textual, and audio streams into a decision-oriented topological graph, bypassing pixel-level reconstruction, and employs bandwidth-adaptive priority allocation plus cross-modal compensation for low-bandwidth and deep-fading channels. The central claims are >90% bandwidth savings (10-fold reduction) at 1-3 kbps with superior scene-graph accuracy, complete elimination of cliff effects at SNR ≤ 4 dB via preservation of object anchors, and 89% end-to-end latency reduction relative to traditional codecs such as VVC/HEVC.
Significance. If the O-A-R graph sufficiency and cross-modal compensation claims hold under rigorous testing, the work could meaningfully advance semantic communication for embodied agents by prioritizing task-relevant structure over perceptual texture. No machine-checked proofs, reproducible code releases, or parameter-free derivations are described.
major comments (3)
- [Abstract] Abstract: the quantitative claims (90% bandwidth saving, 10-fold reduction, 89% latency reduction, complete cliff-effect elimination) are presented without any reference to datasets, baselines, channel models, or error bars; this renders the central performance assertions unevaluable from the provided text.
- [Proposed Method] Proposed framework (O-A-R construction and priority allocation): the load-bearing assumption that the fused topological graph plus text/audio priors retain all information necessary for downstream machine decision-making is stated but not supported by ablation studies on information loss, relational fidelity, or task failure rates when fine-grained dynamics are discarded.
- [Experiments] Experimental validation: no concrete baselines (specific digital schemes), metrics beyond scene-graph accuracy, or SNR/bandwidth operating points are detailed; without these the comparison to 'state-of-the-art digital schemes' and the '100% decoding failure' claim cannot be assessed.
minor comments (1)
- [Abstract] Abstract: 'cliff effect' and 'foundational object anchors' are used without explicit definition in the context of the O-A-R graph.
Simulated Author's Rebuttal
We thank the referee for the constructive comments that highlight opportunities to strengthen the clarity and evaluability of our work. We address each major comment point by point below and will revise the manuscript to incorporate the suggested improvements.
- Referee: [Abstract] Abstract: the quantitative claims (90% bandwidth saving, 10-fold reduction, 89% latency reduction, complete cliff-effect elimination) are presented without any reference to datasets, baselines, channel models, or error bars; this renders the central performance assertions unevaluable from the provided text.
  Authors: We agree that the abstract would benefit from grounding the claims in the evaluation context. In the revised manuscript, we will add concise references to the datasets employed for multimodal scene-graph construction, the specific digital baselines (VVC and HEVC), the channel models (including Rayleigh fading for deep-fading cases), and a note that reported figures are averages over repeated trials with error bars. These additions will be kept brief to respect abstract length constraints while directing readers to the full experimental details.
  Revision: yes
- Referee: [Proposed Method] Proposed framework (O-A-R construction and priority allocation): the load-bearing assumption that the fused topological graph plus text/audio priors retain all information necessary for downstream machine decision-making is stated but not supported by ablation studies on information loss, relational fidelity, or task failure rates when fine-grained dynamics are discarded.
  Authors: The O-A-R hierarchy is motivated by the observation that embodied-agent decision tasks depend primarily on semantic structure rather than pixel-level texture. We acknowledge that explicit empirical validation of this assumption strengthens the paper. In the revision, we will incorporate ablation studies that quantify information loss (via task-specific decision accuracy), relational fidelity (relation prediction precision), and failure rates when fine-grained dynamics are omitted, with and without cross-modal text/audio compensation.
  Revision: yes
- Referee: [Experiments] Experimental validation: no concrete baselines (specific digital schemes), metrics beyond scene-graph accuracy, or SNR/bandwidth operating points are detailed; without these the comparison to 'state-of-the-art digital schemes' and the '100% decoding failure' claim cannot be assessed.
  Authors: We will expand the experiments section to list concrete baselines (VVC and HEVC with their standard encoding parameters), include additional metrics such as downstream task success rate and latency, and explicitly tabulate the SNR/bandwidth operating points (1-3 kbps across SNR ≤ 4 dB and higher). The '100% decoding failure' statement refers to cases in which traditional codecs produce no decodable output under deep fading, yielding zero machine-task accuracy; we will support this with the corresponding simulation results and error bars.
  Revision: yes
Circularity Check
No circularity: O-A-R framework is a constructive proposal validated by experiments
The paper proposes a novel multimodal semantic communication system that constructs a decision-oriented O-A-R topological graph from fused visual/textual/audio inputs, then applies priority-based adaptive transmission and cross-modal compensation. All central claims (90% bandwidth reduction at 1-3 kbps, elimination of cliff effect at SNR ≤ 4 dB, 89% latency cut) are presented as outcomes of this new construction and are supported by direct experimental comparisons to VVC/HEVC baselines. No step reduces a claimed prediction or uniqueness result to a fitted parameter, self-citation, or definitional renaming; the graph sufficiency for downstream tasks is an explicit modeling assumption rather than a derived tautology.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: An Object-Attribute-Relation topological graph extracted from multimodal streams contains sufficient information for downstream machine decision-making without pixel-level reconstruction.
- domain assumption: Text and audio priors can compensate for severe visual signal degradation in a way that preserves foundational object anchors.