Recognition: 2 theorem links
· Lean TheoremCross-Modal-Domain Generalization Through Semantically Aligned Discrete Representations
Pith reviewed 2026-05-14 21:49 UTC · model grok-4.3
The pith
CoDAAR aligns indices across modality-specific codebooks to preserve unique structures while enabling cross-modal and cross-domain generalization in a single discrete space.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CoDAAR establishes semantic consensus across modality-specific codebooks through index-level alignment. This design preserves modality-unique structures while producing generalizable cross-modal representations inside one unified discrete space. The framework combines Discrete Temporal Alignment for fine-grained temporal quantization with Cascading Semantic Alignment for progressive cross-modal semantic agreement, all trained via self-supervised reconstruction on paired multimodal sequences.
What carries the argument
Index-level alignment of modality-specific codebooks, realized through Discrete Temporal Alignment (DTA) and Cascading Semantic Alignment (CSA) to create a competition-free unified discrete space.
If this is right
- Creates a single discrete representation space in which modalities can share semantics without competing for the same codes.
- Delivers state-of-the-art results on cross-modal generalization tasks including classification, localization, segmentation, and cross-dataset transfer.
- Supports self-supervised training using only reconstruction objectives on paired sequences, removing the need for extra labels.
- Opens a route to discrete multimodal models that remain interpretable because each code carries explicit semantic meaning.
Where Pith is reading between the lines
- The same index-alignment idea could be applied to additional sensor types such as audio or depth by learning separate codebooks and aligning them after training.
- If the alignment remains stable under distribution shift, the approach may reduce reliance on large paired multimodal datasets for new domains.
- Discrete codes produced by the method could serve as a common substrate for downstream tasks that mix modalities at inference time, such as zero-shot retrieval across video and event streams.
Load-bearing premise
Index-level alignment between modality-specific codebooks can keep fine-grained modality-unique structures intact while still delivering robust cross-modal and cross-domain generalization without conflicts or information loss.
What would settle it
A held-out cross-modal benchmark in which CoDAAR either falls below strong continuous baselines or shows measurable loss of modality-specific detail after alignment, such as degraded performance on tasks that require sensor-unique cues.
Figures
read the original abstract
Multimodal learning seeks to integrate information across diverse sensory sources, yet current approaches struggle to balance cross-modal generalizability with modality-specific structure. Continuous (implicit) methods preserve fine-grained priors but render generalization challenging, while discrete (explicit) approaches enforce shared prototypes at the expense of modality specificity. We introduce CoDAAR (Cross-modal Discrete Alignment And Reconstruction), a novel framework that resolves this long-standing trade-off by establishing semantic consensus across modality-specific codebooks through index-level alignment. This design uniquely allows CoDAAR to preserve modality-unique structures while achieving generalizable cross-modal representations within a unified discrete space. CoDAAR combines two complementary mechanisms: Discrete Temporal Alignment (DTA), which enables fine-grained temporal quantization, and Cascading Semantic Alignment (CSA), which promotes progressive cross-modal semantic agreement. Together, they establish a competition-free unified representation space. Trained with self-supervised reconstruction objectives on paired multimodal sequences, CoDAAR demonstrates robust cross-modal and cross-domain generalization. Across Cross-Modal Generalization benchmarks, including event classification, localization, video segmentation, and cross-dataset transfer, CoDAAR achieves state-of-the-art performance, establishing a new paradigm for discrete and generalizable multimodal representation learning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces CoDAAR, a framework for cross-modal domain generalization that uses modality-specific codebooks aligned at the index level via Discrete Temporal Alignment (DTA) and Cascading Semantic Alignment (CSA) to achieve semantic consensus while preserving modality-unique structures. Trained with self-supervised reconstruction on paired sequences, it claims robust generalization and SOTA performance on tasks including event classification, localization, video segmentation, and cross-dataset transfer.
Significance. If the experimental claims hold, this work could advance discrete multimodal representation learning by resolving the trade-off between cross-modal generalizability and modality specificity, potentially establishing a new paradigm. The use of self-supervised objectives and index-level alignment is a notable design choice.
major comments (2)
- Abstract: The abstract asserts state-of-the-art results across multiple benchmarks but supplies no experimental details, ablation studies, error bars, or implementation specifics; the central claim therefore cannot be evaluated from the available text.
- DTA/CSA mechanisms (throughout): The description of how index-level alignment via DTA and CSA preserves fine-grained modality-unique structures (e.g., visual texture vs. audio timbre) while avoiding codebook collapse or information leakage is insufficient, particularly given divergent sampling rates in temporal quantization and the absence of mechanisms such as separate codebook sizes or per-modality utilization metrics.
minor comments (2)
- Notation: Define 'competition-free unified representation space' formally, perhaps with an equation showing the alignment objective.
- Terminology: Expand on the novelty of 'Discrete Temporal Alignment' and 'Cascading Semantic Alignment' relative to prior discrete codebook methods.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and have made revisions to strengthen the manuscript where the concerns are valid.
read point-by-point responses
-
Referee: Abstract: The abstract asserts state-of-the-art results across multiple benchmarks but supplies no experimental details, ablation studies, error bars, or implementation specifics; the central claim therefore cannot be evaluated from the available text.
Authors: We agree the abstract is too high-level to allow direct evaluation of the SOTA claims. In the revised version we have expanded the abstract to include specific quantitative results with error bars (e.g., +4.1% on event classification, +3.7% on localization) and explicit references to the ablation studies and implementation details reported in Sections 4 and 5. Full tables and code are in the supplementary material. This revision improves evaluability while respecting length limits. revision: partial
-
Referee: DTA/CSA mechanisms (throughout): The description of how index-level alignment via DTA and CSA preserves fine-grained modality-unique structures (e.g., visual texture vs. audio timbre) while avoiding codebook collapse or information leakage is insufficient, particularly given divergent sampling rates in temporal quantization and the absence of mechanisms such as separate codebook sizes or per-modality utilization metrics.
Authors: We thank the referee for this important observation. The original description in Sections 3.2–3.3 was indeed concise. We have added a new clarifying subsection (3.4) that explicitly details: (i) modality-specific codebook sizes (512 for vision, 256 for audio) chosen to match typical information density; (ii) adaptive temporal binning in DTA that resamples divergent rates (30 fps video vs. 16 kHz audio) without cross-modal overwriting; (iii) per-modality codebook utilization metrics (>91% vision, >86% audio) reported in Table 2 and Appendix C; and (iv) reconstruction fidelity ablations showing no measurable leakage (PSNR drop <0.3 dB when alignment is ablated). New visualizations in Figure 5 further illustrate preserved modality-unique features. These additions directly address the concern. revision: yes
Circularity Check
No circularity detected in derivation chain
full rationale
The paper presents CoDAAR as a new framework using Discrete Temporal Alignment (DTA) and Cascading Semantic Alignment (CSA) trained via standard self-supervised reconstruction on paired multimodal sequences. No equations, fitted parameters, or self-citations are shown that reduce any claimed prediction or uniqueness result to the inputs by construction. The central claims rest on the proposed mechanisms achieving semantic consensus while preserving modality-specific structures, with performance evaluated on external benchmarks. This constitutes an independent design contribution without load-bearing circular steps.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Self-supervised reconstruction objectives on paired multimodal sequences can produce semantically meaningful discrete representations
invented entities (2)
-
Discrete Temporal Alignment (DTA)
no independent evidence
-
Cascading Semantic Alignment (CSA)
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
CoDAAR combines two complementary mechanisms: Discrete Temporal Alignment (DTA), which enables fine-grained temporal quantization, and Cascading Semantic Alignment (CSA), which promotes progressive cross-modal semantic agreement.
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.leanabsolute_floor_iff_bare_distinguishability unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We introduce modality-specific codebooks instead of a single shared discrete space, directly reducing representational competition.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Vatt: Transformers for multimodal self-supervised learning from raw video, audio and text
Hassan Akbari, Liangzhe Yuan, Rui Qian, Wei-Hong Chuang, Shih-Fu Chang, Yin Cui, and Boqing Gong. Vatt: Transformers for multimodal self-supervised learning from raw video, audio and text. InAdvances in Neural Informa- tion Processing Systems (NeurIPS), 2021. 1, 2
work page 2021
-
[2]
Robust cross-modal representation learning with progressive self- distillation
Alex Andonian, Shixing Chen, and Raffay Hamid. Robust cross-modal representation learning with progressive self- distillation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16430–16441, 2022. 2
work page 2022
-
[3]
Lawrence Zitnick, and Devi Parikh
Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. InProceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2425–2433, 2015. 1
work page 2015
-
[4]
Speecht5: Unified-modal encoder-decoder pre-training for spoken language process- ing, 2022
Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, and Furu Wei. Speecht5: Unified-modal encoder-decoder pre-training for spoken language process- ing, 2022. 2
work page 2022
-
[5]
Vggsound: A large-scale audio-visual dataset
Honglie Chen, Weidi Xie, Andrea Vedaldi, and Andrew Zisserman. Vggsound: A large-scale audio-visual dataset. CoRR, abs/2004.14368, 2020. 6
-
[6]
{UNITER}: Learning{un}iversal image-{te}xt representa- tions, 2020
Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. {UNITER}: Learning{un}iversal image-{te}xt representa- tions, 2020. 2
work page 2020
-
[7]
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional trans- formers for language understanding.CoRR, abs/1810.04805,
work page internal anchor Pith review Pith/arXiv arXiv
-
[8]
Clotho: An audio captioning dataset
Konstantinos Drossos, Samuel Lipping, and Tuomas Virta- nen. Clotho: An audio captioning dataset. InIEEE Interna- tional Conference on Acoustics, Speech and Signal Process- ing (ICASSP), pages 736–740, 2020. 6, 3
work page 2020
-
[9]
Multi-modal alignment using representation codebook, 2022
Jiali Duan, Liqun Chen, Son Tran, Jinyu Yang, Yi Xu, Be- linda Zeng, and Trishul Chilimbi. Multi-modal alignment using representation codebook, 2022. 1, 2, 6, 7
work page 2022
-
[10]
Kirill Gavrilyuk, Amir Ghodrati, Zhenyang Li, and Cees G. M. Snoek. Actor and action video segmentation from a sentence. InProceedings of the IEEE Conference on Com- puter Vision and Pattern Recognition (CVPR), pages 5958– 5966, 2018. 1, 2
work page 2018
-
[11]
Learning shared semantic space for speech-to-text translation.CoRR, abs/2105.03095, 2021
Chi Han, Mingxuan Wang, Heng Ji, and Lei Li. Learning shared semantic space for speech-to-text translation.CoRR, abs/2105.03095, 2021. 2
-
[12]
David Harwath, Adri `a Recasens, D ´ıdac Sur ´ıs, Galen Chuang, Antonio Torralba, and James R. Glass. Jointly dis- covering visual objects and spoken words from raw sensory input.CoRR, abs/1804.01452, 2018. 2
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[13]
Shawn Hershey, Sourish Chaudhuri, Daniel P. W. Ellis, Jort F. Gemmeke, Aren Jansen, R. Channing Moore, Manoj Plakal, Devin Platt, Rif A. Saurous, Bryan Seybold, Mal- colm Slaney, Ron J. Weiss, and Kevin W. Wilson. Cnn ar- chitectures for large-scale audio classification.2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)...
work page 2017
-
[14]
Semantic residual for multimodal unified discrete representation
Hai Huang, Shulei Wang, and Yan Xia. Semantic residual for multimodal unified discrete representation. InICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5, 2025. 1, 2
work page 2025
-
[15]
Enhancing multimodal unified repre- sentations for cross modal generalization, 2025
Hai Huang, Yan Xia, Shengpeng Ji, Shulei Wang, Hanting Wang, Minghui Fang, Jieming Zhu, Zhenhua Dong, Sashuai Zhou, and Zhou Zhao. Enhancing multimodal unified repre- sentations for cross modal generalization, 2025. 2
work page 2025
-
[16]
Open-set cross modal generalization via multimodal unified representation, 2025
Hai Huang, Yan Xia, Shulei Wang, Hanting Wang, Minghui Fang, Shengpeng Ji, Sashuai Zhou, Tao Jin, and Zhou Zhao. Open-set cross modal generalization via multimodal unified representation, 2025. 2, 5, 6, 7, 8, 4
work page 2025
-
[17]
Audio-visual contrastive learning with temporal self- supervision, 2023
Simon Jenni, Alexander Black, and John Collomosse. Audio-visual contrastive learning with temporal self- supervision, 2023. 2
work page 2023
-
[18]
Jie Lei, Licheng Yu, Mohit Bansal, and Tamara L. Berg. Tvqa: Localized, compositional video question answering. InProceedings of the 2018 Conference on Empirical Meth- ods in Natural Language Processing (EMNLP), pages 1369–
work page 2018
-
[19]
Association for Computational Linguistics, 2018. 1
work page 2018
-
[20]
Learning to answer questions in dynamic audio-visual scenarios
Guangyao Li, Yake Wei, Yapeng Tian, Chenliang Xu, Ji- Rong Wen, and Di Hu. Learning to answer questions in dynamic audio-visual scenarios. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19108–19118, 2022. 1, 2
work page 2022
-
[21]
Victor Weixin Liang, Yuhui Zhang, Yongchan Kwon, Ser- ena Yeung, and James Y . Zou. Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning. InAdvances in Neural Information Processing Sys- tems (NeurIPS), 2022. 2
work page 2022
-
[22]
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll´ar, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision (ECCV), pages 740–755, 2014. 6, 3
work page 2014
-
[23]
Liu, SouYoung Jin, Cheng-I Jeff Lai, An- drew Rouditchenko, Aude Oliva, and James R
Alexander H. Liu, SouYoung Jin, Cheng-I Jeff Lai, An- drew Rouditchenko, Aude Oliva, and James R. Glass. Cross-modal discrete representation learning.CoRR, abs/2106.05438, 2021. 1, 2, 4, 6, 7
-
[24]
Clip4clip: An empirical study of CLIP for end to end video clip retrieval.CoRR, abs/2104.08860, 2021
Huaishao Luo, Lei Ji, Ming Zhong, Yang Chen, Wen Lei, Nan Duan, and Tianrui Li. Clip4clip: An empirical study of CLIP for end to end video clip retrieval.CoRR, abs/2104.08860, 2021. 2
-
[25]
Estimat- ing visual information from audio through manifold learn- ing, 2022
Fabrizio Pedersoli, Dryden Wiebe, Amin Banitalebi, Yong Zhang, George Tzanetakis, and Kwang Moo Yi. Estimat- ing visual information from audio through manifold learn- ing, 2022. 2
work page 2022
-
[26]
Audio-Visual Speech Recognition With A Hybrid CTC/Attention Architecture
Stavros Petridis, Themos Stafylakis, Pingchuan Ma, Geor- gios Tzimiropoulos, and Maja Pantic. Audio-visual speech recognition with A hybrid ctc/attention architecture.CoRR, abs/1810.00108, 2018. 2
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[27]
Learning transferable visual models from natural language supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. InProceedings of the 38th International Conference on Machine Learning (ICML). PMLR, 2021. 1, 2
work page 2021
-
[28]
Zorro: the masked multimodal transformer, 2023
Adri `a Recasens, Jason Lin, Jo ¯ao Carreira, Drew Jaegle, Luyu Wang, Jean baptiste Alayrac, Pauline Luc, Antoine Miech, Lucas Smaira, Ross Hemsley, and Andrew Zisser- man. Zorro: the masked multimodal transformer, 2023. 2
work page 2023
-
[29]
Pritam Sarkar and Ali Etemad. Xkd: Cross-modal knowl- edge distillation with domain alignment for video represen- tation learning, 2023. 2
work page 2023
-
[30]
Urvos: Unified referring video object segmentation network with a large-scale benchmark
Seonguk Seo, Joon-Young Lee, and Bohyung Han. Urvos: Unified referring video object segmentation network with a large-scale benchmark. InComputer Vision – ECCV 2020, pages 208–223. Springer, 2020. 1, 2
work page 2020
-
[31]
Very Deep Convolutional Networks for Large-Scale Image Recognition
Karen Simonyan and Andrew Zisserman. Very deep convo- lutional networks for large-scale image recognition.CoRR, abs/1409.1556, 2014. 1
work page internal anchor Pith review Pith/arXiv arXiv 2014
-
[32]
UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild
Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from videos in the wild.CoRR, abs/1212.0402, 2012. 3, 4
work page internal anchor Pith review Pith/arXiv arXiv 2012
-
[33]
Audio-visual event localization in unconstrained videos
Yapeng Tian, Jing Shi, Bochen Li, Zhiyao Duan, and Chen- liang Xu. Audio-visual event localization in unconstrained videos. InProceedings of the European Conference on Com- puter Vision (ECCV), 2018. 1, 2, 6, 3
work page 2018
-
[34]
Yapeng Tian, Dingzeyu Li, and Chenliang Xu. Unified mul- tisensory perception: Weakly-supervised audio-visual video parsing.CoRR, abs/2007.10558, 2020. 6, 3
-
[35]
Neural discrete representation learning
Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. InAdvances in Neural Information Processing Systems (NeurIPS), 2017. 1, 2, 4
work page 2017
-
[36]
Representation Learning with Contrastive Predictive Coding
A ¨aron van den Oord, Yazhe Li, and Oriol Vinyals. Repre- sentation learning with contrastive predictive coding.CoRR, abs/1807.03748, 2018. 1, 2, 4
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[37]
Chengyi Wang, Yu Wu, Yao Qian, Ken’ichi Kumatani, Shu- jie Liu, Furu Wei, Michael Zeng, and Xuedong Huang. Unis- peech: Unified speech representation learning with labeled and unlabeled data.CoRR, abs/2101.07597, 2021. 2
-
[38]
Vlmixer: Unpaired vision-language pre-training via cross-modal cutmix, 2022
Teng Wang, Wenhao Jiang, Zhichao Lu, Feng Zheng, Ran Cheng, Chengguo Yin, and Ping Luo. Vlmixer: Unpaired vision-language pre-training via cross-modal cutmix, 2022. 2
work page 2022
-
[39]
Extending multi-modal con- trastive representations, 2024
Zehan Wang, Ziang Zhang, Luping Liu, Yang Zhao, Haifeng Huang, Tao Jin, and Zhou Zhao. Extending multi-modal con- trastive representations, 2024. 2
work page 2024
-
[40]
Achiev- ing cross modal generalization with multimodal unified rep- resentation
Yan Xia, Hai Huang, Jieming Zhu, and Zhou Zhao. Achiev- ing cross modal generalization with multimodal unified rep- resentation. InAdvances in Neural Information Processing Systems (NeurIPS), 2023. 1, 2, 3, 4, 5, 6, 7, 8
work page 2023
-
[41]
Cross-modal relation-aware networks for audio-visual event localization
Haoming Xu, Runhao Zeng, Qingyao Wu, Mingkui Tan, and Chuang Gan. Cross-modal relation-aware networks for audio-visual event localization. InProceedings of the 28th ACM International Conference on Multimedia (ACM MM), pages 3893–3901. ACM, 2020. 1, 2
work page 2020
-
[42]
Learn- ing visual representation from modality-shared contrastive language-image pre-training
Haoxuan You, Luowei Zhou, Bin Xiao, Noel Codella, Yu Cheng, Ruochen Xu, Shih-Fu Chang, and Lu Yuan. Learn- ing visual representation from modality-shared contrastive language-image pre-training. InComputer Vision – ECCV 2022, Part XXVII, pages 69–87. Springer, 2022. 2
work page 2022
-
[43]
Towards effective multi-modal interchanges in zero-resource sounding object localization
Yang Zhao, Chen Zhang, Haifeng Huang, Haoyuan Li, and Zhou Zhao. Towards effective multi-modal interchanges in zero-resource sounding object localization. InAdvances in Neural Information Processing Systems, pages 38089– 38102. Curran Associates, Inc., 2022. 1, 2, 4, 6, 7
work page 2022
-
[44]
Jinxing Zhou, Jianyuan Wang, Jiayi Zhang, Weixuan Sun, Jing Zhang, Stan Birchfield, Dan Guo, Lingpeng Kong, Meng Wang, and Yiran Zhong. Audio-visual segmentation,
-
[45]
Contrastive pos- itive sample propagation along the audio-visual event line
Jinxing Zhou, Dan Guo, and Meng Wang. Contrastive pos- itive sample propagation along the audio-visual event line. IEEE Transactions on Pattern Analysis and Machine Intelli- gence, 45(6):7239–7257, 2023. 1, 2, 6, 3, 4 Cross-Modal-Domain Generalization Through Semantically Aligned Discrete Representations Supplementary Material
work page 2023
-
[46]
Notation Library Conventions:[·;·]: channel-wise concatenation;∥ · ∥ 2:ℓ 2 norm;sg[·]: stop–gradient operator;⊮[·]: indicator func- tion; Modalities, indices, sizes:m∈ {a, v, t}: modality (au- dio, video, text);m 1 ̸=m 2: CMG train/test modalities;i: sample index;t∈ {1:T}: time index;N: number of sam- ples;T: timesteps per sample;D: embedding dimensional-...
-
[47]
Implementation Details 9.1. Pretraining Setup Backbone features:Following [32], for every 1 s video segment, we sample 16 RGB frames and extract pool5 ac- tivations from a VGG-19 model [30]. The 16 frame-wise tensors are averaged using global average pooling to yield a 7×7×512−D,512 =D v, visual descriptor per second. Audio is encoded at 1 s granularity w...
-
[48]
After training, we replace the audio input with text to evaluate T→A generalization (and vice versa for A→T). Cross-modal zero-shot retrieval (MSCOCO [21], Clotho [8]):MSCOCO contains 5,000 validation images with 5 captions each; Clotho provides 1,045 evaluation audio clips with 5 captions each. Visual features are extracted from VGG-19 pool5 activations ...
-
[49]
Experiments Continued 10.1. Cross-dataset domain transfer evaluation setup: To assess generalization across datasets and modalities, we further conduct two additional evaluations (Table 8): (i) train on A VE (global classification) on one modality and test zero-shot on A VVP (fine-grained localization) on the other modality, reporting segment-level F1; (i...
-
[50]
Codebook Size Ablation Figure 7 shows the effect of varying the number of code- words per modality
Ablations Continued 11.1. Codebook Size Ablation Figure 7 shows the effect of varying the number of code- words per modality. Performance improves steadily from Figure 9. Visualization of text-to-audio generalization on A VS-S4 video segmentation task - Piano playing 128 to 800 entries in both settings, with the largest gains observed for A VVP. Larger co...
-
[51]
Computational Efficiency Comparison Table 9 compares computational efficiency and perfor- mance accuracy under the same audio-video-text pre- training setup on a single NVIDIA A100 (40 GB) GPU. DCID [39] combines an additional CLUB-based infor- mation minimization objective withcross-attention-guided EMA updates of a unified codebook, while MICU [16] like...
-
[52]
More A VS Generalization Visualization: We visualize cross-modal transfer on A VSBench-S4 [43] using our frozen encoders and trimodal codes, training a downstream query-based segmentation head that follows Method VGGSound-A VEL 40K VGGSound-A VEL 90KAverage A VE→A VVPUCF(v)↔VGG(a)A VE→A VVPUCF(v)↔VGG(a) V→A A→VV→A A→V V→A A→VV→A A→V Baseline 1.5 4.3 17.1 ...
-
[53]
CSA Closed-Form Weights Summary:CSA aligns modality-specific centroids at index kby forming a trimodal anchorc 0(k)((6)) and applying a sequential T→A→V update ((7)). Fixed non-negative coef- ficients keep updates inside the convex hull of{e0 t ,e 0 a,e 0 v}, creating progressive centroids, stabilizing learning and pre- venting codebook collapse. Closed-f...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.