WQ-Fusion: Dynamic Gated Attention for Cross-Domain Audio Representation

Gongping Huang; Hanchen Pei; Hao Zhang; Jacob Benesty; Jingdong Chen; Lei Ding; Mingda Lin; Tiantian Xiong; Xinyue Zhou

arxiv: 2606.26556 · v1 · pith:WJBFEZQDnew · submitted 2026-06-25 · 💻 cs.SD · cs.MM· eess.AS

WQ-Fusion: Dynamic Gated Attention for Cross-Domain Audio Representation

Mingda Lin , Lei Ding , Xinyue Zhou , Tiantian Xiong , Hanchen Pei , Gongping Huang , Hao Zhang , Jingdong Chen

show 1 more author

Jacob Benesty

This is my paper

Pith reviewed 2026-06-26 03:17 UTC · model grok-4.3

classification 💻 cs.SD cs.MMeess.AS

keywords cross-domain audiodual-encoder fusiongated attentionfeature modulationaudio representationwhisperqwen

0 comments

The pith

A dual-encoder system with gated attention fuses Whisper and Qwen features to reach 0.836 on the audio encoder benchmark.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes WQ-Fusion to combine two pre-trained models for audio and language into representations that work across different acoustic domains. It replaces static concatenation with an Adaptive Feature Modulation module and element-wise gated attention so the model can dynamically select relevant acoustic or semantic features from each encoder. The design is tested on the Interspeech 2026 Audio Encoder Capability Challenge Track A, where it records an overall score of 0.836 and exceeds the strongest single-encoder baseline. A sympathetic reader would care because most pre-trained models remain specialized, and a reliable way to route information across domains could reduce the need for separate models per task.

Core claim

By integrating whisper and qwen via an Adaptive Feature Modulation module and a novel element-wise gated attention mechanism, WQ-Fusion enables dynamic feature selection, allowing the model to selectively emphasize relevant acoustic and semantic dimensions and thereby achieve a superior overall score of 0.836 on the Interspeech 2026 Audio Encoder Capability Challenge benchmark.

What carries the argument

Element-wise gated attention mechanism that operates with the Adaptive Feature Modulation module to perform dynamic routing of heterogeneous information between the two encoders.

Load-bearing premise

The performance gain is produced by the dynamic feature selection of the Adaptive Feature Modulation module and gated attention rather than by benchmark-specific tuning or unstated implementation details.

What would settle it

An ablation experiment that removes or replaces the element-wise gated attention with static concatenation and shows the score falling to or below the single-encoder baseline.

Figures

Figures reproduced from arXiv: 2606.26556 by Gongping Huang, Hanchen Pei, Hao Zhang, Jacob Benesty, Jingdong Chen, Lei Ding, Mingda Lin, Tiantian Xiong, Xinyue Zhou.

read the original abstract

While pre-trained models excel in specialized tasks, learning universal representations across diverse acoustic domains remains challenging. To address this, we propose WQ-Fusion, a robust dual-encoder framework for cross-domain audio representation learning. Overcoming the limitations of static concatenation, WQ-Fusion integrates whisper and qwen via an Adaptive Feature Modulation module and a novel element-wise gated attention mechanism. This design enables dynamic feature selection, allowing the model to selectively emphasize relevant acoustic and semantic dimensions. Extensive experiments on the Interspeech 2026 Audio Encoder Capability Challenge (Track A) benchmark demonstrate that by effectively routing heterogeneous information, WQ-Fusion achieves a superior overall score of 0.836, significantly outperforming the strongest single-encoder baseline.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

WQ-Fusion reports a 0.836 benchmark score from fusing Whisper and Qwen but supplies no ablations to show the gated attention drives the gain rather than capacity or tuning.

read the letter

The main takeaway is that this paper describes a dual-encoder setup called WQ-Fusion that adds an Adaptive Feature Modulation module and element-wise gated attention on top of Whisper and Qwen, then claims a 0.836 overall score on the Interspeech 2026 Track A benchmark. The design is presented as enabling dynamic feature selection across acoustic and semantic dimensions.

What is actually new is the concrete choice of element-wise gating for this particular pair of encoders. It is a direct, practical extension of existing attention-based fusion methods rather than a new theoretical framework. The paper does a reasonable job of stating the problem with static concatenation and outlining how the gated path could address heterogeneous inputs.

The result itself is a single reported number that beats the strongest single-encoder baseline. That gives a clear comparison point for anyone already running the same benchmark.

The soft spot is exactly the isolation problem flagged in the stress test. No ablation replaces the gated attention with a simpler static combination while holding encoders, data, and optimizer fixed. Without that control, the margin cannot be confidently assigned to dynamic routing instead of extra parameters or benchmark-specific choices. The abstract also omits methods, error bars, and dataset details, so the central claim stays unevaluated from the given text.

This work is aimed at audio representation researchers who already experiment with multi-encoder fusion. A reader looking for an architecture sketch might extract the gating idea as a starting point, but the lack of controls makes the result hard to trust or build on.

I would not bring it to reading group. It does not merit peer review in its current state.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes WQ-Fusion, a dual-encoder framework combining Whisper and Qwen via an Adaptive Feature Modulation module and element-wise gated attention for dynamic feature selection in cross-domain audio representations. It reports an overall score of 0.836 on the Interspeech 2026 Audio Encoder Capability Challenge (Track A) benchmark, outperforming the strongest single-encoder baseline by effectively routing heterogeneous information.

Significance. If the performance gain can be isolated to the proposed dynamic routing components, the work could advance universal audio encoders by addressing static fusion limitations. The approach suggests a pathway for selective emphasis of acoustic and semantic dimensions across pre-trained models.

major comments (2)

[Abstract] Abstract: The headline result (0.836 score) is stated without any methods description, ablation studies, error bars, dataset details, or training protocol, so the central empirical claim cannot be evaluated or attributed to the Adaptive Feature Modulation and gated attention.
[Abstract] Abstract: No ablation isolates the element-wise gated attention (e.g., static concatenation or mean pooling of the same Whisper+Qwen encoders with fixed data, optimizer, and parameter count); the reported margin over single-encoder baselines therefore cannot be confidently assigned to dynamic selection rather than capacity or tuning.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for highlighting issues with the abstract's presentation of results. We will revise the abstract to better contextualize the empirical claims and strengthen the attribution of gains to the proposed components.

read point-by-point responses

Referee: [Abstract] Abstract: The headline result (0.836 score) is stated without any methods description, ablation studies, error bars, dataset details, or training protocol, so the central empirical claim cannot be evaluated or attributed to the Adaptive Feature Modulation and gated attention.

Authors: We agree the abstract is overly concise. In revision we will expand it with a brief methods overview (Adaptive Feature Modulation and element-wise gated attention), benchmark details, and a reference to ablation results and training protocols presented in Sections 3 and 4. Full error bars, dataset splits, and optimizer settings remain in the main text due to abstract length limits. revision: partial
Referee: [Abstract] Abstract: No ablation isolates the element-wise gated attention (e.g., static concatenation or mean pooling of the same Whisper+Qwen encoders with fixed data, optimizer, and parameter count); the reported margin over single-encoder baselines therefore cannot be confidently assigned to dynamic selection rather than capacity or tuning.

Authors: The manuscript already reports comparisons against single-encoder baselines and static fusion variants in Section 4.2. To directly isolate the gated attention contribution under matched conditions, we will add a new controlled ablation (static concatenation and mean pooling of the identical Whisper+Qwen encoders with fixed data, optimizer, and parameter budget) in the revised version. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark result with no derivation chain present

full rationale

The manuscript reports an empirical score of 0.836 on the Interspeech 2026 benchmark as the outcome of training the proposed WQ-Fusion architecture. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central claim is therefore an experimental measurement rather than a mathematical reduction that could collapse to its own inputs by construction; the absence of any derivation chain makes circularity analysis inapplicable.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities can be identified from the provided text.

pith-pipeline@v0.9.1-grok · 5680 in / 929 out tokens · 46675 ms · 2026-06-26T03:17:59.808817+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

48 extracted references · 8 linked inside Pith

[1]

task specificity

Introduction As a fundamental component of audio signal processing sys- tems, audio encoders are widely used across diverse down- stream applications, including speech processing, sound anal- ysis, and music information retrieval (MIR) [1]. From a per- ceptual perspective, the quality of the learned audio represen- tations largely determines the upper bou...

Pith/arXiv arXiv 2026
[2]

Related Work 2.1. Foundational Audio Encoders and Universal Repre- sentations Recent progress in audio representation learning has seen a convergence toward unified single-encoder architectures, each characterized by distinct inductive biases. Frameworks like AudioMAE [6] prioritize reconstruction-based acoustic mod- eling to capture fine-grained textures...
[3]

Method This section details the architectural design of WQ-Fusion, a dual-encoder framework engineered to bridge the gap between phonetic-centric and semantic-centric audio representations. As illustrated in Figure 1, the pipeline operates through four strate- gic stages: (i) the selection of complementary backbone en- coders; (ii) feature harmonization v...
[4]

Dataset To evaluate the cross-domain robustness of WQ-Fusion, we conduct experiments on a comprehensive benchmark compris- ing 15 datasets across three domains

Experiment 4.1. Dataset To evaluate the cross-domain robustness of WQ-Fusion, we conduct experiments on a comprehensive benchmark compris- ing 15 datasets across three domains. The Speech Domain com- prises Speech Commands (SC) [24], LibriCount [25], V oxLin- gua107 [26], V oxCeleb1 [27], ASVspoof [28], Fluent Speech Commands (FSC) [29], V ocalSound [30],...
[5]

Results and Ablation 5.1. Main Evaluation Results The comprehensive evaluation results presented in Table 1 pro- vide a detailed comparison between single-encoder baselines Table 1:Performance comparison of single encoders and fusion strategies. Single Fusion Domain Task Dasheng -Base AudioMAE Whisper -Large Qwen2 -Audio-7B Concat. Adapt. and Trans. Gated...

arXiv
[6]

Conclusion This work introduces WQ-Fusion, a novel framework for uni- versal audio representation that leverages the complementary strengths of encoders trained on diverse datasets and learn- ing paradigms. Built upon frozen Whisper and Qwen back- bones, the system introduces an Adaptive Feature Modulation mechanism to harmoniously align heterogeneous emb...
[7]

The numerical calculations in this paper have been done on the supercomput- ing system in the Supercomputing Center of Wuhan University

Acknowledgments This work was supported by the National Natural Science Foun- dation (NSFC) of China under Grant 62471340. The numerical calculations in this paper have been done on the supercomput- ing system in the Supercomputing Center of Wuhan University
[8]

All research concepts, experimental designs, data analyses, and the core scientific content were independently developed and produced by the human authors

Use of Generative AI Disclosure In the preparation of this manuscript, generative AI tools were utilized exclusively for English translation and language polish- ing. All research concepts, experimental designs, data analyses, and the core scientific content were independently developed and produced by the human authors. The authors take full re- sponsibi...
[9]

Panns: Large-scale pretrained audio neural networks for audio pattern recognition,

Q. Kong, Y . Cao, T. Iqbal, Y . Wang, W. Wang, and M. D. Plumb- ley, “Panns: Large-scale pretrained audio neural networks for audio pattern recognition,”IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 2880–2894, 2020

2020
[10]

Audio self-supervised learning: A survey,

S. Liu, A. Mallol-Ragolta, E. Parada-Cabaleiro, K. Qian, X. Jing, A. Kathan, B. Hu, and B. W. Schuller, “Audio self-supervised learning: A survey,”Patterns, vol. 3, no. 12, 2022

2022
[11]

Learning transferable visual models from natural language supervision,

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agar- wal, G. Sastry, A. Askell, P. Mishkin, J. Clarket al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning. PmLR, 2021, pp. 8748–8763

2021
[12]

WavLM: Large-scale self- supervised pre-training for full stack speech processing,

S. Chen, C. Wang, Z. Chen, Y . Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiaoet al., “WavLM: Large-scale self- supervised pre-training for full stack speech processing,”IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022

2022
[13]

HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,

W.-N. Hsu, B. Bolte, Y .-H. H. Tsai, K. Lakhotia, R. Salakhut- dinov, and A. Mohamed, “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM transactions on audio, speech, and language process- ing, vol. 29, pp. 3451–3460, 2021

2021
[14]

Masked autoencoders that lis- ten,

P.-Y . Huang, H. Xu, J. Li, A. Baevski, M. Auli, W. Galuba, F. Metze, and C. Feichtenhofer, “Masked autoencoders that lis- ten,”Advances in Neural Information Processing Systems, vol. 35, pp. 28 708–28 720, 2022

2022
[15]

Robust speech recognition via large-scale weak supervision,

A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” inInternational conference on machine learning. PMLR, 2023, pp. 28 492–28 518

2023
[16]

Qwen2-audio technical report,

Y . Chu, J. Xu, Q. Yang, H. Wei, X. Wei, Z. Guo, Y . Leng, Y . Lv, J. He, J. Linet al., “Qwen2-audio technical report,”arXiv preprint arXiv:2407.10759, 2024

Pith/arXiv arXiv 2024
[17]

Midashenglm: Efficient au- dio understanding with general audio captions,

H. Dinkel, G. Li, J. Liu, J. Luan, Y . Niu, X. Sun, T. Wang, Q. Xiao, J. Zhang, and J. Zhou, “Midashenglm: Efficient au- dio understanding with general audio captions,”arXiv preprint arXiv:2508.03983, 2025

arXiv 2025
[18]

AUV: Teaching audio universal vector quantization with single nested codebook,

Y . Chen, K. Hu, L. Zhou, S. Feng, X. Yang, H. Chen, and X. Chen, “AUV: Teaching audio universal vector quantization with single nested codebook,”arXiv preprint arXiv:2509.21968, 2025

arXiv 2025
[19]

Unicodec: Unified audio codec with single domain-adaptive codebook,

Y . Jiang, Q. Chen, S. Ji, Y . Xi, W. Wang, C. Zhang, X. Yue, S. Zhang, and H. Li, “Unicodec: Unified audio codec with single domain-adaptive codebook,” inProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Vol- ume 1: Long Papers), 2025, pp. 19 112–19 124

2025
[20]

Semanticodec: An ultra low bitrate semantic audio codec for general sound,

H. Liu, X. Xu, Y . Yuan, M. Wu, W. Wang, and M. D. Plumb- ley, “Semanticodec: An ultra low bitrate semantic audio codec for general sound,”IEEE Journal of Selected Topics in Signal Pro- cessing, vol. 18, no. 8, pp. 1448–1461, 2024

2024
[21]

The cmu-aist submis- sion for the icme 2025 audio encoder challenge,

S. Bharadwaj, S. Cornell, K. Choi, H.-j. Shim, S. Desh- mukh, S. Fukayama, and S. Watanabe, “The cmu-aist submis- sion for the icme 2025 audio encoder challenge,”arXiv preprint arXiv:2601.16273, 2026

arXiv 2025
[22]

Gated attention for large lan- guage models: Non-linearity, sparsity, and attention-sink-free,

Z. Qiu, Z. Wang, B. Zheng, Z. Huang, K. Wen, S. Yang, R. Men, L. Yu, F. Huang, S. Huanget al., “Gated attention for large lan- guage models: Non-linearity, sparsity, and attention-sink-free,” arXiv preprint arXiv:2505.06708, 2025

Pith/arXiv arXiv 2025
[23]

Dynamic cross attention for audio- visual person verification,

R. G. Praveen and J. Alam, “Dynamic cross attention for audio- visual person verification,” in2024 IEEE 18th International Con- ference on Automatic Face and Gesture Recognition (FG). IEEE, 2024, pp. 1–5

2024
[24]

Improving noise robust audio-visual speech recogni- tion via router-gated cross-modal feature fusion,

D. Lim, Y . Kim, D.-H. Kim, D.-H. Yang, and J.-H. Chang, “Improving noise robust audio-visual speech recogni- tion via router-gated cross-modal feature fusion,”arXiv preprint arXiv:2508.18734, 2025

arXiv 2025
[25]

Gia-mic: Multimodal emotion recogni- tion with gated interactive attention and modality-invariant learn- ing constraints,

J. He, J. Mi, and T. Toda, “Gia-mic: Multimodal emotion recogni- tion with gated interactive attention and modality-invariant learn- ing constraints,”arXiv preprint arXiv:2506.00865, 2025

arXiv 2025
[26]

Co- attendwg: Co-attentive dimension-wise gating and expert fusion for multi-modal offensive content detection,

M. M. Hossain, M. S. Hossain, S. Chaki, and M. F. Mridha, “Co- attendwg: Co-attentive dimension-wise gating and expert fusion for multi-modal offensive content detection,”IEEE Transactions on Artificial Intelligence, 2025

2025
[27]

Stock movement prediction with multimodal stable fusion via gated cross-attention mechanism,

C. Zong, J. Wan, L. Cascone, and H. Zhou, “Stock movement prediction with multimodal stable fusion via gated cross-attention mechanism,”Complex & Intelligent Systems, vol. 11, no. 9, p. 396, 2025

2025
[28]

Cognialign: Word-level multimodal speech alignment with gated cross-attention for alzheimer’s de- tection,

D. Ortiz-Perez, M. Benavent-Lledo, J. Rodriguez-Juan, J. Garcia- Rodriguez, and D. Tom ´as, “Cognialign: Word-level multimodal speech alignment with gated cross-attention for alzheimer’s de- tection,”Knowledge-Based Systems, p. 114264, 2025

2025
[29]

FiLM: Visual reasoning with a general conditioning layer,

E. Perez, F. Strub, H. De Vries, V . Dumoulin, and A. Courville, “FiLM: Visual reasoning with a general conditioning layer,” in Proceedings of the AAAI conference on artificial intelligence, vol. 32, no. 1, 2018

2018
[30]

Roformer: Enhanced transformer with rotary position embedding,

J. Su, M. Ahmed, Y . Lu, S. Pan, W. Bo, and Y . Liu, “Roformer: Enhanced transformer with rotary position embedding,”Neuro- computing, vol. 568, p. 127063, 2024

2024
[31]

Bert: Pre- training of deep bidirectional transformers for language under- standing,

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre- training of deep bidirectional transformers for language under- standing,” inProceedings of the 2019 conference of the North American chapter of the association for computational linguis- tics: human language technologies, volume 1 (long and short pa- pers), 2019, pp. 4171–4186

2019
[32]

Speech commands: A dataset for limited-vocabulary speech recognition,

P. Warden, “Speech commands: A dataset for limited-vocabulary speech recognition,”arXiv preprint arXiv:1804.03209, 2018

Pith/arXiv arXiv 2018
[33]

Libricount, a dataset for speaker count estimation,

F.-R. St ¨oter, S. Chakrabarty, E. Habets, and B. Edler, “Libricount, a dataset for speaker count estimation,” 2018

2018
[34]

V oxlingua107: a dataset for spoken lan- guage recognition,

J. Valk and T. Alum ¨ae, “V oxlingua107: a dataset for spoken lan- guage recognition,” in2021 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2021, pp. 652–658

2021
[35]

V oxceleb: Large-scale speaker verification in the wild,

A. Nagrani, J. S. Chung, W. Xie, and A. Zisserman, “V oxceleb: Large-scale speaker verification in the wild,”Computer Speech & Language, vol. 60, p. 101027, 2020

2020
[36]

Au- tomatic speaker verification spoofing and countermeasures chal- lenge (asvspoof 2015) database,

T. Kinnunen, Z. Wu, E. Nicholas Evans, and J. Yamagishi, “Au- tomatic speaker verification spoofing and countermeasures chal- lenge (asvspoof 2015) database,” 2018

2015
[37]

Speech model pre-training for end-to-end spoken language understanding,

L. Lugosch, M. Ravanelli, P. Ignoto, V . S. Tomar, and Y . Ben- gio, “Speech model pre-training for end-to-end spoken language understanding,”arXiv preprint arXiv:1904.03670, 2019

Pith/arXiv arXiv 1904
[38]

V ocalsound: A dataset for improv- ing human vocal sounds recognition,

Y . Gong, J. Yu, and J. Glass, “V ocalsound: A dataset for improv- ing human vocal sounds recognition,” inICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Pro- cessing (ICASSP). IEEE, 2022, pp. 151–155

2022
[39]

Crema-d: Crowd-sourced emotional multimodal actors dataset,

H. Cao, D. G. Cooper, M. K. Keutmann, R. C. Gur, A. Nenkova, and R. Verma, “Crema-d: Crowd-sourced emotional multimodal actors dataset,”IEEE transactions on affective computing, vol. 5, no. 4, pp. 377–390, 2014

2014
[40]

Esc: Dataset for environmental sound classifica- tion,

K. J. Piczak, “Esc: Dataset for environmental sound classifica- tion,” inProceedings of the 23rd ACM international conference on Multimedia, 2015, pp. 1015–1018

2015
[41]

Fsd50k: an open dataset of human-labeled sound events,

E. Fonseca, X. Favory, J. Pons, F. Font, and X. Serra, “Fsd50k: an open dataset of human-labeled sound events,”IEEE/ACM Trans- actions on Audio, Speech, and Language Processing, vol. 30, pp. 829–852, 2021

2021
[42]

A dataset and taxonomy for urban sound research,

J. Salamon, C. Jacoby, and J. P. Bello, “A dataset and taxonomy for urban sound research,” inProceedings of the 22nd ACM inter- national conference on Multimedia, 2014, pp. 1041–1044

2014
[43]

General-purpose tagging of freesound audio with audioset labels: Task description, dataset, and baseline,

E. Fonseca, M. Plakal, F. Font, D. P. Ellis, X. Favory, J. Pons, and X. Serra, “General-purpose tagging of freesound audio with audioset labels: Task description, dataset, and baseline,”arXiv preprint arXiv:1807.09902, 2018

Pith/arXiv arXiv 2018
[44]

The gtzan dataset: Its contents, its faults, their effects on evaluation, and its future use,

B. L. Sturm, “The gtzan dataset: Its contents, its faults, their effects on evaluation, and its future use,”arXiv preprint arXiv:1306.1461, 2013

Pith/arXiv arXiv 2013
[45]

Neural audio synthesis of musical notes with wavenet autoencoders,

J. Engel, C. Resnick, A. Roberts, S. Dieleman, M. Norouzi, D. Eck, and K. Simonyan, “Neural audio synthesis of musical notes with wavenet autoencoders,” inInternational conference on machine learning. PMLR, 2017, pp. 1068–1077

2017
[46]

Fma: A dataset for music analysis,

M. Defferrard, K. Benzi, P. Vandergheynst, and X. Bresson, “Fma: A dataset for music analysis,”arXiv preprint arXiv:1612.01840, 2016

Pith/arXiv arXiv 2016
[47]

Lora: Low-rank adaptation of large language models

E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, W. Chenet al., “Lora: Low-rank adaptation of large language models.”Iclr, vol. 1, no. 2, p. 3, 2022

2022
[48]

Attention is all you need,

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017

2017

[1] [1]

task specificity

Introduction As a fundamental component of audio signal processing sys- tems, audio encoders are widely used across diverse down- stream applications, including speech processing, sound anal- ysis, and music information retrieval (MIR) [1]. From a per- ceptual perspective, the quality of the learned audio represen- tations largely determines the upper bou...

Pith/arXiv arXiv 2026

[2] [2]

Related Work 2.1. Foundational Audio Encoders and Universal Repre- sentations Recent progress in audio representation learning has seen a convergence toward unified single-encoder architectures, each characterized by distinct inductive biases. Frameworks like AudioMAE [6] prioritize reconstruction-based acoustic mod- eling to capture fine-grained textures...

[3] [3]

Method This section details the architectural design of WQ-Fusion, a dual-encoder framework engineered to bridge the gap between phonetic-centric and semantic-centric audio representations. As illustrated in Figure 1, the pipeline operates through four strate- gic stages: (i) the selection of complementary backbone en- coders; (ii) feature harmonization v...

[4] [4]

Dataset To evaluate the cross-domain robustness of WQ-Fusion, we conduct experiments on a comprehensive benchmark compris- ing 15 datasets across three domains

Experiment 4.1. Dataset To evaluate the cross-domain robustness of WQ-Fusion, we conduct experiments on a comprehensive benchmark compris- ing 15 datasets across three domains. The Speech Domain com- prises Speech Commands (SC) [24], LibriCount [25], V oxLin- gua107 [26], V oxCeleb1 [27], ASVspoof [28], Fluent Speech Commands (FSC) [29], V ocalSound [30],...

[5] [5]

Results and Ablation 5.1. Main Evaluation Results The comprehensive evaluation results presented in Table 1 pro- vide a detailed comparison between single-encoder baselines Table 1:Performance comparison of single encoders and fusion strategies. Single Fusion Domain Task Dasheng -Base AudioMAE Whisper -Large Qwen2 -Audio-7B Concat. Adapt. and Trans. Gated...

arXiv

[6] [6]

Conclusion This work introduces WQ-Fusion, a novel framework for uni- versal audio representation that leverages the complementary strengths of encoders trained on diverse datasets and learn- ing paradigms. Built upon frozen Whisper and Qwen back- bones, the system introduces an Adaptive Feature Modulation mechanism to harmoniously align heterogeneous emb...

[7] [7]

The numerical calculations in this paper have been done on the supercomput- ing system in the Supercomputing Center of Wuhan University

Acknowledgments This work was supported by the National Natural Science Foun- dation (NSFC) of China under Grant 62471340. The numerical calculations in this paper have been done on the supercomput- ing system in the Supercomputing Center of Wuhan University

[8] [8]

All research concepts, experimental designs, data analyses, and the core scientific content were independently developed and produced by the human authors

Use of Generative AI Disclosure In the preparation of this manuscript, generative AI tools were utilized exclusively for English translation and language polish- ing. All research concepts, experimental designs, data analyses, and the core scientific content were independently developed and produced by the human authors. The authors take full re- sponsibi...

[9] [9]

Panns: Large-scale pretrained audio neural networks for audio pattern recognition,

Q. Kong, Y . Cao, T. Iqbal, Y . Wang, W. Wang, and M. D. Plumb- ley, “Panns: Large-scale pretrained audio neural networks for audio pattern recognition,”IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 2880–2894, 2020

2020

[10] [10]

Audio self-supervised learning: A survey,

S. Liu, A. Mallol-Ragolta, E. Parada-Cabaleiro, K. Qian, X. Jing, A. Kathan, B. Hu, and B. W. Schuller, “Audio self-supervised learning: A survey,”Patterns, vol. 3, no. 12, 2022

2022

[11] [11]

Learning transferable visual models from natural language supervision,

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agar- wal, G. Sastry, A. Askell, P. Mishkin, J. Clarket al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning. PmLR, 2021, pp. 8748–8763

2021

[12] [12]

WavLM: Large-scale self- supervised pre-training for full stack speech processing,

S. Chen, C. Wang, Z. Chen, Y . Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiaoet al., “WavLM: Large-scale self- supervised pre-training for full stack speech processing,”IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022

2022

[13] [13]

HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,

W.-N. Hsu, B. Bolte, Y .-H. H. Tsai, K. Lakhotia, R. Salakhut- dinov, and A. Mohamed, “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM transactions on audio, speech, and language process- ing, vol. 29, pp. 3451–3460, 2021

2021

[14] [14]

Masked autoencoders that lis- ten,

P.-Y . Huang, H. Xu, J. Li, A. Baevski, M. Auli, W. Galuba, F. Metze, and C. Feichtenhofer, “Masked autoencoders that lis- ten,”Advances in Neural Information Processing Systems, vol. 35, pp. 28 708–28 720, 2022

2022

[15] [15]

Robust speech recognition via large-scale weak supervision,

A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” inInternational conference on machine learning. PMLR, 2023, pp. 28 492–28 518

2023

[16] [16]

Qwen2-audio technical report,

Y . Chu, J. Xu, Q. Yang, H. Wei, X. Wei, Z. Guo, Y . Leng, Y . Lv, J. He, J. Linet al., “Qwen2-audio technical report,”arXiv preprint arXiv:2407.10759, 2024

Pith/arXiv arXiv 2024

[17] [17]

Midashenglm: Efficient au- dio understanding with general audio captions,

H. Dinkel, G. Li, J. Liu, J. Luan, Y . Niu, X. Sun, T. Wang, Q. Xiao, J. Zhang, and J. Zhou, “Midashenglm: Efficient au- dio understanding with general audio captions,”arXiv preprint arXiv:2508.03983, 2025

arXiv 2025

[18] [18]

AUV: Teaching audio universal vector quantization with single nested codebook,

Y . Chen, K. Hu, L. Zhou, S. Feng, X. Yang, H. Chen, and X. Chen, “AUV: Teaching audio universal vector quantization with single nested codebook,”arXiv preprint arXiv:2509.21968, 2025

arXiv 2025

[19] [19]

Unicodec: Unified audio codec with single domain-adaptive codebook,

Y . Jiang, Q. Chen, S. Ji, Y . Xi, W. Wang, C. Zhang, X. Yue, S. Zhang, and H. Li, “Unicodec: Unified audio codec with single domain-adaptive codebook,” inProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Vol- ume 1: Long Papers), 2025, pp. 19 112–19 124

2025

[20] [20]

Semanticodec: An ultra low bitrate semantic audio codec for general sound,

H. Liu, X. Xu, Y . Yuan, M. Wu, W. Wang, and M. D. Plumb- ley, “Semanticodec: An ultra low bitrate semantic audio codec for general sound,”IEEE Journal of Selected Topics in Signal Pro- cessing, vol. 18, no. 8, pp. 1448–1461, 2024

2024

[21] [21]

The cmu-aist submis- sion for the icme 2025 audio encoder challenge,

S. Bharadwaj, S. Cornell, K. Choi, H.-j. Shim, S. Desh- mukh, S. Fukayama, and S. Watanabe, “The cmu-aist submis- sion for the icme 2025 audio encoder challenge,”arXiv preprint arXiv:2601.16273, 2026

arXiv 2025

[22] [22]

Gated attention for large lan- guage models: Non-linearity, sparsity, and attention-sink-free,

Z. Qiu, Z. Wang, B. Zheng, Z. Huang, K. Wen, S. Yang, R. Men, L. Yu, F. Huang, S. Huanget al., “Gated attention for large lan- guage models: Non-linearity, sparsity, and attention-sink-free,” arXiv preprint arXiv:2505.06708, 2025

Pith/arXiv arXiv 2025

[23] [23]

Dynamic cross attention for audio- visual person verification,

R. G. Praveen and J. Alam, “Dynamic cross attention for audio- visual person verification,” in2024 IEEE 18th International Con- ference on Automatic Face and Gesture Recognition (FG). IEEE, 2024, pp. 1–5

2024

[24] [24]

Improving noise robust audio-visual speech recogni- tion via router-gated cross-modal feature fusion,

D. Lim, Y . Kim, D.-H. Kim, D.-H. Yang, and J.-H. Chang, “Improving noise robust audio-visual speech recogni- tion via router-gated cross-modal feature fusion,”arXiv preprint arXiv:2508.18734, 2025

arXiv 2025

[25] [25]

Gia-mic: Multimodal emotion recogni- tion with gated interactive attention and modality-invariant learn- ing constraints,

J. He, J. Mi, and T. Toda, “Gia-mic: Multimodal emotion recogni- tion with gated interactive attention and modality-invariant learn- ing constraints,”arXiv preprint arXiv:2506.00865, 2025

arXiv 2025

[26] [26]

Co- attendwg: Co-attentive dimension-wise gating and expert fusion for multi-modal offensive content detection,

M. M. Hossain, M. S. Hossain, S. Chaki, and M. F. Mridha, “Co- attendwg: Co-attentive dimension-wise gating and expert fusion for multi-modal offensive content detection,”IEEE Transactions on Artificial Intelligence, 2025

2025

[27] [27]

Stock movement prediction with multimodal stable fusion via gated cross-attention mechanism,

C. Zong, J. Wan, L. Cascone, and H. Zhou, “Stock movement prediction with multimodal stable fusion via gated cross-attention mechanism,”Complex & Intelligent Systems, vol. 11, no. 9, p. 396, 2025

2025

[28] [28]

Cognialign: Word-level multimodal speech alignment with gated cross-attention for alzheimer’s de- tection,

D. Ortiz-Perez, M. Benavent-Lledo, J. Rodriguez-Juan, J. Garcia- Rodriguez, and D. Tom ´as, “Cognialign: Word-level multimodal speech alignment with gated cross-attention for alzheimer’s de- tection,”Knowledge-Based Systems, p. 114264, 2025

2025

[29] [29]

FiLM: Visual reasoning with a general conditioning layer,

E. Perez, F. Strub, H. De Vries, V . Dumoulin, and A. Courville, “FiLM: Visual reasoning with a general conditioning layer,” in Proceedings of the AAAI conference on artificial intelligence, vol. 32, no. 1, 2018

2018

[30] [30]

Roformer: Enhanced transformer with rotary position embedding,

J. Su, M. Ahmed, Y . Lu, S. Pan, W. Bo, and Y . Liu, “Roformer: Enhanced transformer with rotary position embedding,”Neuro- computing, vol. 568, p. 127063, 2024

2024

[31] [31]

Bert: Pre- training of deep bidirectional transformers for language under- standing,

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre- training of deep bidirectional transformers for language under- standing,” inProceedings of the 2019 conference of the North American chapter of the association for computational linguis- tics: human language technologies, volume 1 (long and short pa- pers), 2019, pp. 4171–4186

2019

[32] [32]

Speech commands: A dataset for limited-vocabulary speech recognition,

P. Warden, “Speech commands: A dataset for limited-vocabulary speech recognition,”arXiv preprint arXiv:1804.03209, 2018

Pith/arXiv arXiv 2018

[33] [33]

Libricount, a dataset for speaker count estimation,

F.-R. St ¨oter, S. Chakrabarty, E. Habets, and B. Edler, “Libricount, a dataset for speaker count estimation,” 2018

2018

[34] [34]

V oxlingua107: a dataset for spoken lan- guage recognition,

J. Valk and T. Alum ¨ae, “V oxlingua107: a dataset for spoken lan- guage recognition,” in2021 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2021, pp. 652–658

2021

[35] [35]

V oxceleb: Large-scale speaker verification in the wild,

A. Nagrani, J. S. Chung, W. Xie, and A. Zisserman, “V oxceleb: Large-scale speaker verification in the wild,”Computer Speech & Language, vol. 60, p. 101027, 2020

2020

[36] [36]

Au- tomatic speaker verification spoofing and countermeasures chal- lenge (asvspoof 2015) database,

T. Kinnunen, Z. Wu, E. Nicholas Evans, and J. Yamagishi, “Au- tomatic speaker verification spoofing and countermeasures chal- lenge (asvspoof 2015) database,” 2018

2015

[37] [37]

Speech model pre-training for end-to-end spoken language understanding,

L. Lugosch, M. Ravanelli, P. Ignoto, V . S. Tomar, and Y . Ben- gio, “Speech model pre-training for end-to-end spoken language understanding,”arXiv preprint arXiv:1904.03670, 2019

Pith/arXiv arXiv 1904

[38] [38]

V ocalsound: A dataset for improv- ing human vocal sounds recognition,

Y . Gong, J. Yu, and J. Glass, “V ocalsound: A dataset for improv- ing human vocal sounds recognition,” inICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Pro- cessing (ICASSP). IEEE, 2022, pp. 151–155

2022

[39] [39]

Crema-d: Crowd-sourced emotional multimodal actors dataset,

H. Cao, D. G. Cooper, M. K. Keutmann, R. C. Gur, A. Nenkova, and R. Verma, “Crema-d: Crowd-sourced emotional multimodal actors dataset,”IEEE transactions on affective computing, vol. 5, no. 4, pp. 377–390, 2014

2014

[40] [40]

Esc: Dataset for environmental sound classifica- tion,

K. J. Piczak, “Esc: Dataset for environmental sound classifica- tion,” inProceedings of the 23rd ACM international conference on Multimedia, 2015, pp. 1015–1018

2015

[41] [41]

Fsd50k: an open dataset of human-labeled sound events,

E. Fonseca, X. Favory, J. Pons, F. Font, and X. Serra, “Fsd50k: an open dataset of human-labeled sound events,”IEEE/ACM Trans- actions on Audio, Speech, and Language Processing, vol. 30, pp. 829–852, 2021

2021

[42] [42]

A dataset and taxonomy for urban sound research,

J. Salamon, C. Jacoby, and J. P. Bello, “A dataset and taxonomy for urban sound research,” inProceedings of the 22nd ACM inter- national conference on Multimedia, 2014, pp. 1041–1044

2014

[43] [43]

General-purpose tagging of freesound audio with audioset labels: Task description, dataset, and baseline,

E. Fonseca, M. Plakal, F. Font, D. P. Ellis, X. Favory, J. Pons, and X. Serra, “General-purpose tagging of freesound audio with audioset labels: Task description, dataset, and baseline,”arXiv preprint arXiv:1807.09902, 2018

Pith/arXiv arXiv 2018

[44] [44]

The gtzan dataset: Its contents, its faults, their effects on evaluation, and its future use,

B. L. Sturm, “The gtzan dataset: Its contents, its faults, their effects on evaluation, and its future use,”arXiv preprint arXiv:1306.1461, 2013

Pith/arXiv arXiv 2013

[45] [45]

Neural audio synthesis of musical notes with wavenet autoencoders,

J. Engel, C. Resnick, A. Roberts, S. Dieleman, M. Norouzi, D. Eck, and K. Simonyan, “Neural audio synthesis of musical notes with wavenet autoencoders,” inInternational conference on machine learning. PMLR, 2017, pp. 1068–1077

2017

[46] [46]

Fma: A dataset for music analysis,

M. Defferrard, K. Benzi, P. Vandergheynst, and X. Bresson, “Fma: A dataset for music analysis,”arXiv preprint arXiv:1612.01840, 2016

Pith/arXiv arXiv 2016

[47] [47]

Lora: Low-rank adaptation of large language models

E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, W. Chenet al., “Lora: Low-rank adaptation of large language models.”Iclr, vol. 1, no. 2, p. 3, 2022

2022

[48] [48]

Attention is all you need,

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017

2017