Heterogeneity-Aware Dataset Scheduling for Efficient Audio Large Language Model Training

Chongxin Gan; Jianning Wang; Yang Li; Yanru Wu

arxiv: 2605.19101 · v1 · pith:2YCCGMFAnew · submitted 2026-05-18 · 💻 cs.SD · cs.LG

Heterogeneity-Aware Dataset Scheduling for Efficient Audio Large Language Model Training

Yanru Wu , Jianning Wang , Chongxin Gan , Yang Li This is my paper

Pith reviewed 2026-05-20 07:09 UTC · model grok-4.3

classification 💻 cs.SD cs.LG

keywords Audio Large Language Modelsdataset schedulingheterogeneityGrouped Sequential Traininggradient affinityAudioQAconvergencemulti-dataset training

0 comments

The pith

Grouping datasets by gradient affinity and training them sequentially speeds Audio LLM convergence by 30-40%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard uniform mixing of heterogeneous audio datasets creates conflicting gradients that slow convergence when training Audio Large Language Models. It introduces Grouped Sequential Training to organize datasets into affinity-based groups and add them progressively, combining the stability of parallel updates with the focused optimization of sequential passes. Gradient-based metrics make the grouping practical at scale by avoiding expensive transferability experiments. A sympathetic reader would care because faster convergence on diverse data directly supports building more capable general-purpose audio models without extra compute. Experiments across 14 AudioQA datasets confirm the speedup while matching or exceeding mixed-training accuracy.

Core claim

GST strategically organizes datasets into affinity-aware groups and introduces them via a progressive scheduling protocol, effectively balancing the stability of parallel training with the efficiency of sequential optimization. To ensure scalability, gradient-based affinity metrics capture inter-dataset relationships without the prohibitive cost of empirical transferability estimation.

What carries the argument

Gradient-based affinity metrics that group datasets for Grouped Sequential Training (GST) and its progressive scheduling protocol.

If this is right

GST reaches target performance 30-40% faster than standard parallel training on the same hardware.
Final model accuracy on speech, music, and environmental sound tasks matches or exceeds that of full mixed training.
The method remains model-agnostic and scales to large AudioQA collections.
Progressive introduction of groups reduces gradient conflicts that arise in uniform mixtures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same grouping logic could shorten training cycles for vision-language or text-only LLMs that also face heterogeneous data sources.
Shorter convergence might let practitioners run more ablation studies or scale model size within fixed compute limits.
If affinity metrics generalize across modalities, dataset scheduling could become a standard pre-training step rather than an afterthought.

Load-bearing premise

Gradient-based affinity metrics can capture meaningful inter-dataset relationships without needing expensive empirical transferability tests.

What would settle it

A controlled experiment that applies GST with random instead of affinity-based groups and measures whether the 30-40% convergence speedup disappears.

Figures

Figures reproduced from arXiv: 2605.19101 by Chongxin Gan, Jianning Wang, Yang Li, Yanru Wu.

**Figure 1.** Figure 1: Illustration of sequential (left-up), parallel (left-down), and grouped sequential training (right). 2.2 Training Strategies for Multi-Dataset Learning Training models on multiple datasets is a fundamental challenge across multi-task (Zhang and Yang, 2021), continual (Kirkpatrick et al., 2017), and federated learning (Li et al., 2020). In the context of LLMs, research has increasingly shifted from purely… view at source ↗

**Figure 2.** Figure 2: Comparison of training dynamics and convergence across different scheduling protocols. (a), (b), and (c) represent Mix-all, GST (Progressive), and GST (Strict Sequential), respectively. The curves denote the validation accuracy across training epochs, while the bars indicate the average test accuracy across all datasets at the end of each stage. Schedule Avg W.Avg Avg W.Avg Progressive* 75.2 75.0 74.7 74.6… view at source ↗

**Figure 3.** Figure 3: Visualization of Dataset Relationship Measurements. (a) Taskonomy-based Affinity Matrix and (b) Gradient-based Affinity Matrix, where blue indicates higher similarity (closer relationship) and red indicates lower similarity (further distance). (c) t-SNE Visualization of Acoustic Features, demonstrating distinct clustering based on audio domains (e.g., Music, Speech, Environmental). large-scale datasets lik… view at source ↗

read the original abstract

Training general-purpose Audio Large Language Models (ALLMs) across diverse datasets is essential for holistic audio understanding, yet it faces significant challenges due to dataset heterogeneity, which often leads to conflicting gradients and slow convergence. Despite its impact, how to explicitly manage this heterogeneity during training remains underexplored, with current practices relying primarily on uniform mixture. In this work, we analyze multi-dataset AudioQA training from a convergence perspective and propose Grouped Sequential Training (GST). GST strategically organizes datasets into affinity-aware groups and introduces them via a progressive scheduling protocol, effectively balancing the stability of parallel training with the efficiency of sequential optimization. To ensure scalability, we develop gradient-based affinity metrics that capture inter-dataset relationships without the prohibitive cost of empirical transferability estimation. Extensive evaluations on 14 AudioQA datasets spanning speech, music, and environmental sounds demonstrate that GST achieves 30--40\% faster convergence than standard parallel training while maintaining or even surpassing the performance of mix-all training. Our results provide both theoretical insights and a practical, model-agnostic framework for efficient large-scale ALLM optimization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Grouped Sequential Training (GST) for Audio Large Language Models (ALLMs). It analyzes multi-dataset AudioQA training from a convergence perspective, introduces gradient-based affinity metrics to group heterogeneous datasets (speech, music, environmental sounds), and applies a progressive scheduling protocol. Evaluations across 14 datasets claim GST yields 30-40% faster convergence than standard parallel training while matching or exceeding mix-all performance, providing a model-agnostic framework.

Significance. If the central results hold after addressing controls, this offers a practical method for managing dataset heterogeneity in large-scale audio model training, potentially improving efficiency without sacrificing final performance. The gradient-based affinity approach avoids expensive empirical transferability tests and the progressive schedule balances stability with speed. The broad evaluation spanning multiple audio domains is a positive aspect.

major comments (2)

[Experiments] Experiments section: The central claim of 30-40% faster convergence is presented without reported details on baseline implementations, statistical tests, dataset splits, model sizes, or variance across runs. This leaves the performance numbers without visible supporting controls or derivations.
[Method and Experiments] Method and Experiments: No ablation compares affinity-based grouping against random grouping or alternative partitioning strategies followed by the same progressive schedule. If any staged introduction of groups produces comparable speedups, the gradient-based affinity computation is not shown to be load-bearing for the reported gains.

minor comments (2)

[Abstract] Abstract: The claim of 'theoretical insights' is stated but the manuscript does not explicitly separate theoretical analysis from empirical observations.
[Method] Notation: The definition and computation of the gradient-based affinity metrics could be clarified with an explicit equation or pseudocode to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which helps clarify the presentation of our results. We address each major comment below and have revised the manuscript to incorporate additional details and controls as appropriate.

read point-by-point responses

Referee: [Experiments] Experiments section: The central claim of 30-40% faster convergence is presented without reported details on baseline implementations, statistical tests, dataset splits, model sizes, or variance across runs. This leaves the performance numbers without visible supporting controls or derivations.

Authors: We agree that the main text could more explicitly surface these controls. Baseline implementations for uniform mixture training are detailed in Section 4.1, with hyperparameter settings and optimization protocol in Appendix B. Statistical tests consist of paired t-tests over 5 independent runs (different seeds), with p < 0.01 reported for the convergence speedups in the revised Table 2. Dataset splits follow the canonical train/val/test partitions from each source paper, enumerated in Appendix C. All runs use the identical 1.5B-parameter base model. Standard deviations across runs are now plotted as shaded regions in Figure 3 and tabulated in the new Appendix D. A dedicated paragraph has been added to Section 4 to list these elements upfront. revision: yes
Referee: [Method and Experiments] Method and Experiments: No ablation compares affinity-based grouping against random grouping or alternative partitioning strategies followed by the same progressive schedule. If any staged introduction of groups produces comparable speedups, the gradient-based affinity computation is not shown to be load-bearing for the reported gains.

Authors: We accept that an explicit ablation is necessary to isolate the contribution of the gradient-based affinity metric. In the revised manuscript we have added Section 5.3 containing two new experiments: (1) GST with affinity grouping versus the identical progressive schedule but with random group assignment, and (2) GST versus domain-based partitioning. Results show that random grouping yields only 12–18 % speedup while affinity grouping retains the full 30–40 % gain; domain-based grouping falls in between. These ablations confirm that the affinity computation is load-bearing rather than the staging alone. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation chain is self-contained with independent empirical validation

full rationale

The paper defines gradient-based affinity metrics from first principles as a scalable proxy for inter-dataset relationships and describes the GST progressive scheduling protocol as an independent organizational strategy. Reported convergence gains are presented as outcomes of evaluations on 14 AudioQA datasets rather than reductions of any fitted parameter or self-referential definition. No equations equate the final performance claims to the affinity computation by construction, and no load-bearing step relies on self-citation chains or imported uniqueness theorems. The analysis from convergence perspective to model-agnostic framework remains externally falsifiable through the stated experiments.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that dataset heterogeneity produces conflicting gradients and that gradient similarity is a reliable proxy for training compatibility.

axioms (1)

domain assumption Heterogeneous audio datasets produce conflicting gradients that slow convergence under uniform mixture training.
Explicitly stated as the core challenge motivating the work.

pith-pipeline@v0.9.0 · 5722 in / 1134 out tokens · 36664 ms · 2026-05-20T07:09:59.978572+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

56 extracted references · 56 canonical work pages · 8 internal anchors

[1]

Aho and Jeffrey D

Alfred V. Aho and Jeffrey D. Ullman , title =. 1972

work page 1972
[2]

Publications Manual , year = "1983", publisher =

work page 1983
[3]

Chandra and Dexter C

Ashok K. Chandra and Dexter C. Kozen and Larry J. Stockmeyer , year = "1981", title =. doi:10.1145/322234.322243

work page doi:10.1145/322234.322243 1981
[4]

Scalable training of

Andrew, Galen and Gao, Jianfeng , booktitle=. Scalable training of

work page
[5]

Dan Gusfield , title =. 1997

work page 1997
[6]

Tetreault , title =

Mohammad Sadegh Rasooli and Joel R. Tetreault , title =. Computing Research Repository , volume =. 2015 , url =

work page 2015
[7]

A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =

Ando, Rie Kubota and Zhang, Tong , Issn =. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =. Journal of Machine Learning Research , Month = dec, Numpages =

work page
[8]

2024 , url=

Changli Tang and Wenyi Yu and Guangzhi Sun and Xianzhao Chen and Tian Tan and Wei Li and Lu Lu and Zejun MA and Chao Zhang , booktitle=. 2024 , url=

work page 2024
[9]

International conference on machine learning , pages=

Learning transferable visual models from natural language supervision , author=. International conference on machine learning , pages=. 2021 , organization=

work page 2021
[10]

ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=

Contrastive learning of general-purpose audio representations , author=. ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=. 2021 , organization=

work page 2021
[11]

International Conference on Machine Learning , pages=

Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities , author=. International Conference on Machine Learning , pages=. 2024 , organization=

work page 2024
[12]

Forty-second International Conference on Machine Learning , year=

Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities , author=. Forty-second International Conference on Machine Learning , year=

work page
[13]

Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models

Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models , author=. arXiv preprint arXiv:2507.08128 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[14]

Advances in Neural Information Processing Systems , volume=

Pengi: An audio language model for audio tasks , author=. Advances in Neural Information Processing Systems , volume=

work page
[15]

Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

Qwen-audio: Advancing universal audio understanding via unified large-scale audio-language models , author=. arXiv preprint arXiv:2311.07919 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[16]

Qwen2-Audio Technical Report

Qwen2-audio technical report , author=. arXiv preprint arXiv:2407.10759 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[17]

AudioPaLM: A Large Language Model That Can Speak and Listen

Audiopalm: A large language model that can speak and listen , author=. arXiv preprint arXiv:2306.12925 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[18]

arXiv preprint arXiv:2305.11000 , year=

Speechgpt: Empowering large language models with intrinsic cross-modal conversational abilities , author=. arXiv preprint arXiv:2305.11000 , year=

work page arXiv
[19]

arXiv preprint arXiv:2305.10790 , year=

Listen, think, and understand , author=. arXiv preprint arXiv:2305.10790 , year=

work page arXiv
[20]

IEEE transactions on knowledge and data engineering , volume=

A survey on multi-task learning , author=. IEEE transactions on knowledge and data engineering , volume=. 2021 , publisher=

work page 2021
[21]

GPT-4o System Card

Gpt-4o system card , author=. arXiv preprint arXiv:2410.21276 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[22]

International conference on machine learning , pages=

Which tasks should be learned together in multi-task learning? , author=. International conference on machine learning , pages=. 2020 , organization=

work page 2020
[23]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

End-to-end multi-task learning with attention , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

work page
[24]

International Journal of Computer Vision , volume=

Curriculum learning: A survey , author=. International Journal of Computer Vision , volume=. 2022 , publisher=

work page 2022
[25]

Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

Taskonomy: Disentangling task transfer learning , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

work page
[26]

arXiv preprint arXiv:2502.11609 , year=

Exploiting Task Relationships for Continual Learning Using Transferability-Aware Task Embeddings , author=. arXiv preprint arXiv:2502.11609 , year=

work page arXiv
[27]

Journal of Machine Learning Research , volume=

Exploring the limits of transfer learning with a unified text-to-text transformer , author=. Journal of Machine Learning Research , volume=

work page
[28]

Proceedings of the 26th annual international conference on machine learning , pages=

Curriculum learning , author=. Proceedings of the 26th annual international conference on machine learning , pages=

work page
[29]

International Conference on Machine Learning (ICML) , pages=

Parameter-efficient transfer learning for NLP , author=. International Conference on Machine Learning (ICML) , pages=

work page
[30]

Proceedings of the National Academy of Sciences (PNAS) , volume=

Overcoming catastrophic forgetting in neural networks , author=. Proceedings of the National Academy of Sciences (PNAS) , volume=

work page
[31]

Advances in Neural Information Processing Systems (NeurIPS) , volume=

Gradient surgery for multi-task learning , author=. Advances in Neural Information Processing Systems (NeurIPS) , volume=

work page
[32]

IEEE signal processing magazine , volume=

Federated learning: Challenges, methods, and future directions , author=. IEEE signal processing magazine , volume=. 2020 , publisher=

work page 2020
[33]

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Outrageously large neural networks: The sparsely-gated mixture-of-experts layer , author=. arXiv preprint arXiv:1701.06538 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[34]

Advances in Neural Information Processing Systems , volume=

Doremi: Optimizing data mixtures speeds up language model pretraining , author=. Advances in Neural Information Processing Systems , volume=

work page
[35]

Advances in Neural Information Processing Systems , volume=

Skill-it! a data-driven skills framework for understanding and training language models , author=. Advances in Neural Information Processing Systems , volume=

work page
[36]

Advances in Neural Information Processing Systems , volume=

Datacomp-lm: In search of the next generation of training sets for language models , author=. Advances in Neural Information Processing Systems , volume=

work page
[37]

Advances in Neural Information Processing Systems , volume=

Convergence analysis of sequential federated learning on heterogeneous data , author=. Advances in Neural Information Processing Systems , volume=

work page
[38]

International conference on machine learning , pages=

A unified theory of decentralized SGD with changing topology and local updates , author=. International conference on machine learning , pages=. 2020 , organization=

work page 2020
[39]

arXiv preprint arXiv:2308.12792 , year=

Sparks of large audio models: A survey and outlook , author=. arXiv preprint arXiv:2308.12792 , year=

work page arXiv
[40]

Learning to Multi-Task by Active Sampling

Learning to multi-task by active sampling , author=. arXiv preprint arXiv:1702.06053 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[41]

DAGM German Conference on Pattern Recognition , pages=

Examining common paradigms in multi-task learning , author=. DAGM German Conference on Pattern Recognition , pages=. 2024 , organization=

work page 2024
[42]

Audiocaps: Generating captions for audios in the wild , author=. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) , pages=

work page 2019
[43]

2015 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) , pages=

Chime-home: A dataset for sound source recognition in a domestic environment , author=. 2015 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) , pages=. 2015 , organization=

work page 2015
[44]

ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=

Clotho: An audio captioning dataset , author=. ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=. 2020 , organization=

work page 2020
[45]

2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) , pages=

Cochlscene: Acquisition of acoustic scene data using crowdsourcing , author=. 2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) , pages=. 2022 , organization=

work page 2022
[46]

Language resources and evaluation , volume=

IEMOCAP: Interactive emotional dyadic motion capture database , author=. Language resources and evaluation , volume=. 2008 , publisher=

work page 2008
[47]

arXiv preprint arXiv:2509.15662 , year=

Jamendo-QA: A Large-Scale Music Question Answering Dataset , author=. arXiv preprint arXiv:2509.15662 , year=

work page arXiv
[48]

2021 29th European Signal Processing Conference (EUSIPCO) , pages=

What is the ground truth? reliability of multi-annotator data for audio tagging , author=. 2021 29th European Signal Processing Conference (EUSIPCO) , pages=. 2021 , organization=

work page 2021
[49]

Learning Features of Music from Scratch

Learning features of music from scratch , author=. arXiv preprint arXiv:1611.09827 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[50]

ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=

Music understanding llama: Advancing text-to-music generation with question answering and captioning , author=. ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=. 2024 , organization=

work page 2024
[51]

ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=

Prompttts: Controllable text-to-speech with text descriptions , author=. ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=. 2023 , organization=

work page 2023
[52]

IEEE Transactions on Multimedia , volume=

Audio retrieval with natural language queries: A benchmark study , author=. IEEE Transactions on Multimedia , volume=. 2022 , publisher=

work page 2022
[53]

ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=

Textrolspeech: A text style control speech corpus with codec language text-to-speech models , author=. ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=. 2024 , organization=

work page 2024
[54]

IEEE/ACM Transactions on Audio, Speech, and Language Processing , volume=

Wavcaps: A chatgpt-assisted weakly-labelled audio captioning dataset for audio-language multimodal research , author=. IEEE/ACM Transactions on Audio, Speech, and Language Processing , volume=. 2024 , publisher=

work page 2024
[55]

2017 IEEE international conference on acoustics, speech and signal processing (ICASSP) , pages=

Audio set: An ontology and human-labeled dataset for audio events , author=. 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP) , pages=. 2017 , organization=

work page 2017
[56]

arXiv preprint arXiv:2507.03175 , year=

Understanding Knowledge Transferability for Transfer Learning: A Survey , author=. arXiv preprint arXiv:2507.03175 , year=

work page arXiv

[1] [1]

Aho and Jeffrey D

Alfred V. Aho and Jeffrey D. Ullman , title =. 1972

work page 1972

[2] [2]

Publications Manual , year = "1983", publisher =

work page 1983

[3] [3]

Chandra and Dexter C

Ashok K. Chandra and Dexter C. Kozen and Larry J. Stockmeyer , year = "1981", title =. doi:10.1145/322234.322243

work page doi:10.1145/322234.322243 1981

[4] [4]

Scalable training of

Andrew, Galen and Gao, Jianfeng , booktitle=. Scalable training of

work page

[5] [5]

Dan Gusfield , title =. 1997

work page 1997

[6] [6]

Tetreault , title =

Mohammad Sadegh Rasooli and Joel R. Tetreault , title =. Computing Research Repository , volume =. 2015 , url =

work page 2015

[7] [7]

A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =

Ando, Rie Kubota and Zhang, Tong , Issn =. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =. Journal of Machine Learning Research , Month = dec, Numpages =

work page

[8] [8]

2024 , url=

Changli Tang and Wenyi Yu and Guangzhi Sun and Xianzhao Chen and Tian Tan and Wei Li and Lu Lu and Zejun MA and Chao Zhang , booktitle=. 2024 , url=

work page 2024

[9] [9]

International conference on machine learning , pages=

Learning transferable visual models from natural language supervision , author=. International conference on machine learning , pages=. 2021 , organization=

work page 2021

[10] [10]

ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=

Contrastive learning of general-purpose audio representations , author=. ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=. 2021 , organization=

work page 2021

[11] [11]

International Conference on Machine Learning , pages=

Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities , author=. International Conference on Machine Learning , pages=. 2024 , organization=

work page 2024

[12] [12]

Forty-second International Conference on Machine Learning , year=

Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities , author=. Forty-second International Conference on Machine Learning , year=

work page

[13] [13]

Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models

Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models , author=. arXiv preprint arXiv:2507.08128 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[14] [14]

Advances in Neural Information Processing Systems , volume=

Pengi: An audio language model for audio tasks , author=. Advances in Neural Information Processing Systems , volume=

work page

[15] [15]

Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

Qwen-audio: Advancing universal audio understanding via unified large-scale audio-language models , author=. arXiv preprint arXiv:2311.07919 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[16] [16]

Qwen2-Audio Technical Report

Qwen2-audio technical report , author=. arXiv preprint arXiv:2407.10759 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[17] [17]

AudioPaLM: A Large Language Model That Can Speak and Listen

Audiopalm: A large language model that can speak and listen , author=. arXiv preprint arXiv:2306.12925 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[18] [18]

arXiv preprint arXiv:2305.11000 , year=

Speechgpt: Empowering large language models with intrinsic cross-modal conversational abilities , author=. arXiv preprint arXiv:2305.11000 , year=

work page arXiv

[19] [19]

arXiv preprint arXiv:2305.10790 , year=

Listen, think, and understand , author=. arXiv preprint arXiv:2305.10790 , year=

work page arXiv

[20] [20]

IEEE transactions on knowledge and data engineering , volume=

A survey on multi-task learning , author=. IEEE transactions on knowledge and data engineering , volume=. 2021 , publisher=

work page 2021

[21] [21]

GPT-4o System Card

Gpt-4o system card , author=. arXiv preprint arXiv:2410.21276 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[22] [22]

International conference on machine learning , pages=

Which tasks should be learned together in multi-task learning? , author=. International conference on machine learning , pages=. 2020 , organization=

work page 2020

[23] [23]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

End-to-end multi-task learning with attention , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

work page

[24] [24]

International Journal of Computer Vision , volume=

Curriculum learning: A survey , author=. International Journal of Computer Vision , volume=. 2022 , publisher=

work page 2022

[25] [25]

Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

Taskonomy: Disentangling task transfer learning , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

work page

[26] [26]

arXiv preprint arXiv:2502.11609 , year=

Exploiting Task Relationships for Continual Learning Using Transferability-Aware Task Embeddings , author=. arXiv preprint arXiv:2502.11609 , year=

work page arXiv

[27] [27]

Journal of Machine Learning Research , volume=

Exploring the limits of transfer learning with a unified text-to-text transformer , author=. Journal of Machine Learning Research , volume=

work page

[28] [28]

Proceedings of the 26th annual international conference on machine learning , pages=

Curriculum learning , author=. Proceedings of the 26th annual international conference on machine learning , pages=

work page

[29] [29]

International Conference on Machine Learning (ICML) , pages=

Parameter-efficient transfer learning for NLP , author=. International Conference on Machine Learning (ICML) , pages=

work page

[30] [30]

Proceedings of the National Academy of Sciences (PNAS) , volume=

Overcoming catastrophic forgetting in neural networks , author=. Proceedings of the National Academy of Sciences (PNAS) , volume=

work page

[31] [31]

Advances in Neural Information Processing Systems (NeurIPS) , volume=

Gradient surgery for multi-task learning , author=. Advances in Neural Information Processing Systems (NeurIPS) , volume=

work page

[32] [32]

IEEE signal processing magazine , volume=

Federated learning: Challenges, methods, and future directions , author=. IEEE signal processing magazine , volume=. 2020 , publisher=

work page 2020

[33] [33]

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Outrageously large neural networks: The sparsely-gated mixture-of-experts layer , author=. arXiv preprint arXiv:1701.06538 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[34] [34]

Advances in Neural Information Processing Systems , volume=

Doremi: Optimizing data mixtures speeds up language model pretraining , author=. Advances in Neural Information Processing Systems , volume=

work page

[35] [35]

Advances in Neural Information Processing Systems , volume=

Skill-it! a data-driven skills framework for understanding and training language models , author=. Advances in Neural Information Processing Systems , volume=

work page

[36] [36]

Advances in Neural Information Processing Systems , volume=

Datacomp-lm: In search of the next generation of training sets for language models , author=. Advances in Neural Information Processing Systems , volume=

work page

[37] [37]

Advances in Neural Information Processing Systems , volume=

Convergence analysis of sequential federated learning on heterogeneous data , author=. Advances in Neural Information Processing Systems , volume=

work page

[38] [38]

International conference on machine learning , pages=

A unified theory of decentralized SGD with changing topology and local updates , author=. International conference on machine learning , pages=. 2020 , organization=

work page 2020

[39] [39]

arXiv preprint arXiv:2308.12792 , year=

Sparks of large audio models: A survey and outlook , author=. arXiv preprint arXiv:2308.12792 , year=

work page arXiv

[40] [40]

Learning to Multi-Task by Active Sampling

Learning to multi-task by active sampling , author=. arXiv preprint arXiv:1702.06053 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[41] [41]

DAGM German Conference on Pattern Recognition , pages=

Examining common paradigms in multi-task learning , author=. DAGM German Conference on Pattern Recognition , pages=. 2024 , organization=

work page 2024

[42] [42]

Audiocaps: Generating captions for audios in the wild , author=. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) , pages=

work page 2019

[43] [43]

2015 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) , pages=

Chime-home: A dataset for sound source recognition in a domestic environment , author=. 2015 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) , pages=. 2015 , organization=

work page 2015

[44] [44]

ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=

Clotho: An audio captioning dataset , author=. ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=. 2020 , organization=

work page 2020

[45] [45]

2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) , pages=

Cochlscene: Acquisition of acoustic scene data using crowdsourcing , author=. 2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) , pages=. 2022 , organization=

work page 2022

[46] [46]

Language resources and evaluation , volume=

IEMOCAP: Interactive emotional dyadic motion capture database , author=. Language resources and evaluation , volume=. 2008 , publisher=

work page 2008

[47] [47]

arXiv preprint arXiv:2509.15662 , year=

Jamendo-QA: A Large-Scale Music Question Answering Dataset , author=. arXiv preprint arXiv:2509.15662 , year=

work page arXiv

[48] [48]

2021 29th European Signal Processing Conference (EUSIPCO) , pages=

What is the ground truth? reliability of multi-annotator data for audio tagging , author=. 2021 29th European Signal Processing Conference (EUSIPCO) , pages=. 2021 , organization=

work page 2021

[49] [49]

Learning Features of Music from Scratch

Learning features of music from scratch , author=. arXiv preprint arXiv:1611.09827 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[50] [50]

ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=

Music understanding llama: Advancing text-to-music generation with question answering and captioning , author=. ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=. 2024 , organization=

work page 2024

[51] [51]

ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=

Prompttts: Controllable text-to-speech with text descriptions , author=. ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=. 2023 , organization=

work page 2023

[52] [52]

IEEE Transactions on Multimedia , volume=

Audio retrieval with natural language queries: A benchmark study , author=. IEEE Transactions on Multimedia , volume=. 2022 , publisher=

work page 2022

[53] [53]

ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=

Textrolspeech: A text style control speech corpus with codec language text-to-speech models , author=. ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=. 2024 , organization=

work page 2024

[54] [54]

IEEE/ACM Transactions on Audio, Speech, and Language Processing , volume=

Wavcaps: A chatgpt-assisted weakly-labelled audio captioning dataset for audio-language multimodal research , author=. IEEE/ACM Transactions on Audio, Speech, and Language Processing , volume=. 2024 , publisher=

work page 2024

[55] [55]

2017 IEEE international conference on acoustics, speech and signal processing (ICASSP) , pages=

Audio set: An ontology and human-labeled dataset for audio events , author=. 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP) , pages=. 2017 , organization=

work page 2017

[56] [56]

arXiv preprint arXiv:2507.03175 , year=

Understanding Knowledge Transferability for Transfer Learning: A Survey , author=. arXiv preprint arXiv:2507.03175 , year=

work page arXiv