Low-Complexity Acoustic Scene Classification with Device Information in the DCASE 2025 Challenge
Pith reviewed 2026-05-22 16:38 UTC · model grok-4.3
The pith
Providing device information at inference enables device-specific fine-tuning that lifts baseline accuracy from 50.72% to 51.89% in low-complexity acoustic scene classification.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes a baseline system for acoustic scene classification that achieves 50.72 percent accuracy when operating without knowledge of the recording device. When device identity is supplied at inference time and the model is fine-tuned in a device-specific manner, accuracy rises to 51.89 percent. The task re-uses the limited training subset from 2024 with unrestricted external data allowed, and evaluation on the held-out set shows that eleven of twelve participating teams surpass the baseline while the strongest entry exceeds it by more than eight percentage points.
What carries the argument
Device-specific fine-tuning, which adapts the model parameters using knowledge of the recording device supplied at inference time.
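As a rough illustration of what device-specific modeling can look like at inference time, the sketch below routes each clip to a model fine-tuned for its recording device. Everything named here (`DeviceRouter`, the device IDs, the stand-in classifiers, and the fallback to a general model for devices without a fine-tuned copy) is an assumption made for illustration, not the baseline's actual code.

```python
from typing import Callable, Dict, Sequence

# A "model" here is any callable mapping a feature vector to a scene label.
Model = Callable[[Sequence[float]], str]


class DeviceRouter:
    """Pick a device-specific model when one exists, else a general model."""

    def __init__(self, general: Model, per_device: Dict[str, Model]) -> None:
        self.general = general
        self.per_device = dict(per_device)

    def predict(self, features: Sequence[float], device_id: str) -> str:
        # The device ID is assumed to arrive alongside the audio at inference.
        model = self.per_device.get(device_id, self.general)
        return model(features)


# Hypothetical stand-ins for a general model and one fine-tuned copy.
def general_model(features: Sequence[float]) -> str:
    return "park" if sum(features) > 0 else "metro_station"


def device_a_model(features: Sequence[float]) -> str:
    # Pretend fine-tuning shifted the decision threshold for device "a".
    return "park" if sum(features) > 1.0 else "metro_station"


router = DeviceRouter(general_model, {"a": device_a_model})
```

Calling `router.predict(features, device_id)` uses the fine-tuned copy for device "a" and the general model for any other ID. Keeping one shared architecture with per-device weights is one way to stay within a fixed complexity budget, since only a single model is active per clip.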
If this is right
- Real-world systems can be deployed with prior knowledge of the microphone hardware and still maintain low computational cost.
- Transfer learning from external data becomes the primary route to performance when labeled training material is limited to the 25% subset used in DCASE 2024.
- Low-complexity architectures must incorporate lightweight adaptation mechanisms rather than relying solely on device-invariant features.
- Future challenges can test whether similar metadata at inference improves other audio classification tasks under hardware variation.
Where Pith is reading between the lines
- The same device-aware approach could be tested in sound event detection or speaker verification to measure cross-task gains from metadata.
- If device identity proves useful here, comparable sensor-type labels might benefit image or video classification under varying capture hardware.
- Larger gains may appear once adaptation methods move beyond simple fine-tuning to more parameter-efficient techniques.
Load-bearing premise
Supplying device identity at inference time permits adaptation that meaningfully reflects real-world hardware mismatch.
What would settle it
An experiment that supplies device labels at test time yet records no accuracy gain over the device-agnostic baseline on the official evaluation set.
Original abstract
This paper presents the Low-Complexity Acoustic Scene Classification with Device Information Task of the DCASE 2025 Challenge, along with its baseline system. Continuing the focus on low-complexity models, data efficiency, and device mismatch from previous editions (2022-2024), this year's task introduces a key change: recording device information is now provided at inference time. This enables the development of device-specific models that leverage device characteristics, reflecting real-world deployment scenarios in which a model is designed with awareness of the underlying hardware. The training set matches the 25% subset used in the corresponding DCASE 2024 challenge, with no restrictions on external data use, highlighting transfer learning as a central topic. The baseline achieves 50.72% accuracy with a device-agnostic model, improving to 51.89% when incorporating device-specific fine-tuning. The task attracted 31 submissions from 12 teams, with 11 teams outperforming the baseline. The top-performing submission achieved an accuracy gain of more than 8 percentage points over the baseline on the evaluation set.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the Low-Complexity Acoustic Scene Classification with Device Information task for the DCASE 2025 Challenge. It describes the task setup continuing prior editions' focus on low-complexity models and device mismatch, with the key change that device identity is supplied at inference time to enable device-specific adaptation. The training data is the 25% subset from DCASE 2024 with no external data restrictions. The baseline system is reported to achieve 50.72% accuracy in the device-agnostic case and 51.89% after device-specific fine-tuning. Participation details note 31 submissions from 12 teams, with the top entry exceeding the baseline by more than 8 percentage points on the evaluation set.
Significance. If the reported 1.17 pp gain from device-specific fine-tuning holds under statistical scrutiny, the work would provide a useful reference point for how explicit device information can support adaptation to hardware mismatch in real-world ASC deployments. The emphasis on low-complexity models and transfer learning continues a practically relevant thread in the DCASE series, and the observed participation indicates community interest. The top submission's larger gain also highlights the headroom for further progress.
major comments (1)
- [Abstract] Abstract (baseline accuracies): The central empirical claim is that device information at inference enables effective adaptation, evidenced by the rise from 50.72% (device-agnostic) to 51.89% (device-specific fine-tuning). This 1.17 pp difference is presented without standard deviations, results from multiple random seeds, or any statistical significance test, so it is impossible to determine whether the gain exceeds typical run-to-run variability in ASC models and therefore whether the task modification produces a reliable signal.
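To make the variability concern concrete, here is a minimal sketch of how per-seed accuracies could be summarized and compared against the reported 1.17 pp gain (51.89% minus 50.72%). The function names and the per-seed numbers are placeholders chosen for illustration; they are not results from the paper, and the "gain larger than combined spread" rule is a crude heuristic rather than a formal test.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarize(accuracies: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of per-seed accuracies."""
    if len(accuracies) < 2:
        raise ValueError("need at least two runs to estimate variability")
    return mean(accuracies), stdev(accuracies)


def gain_exceeds_noise(base: Sequence[float], tuned: Sequence[float]) -> bool:
    """Crude check: is the mean gain larger than the combined spread?"""
    base_mean, base_std = summarize(base)
    tuned_mean, tuned_std = summarize(tuned)
    return (tuned_mean - base_mean) > (base_std + tuned_std)


# Placeholder per-seed accuracies in percent; NOT results from the paper.
agnostic_runs = [50.0, 51.0, 52.0]
device_runs = [51.0, 52.0, 53.0]
```

With these placeholder runs the mean gain is 1.0 pp while each setting has a standard deviation of 1.0 pp, so `gain_exceeds_noise(agnostic_runs, device_runs)` returns `False`. A gain of similar size to the reported one would be hard to interpret under that much seed variance, which is the situation the comment asks the authors to rule out.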
minor comments (2)
- [Abstract] Abstract: The description of the baseline provides only high-level accuracy figures; a brief statement of the model architecture (e.g., CNN variant, parameter count) and evaluation protocol (e.g., cross-validation folds, exact fine-tuning procedure) would improve immediate readability.
- [Task description] Task description: The manuscript references the 25% training subset from DCASE 2024 but does not explicitly state whether the evaluation set composition or scene/device distribution matches prior years; a short comparison table would clarify continuity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript describing the DCASE 2025 Low-Complexity Acoustic Scene Classification with Device Information task. The single major comment concerns the statistical robustness of the reported baseline improvement; we address this point directly below and will update the manuscript accordingly.
- Referee: [Abstract] Abstract (baseline accuracies): The central empirical claim is that device information at inference enables effective adaptation, evidenced by the rise from 50.72% (device-agnostic) to 51.89% (device-specific fine-tuning). This 1.17 pp difference is presented without standard deviations, results from multiple random seeds, or any statistical significance test, so it is impossible to determine whether the gain exceeds typical run-to-run variability in ASC models and therefore whether the task modification produces a reliable signal.
Authors: We agree that the 1.17 pp improvement should be accompanied by measures of variability and a statistical test to allow readers to judge whether it exceeds typical run-to-run fluctuation. In the revised manuscript we will report baseline accuracies averaged over five independent random seeds together with standard deviations for both the device-agnostic and device-specific fine-tuning settings. We will also add a paired statistical significance test (McNemar’s test on the per-sample predictions) and state the resulting p-value. These additions will be placed in the abstract and in the experimental section describing the baseline.
Revision: yes
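The paired test mentioned in the response can be sketched as follows. This is an exact (binomial) form of McNemar's test written from its textbook definition, not code from the authors; the function name and the integer label encoding are assumptions.

```python
from math import comb
from typing import Sequence


def mcnemar_exact(
    y_true: Sequence[int], pred_a: Sequence[int], pred_b: Sequence[int]
) -> float:
    """Two-sided exact McNemar p-value for two systems on the same samples."""
    if not (len(y_true) == len(pred_a) == len(pred_b)):
        raise ValueError("all sequences must have the same length")
    # Count discordant pairs: samples where exactly one system is correct.
    only_a = sum(a == t and b != t for t, a, b in zip(y_true, pred_a, pred_b))
    only_b = sum(a != t and b == t for t, a, b in zip(y_true, pred_a, pred_b))
    n = only_a + only_b
    if n == 0:
        return 1.0  # the systems never disagree on correctness
    # Under the null, only_a ~ Binomial(n, 0.5); double the smaller tail.
    tail = sum(comb(n, k) for k in range(min(only_a, only_b) + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)
```

Here `pred_a` and `pred_b` would be the device-agnostic and device-specific predictions on the same evaluation clips. Only samples where the two systems disagree in correctness influence the p-value, which is why the test needs per-sample predictions rather than the two headline accuracies.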
Circularity Check
Empirical baseline report with no derivations or self-referential fitting
The paper is an empirical description of a DCASE 2025 challenge task and its baseline system. It reports measured accuracies (50.72% device-agnostic and 51.89% with device-specific fine-tuning) on held-out evaluation data with no equations, first-principles derivations, fitted parameters renamed as predictions, or mathematical claims that could reduce to their own inputs by construction. References to prior DCASE editions are contextual background rather than load-bearing self-citations justifying a uniqueness theorem or ansatz. The central results are externally falsifiable experimental measurements, not derived outputs, making the paper self-contained with no circularity in any derivation chain.
Reference graph
Works this paper leans on
- [1] Low-Complexity Acoustic Scene Classification with Device Information in the DCASE 2025 Challenge. INTRODUCTION: Acoustic Scene Classification (ASC) aims to identify the type of environment in which an audio recording was made, based on a short excerpt [1]. Environments are defined as a set of real-world locations, such as Metro station, Urban park, or Public square. The ASC task has a long-standing presence in the DCASE Challenge, evolving through ...
- [2] PREVIOUS EDITIONS: In past editions of the task, various strategies have been proposed to improve generalization across different, and potentially unknown, recording devices. The most commonly used methods in 2023 and 2024 were augmentation-based methods, such as Freq-MixStyle [7,8] and device impulse response augmentation [9]. Other approaches aimed to ...
- [3] TASK SETUP: As discussed in the previous section, device mismatch, low-complexity constraints, and transfer learning have been extensively studied in the context of the ASC task. However, this year’s setup introduces key variations to the handling of device mismatch and transfer learning. Regarding device mismatch, the recording device ID is now provide...
- [4] BASELINE SYSTEM: Following the 2024 edition [5], the baseline system builds on a simplified variant of the top-performing submission from the 2023 edition [25]. It employs a receptive-field-regularized, factorized CNN architecture. Audio recordings are first resampled to 32 kHz, then converted into mel spectrograms using a 4096-point FFT with a window ...
- [5] CHALLENGE RESULTS: The challenge results will be added after the challenge has ended
- [6] CONCLUSION: This paper presented the setup and baseline system for Task 1 of the DCASE 2025 Challenge. Building on previous editions, we continue to address challenges such as low-complexity constraints, device mismatch, and data scarcity. A key refinement is the provision of device information at inference time, enabling device-specific modeling. The ...
- [7] ACKNOWLEDGMENT: The LIT AI Lab is supported by the Federal State of Upper Austria. Gerhard Widmer’s work is supported by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme, grant agreement No 101019375 (Whither Music?)
- [8] E. Benetos, D. Stowell, and M. D. Plumbley, “Approaches to complex sound scene analysis,” in Cham: Springer International Publishing, 2018.
  Source code (footnote 2 of the paper): https://github.com/CPJKU/dcase2025_task1_baseline/tree/main
  Detection and Classification of Acoustic Scenes and Events 2025
- [9] T. Heittola, A. Mesaros, and T. Virtanen, “Acoustic scene classification in DCASE 2020 challenge: Generalization across devices and low complexity solutions,” in DCASE Workshop, 2020
- [10] I. Martín-Morató, T. Heittola, A. Mesaros, and T. Virtanen, “Low-complexity acoustic scene classification for multi-device audio: Analysis of DCASE 2021 challenge systems,” in DCASE Workshop, 2021
- [11] I. Martín-Morató, F. Paissan, A. Ancilotto, T. Heittola, A. Mesaros, E. Farella, A. Brutti, and T. Virtanen, “Low-complexity acoustic scene classification in DCASE 2022 challenge,” in DCASE Workshop, 2022
- [12] F. Schmid, P. Primus, T. Heittola, A. Mesaros, I. Martín-Morató, K. Koutini, and G. Widmer, “Data-efficient low-complexity acoustic scene classification in the DCASE 2024 challenge,” in DCASE Workshop, 2024
- [13] A. Mesaros, T. Heittola, and T. Virtanen, “A multi-device dataset for urban acoustic scene classification,” in DCASE Workshop, 2018
- [14] B. Kim, S. Yang, J. Kim, H. Park, J. Lee, and S. Chang, “Domain generalization with relaxed instance frequency-wise normalization for multi-device acoustic scene classification,” in Interspeech, 2022
- [15] F. Schmid, S. Masoudian, K. Koutini, and G. Widmer, “CP-JKU submission to DCASE22: Distilling knowledge for low-complexity convolutional neural networks from a patchout audio transformer,” DCASE Challenge, Tech. Rep., 2022
- [16] T. Morocutti, F. Schmid, K. Koutini, and G. Widmer, “Device-robust acoustic scene classification via impulse response augmentation,” in EUSIPCO, 2023
- [17] H. Truchan, T. H. Ngo, and Z. Ahmadi, “Ascdomain: Domain invariant device-adversarial isotropic knowledge distillation convolutional neural architecture,” in ICASSP, 2025
- [18] K. Koutini, F. Henkel, H. Eghbal-zadeh, and G. Widmer, “CP-JKU submissions to DCASE’20: Low-complexity cross-device acoustic scene classification with RF-regularized CNNs,” DCASE Challenge, Tech. Rep., 2020
- [19] B. Kim, S. Yang, J. Kim, and S. Chang, “QTI submission to DCASE 2021: Residual normalization for device-imbalanced acoustic scene classification with efficient design,” DCASE Challenge, Tech. Rep., 2021
- [20] J.-H. Lee, J.-H. Choi, P. M. Byun, and J.-H. Chang, “Hyu submission for the DCASE 2022: Efficient fine-tuning method using device-aware data-random-drop for device-imbalanced acoustic scene classification,” DCASE Challenge, Tech. Rep., 2022
- [21] K. Koutini, J. Schlüter, and G. Widmer, “CPJKU submission to DCASE21: Cross-device audio scene classification with wide sparse frequency-damped CNNs,” DCASE Challenge, Tech. Rep., 2021
- [22] H. Bing, H. Wen, C. Zhengyang, J. Anbai, C. Xie, F. Pingyi, L. Cheng, L. Zhiqiang, L. Jia, Z. Wei-Qiang, and Q. Yanmin, “Data-efficient acoustic scene classification via ensemble teachers distillation and pruning,” DCASE Challenge, Tech. Rep., 2024
- [23] C.-H. H. Yang, H. Hu, S. M. Siniscalchi, Q. Wang, W. Yuyang, X. Xia, Y. Zhao, Y. Wu, Y. Wang, J. Du, and C.-H. Lee, “A lottery ticket hypothesis framework for low-complexity device-robust neural acoustic scene classification,” DCASE Challenge, Tech. Rep., 2021
- [24] J. Tan and Y. Li, “Low-complexity acoustic scene classification using blueprint separable convolution and knowledge distillation,” DCASE Challenge, Tech. Rep., 2023
- [25] Y. Cai, M. Lin, C. Zhu, S. Li, and X. Shao, “DCASE2023 task1 submission: Device simulation and time-frequency separable convolution for acoustic scene classification,” DCASE Challenge, Tech. Rep., 2023
- [26] F. Schmid, T. Morocutti, S. Masoudian, K. Koutini, and G. Widmer, “CP-JKU submission to DCASE23: Efficient acoustic scene classification with cp-mobile,” DCASE Challenge, Tech. Rep., 2023
- [27] Y.-F. Shao, P. Jiang, and W. Li, “Low-complexity acoustic scene classification with limited training data,” DCASE Challenge, Tech. Rep., 2024
- [28] J. F. Gemmeke, D. P. W. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio set: An ontology and human-labeled dataset for audio events,” in ICASSP, 2017
- [29] Y. Cai, M. Lin, S. Li, and X. Shao, “DCASE2024 task1 submission: Data-efficient acoustic scene classification with self-supervised teachers,” DCASE Challenge, Tech. Rep., 2024
- [30] N. David, R. Aida, and S. Patrick, “Data-efficient acoustic scene classification with pre-trained CP-Mobile,” DCASE Challenge, Tech. Rep., 2024
- [31] A. Werning and R. Haeb-Umbach, “Upb-nt submission to DCASE24: Dataset pruning for targeted knowledge distillation,” DCASE Challenge, Tech. Rep., 2024
- [32] F. Schmid, T. Morocutti, S. Masoudian, K. Koutini, and G. Widmer, “Distilling the knowledge of transformers and CNNs with CP-mobile,” in DCASE Workshop, 2023