
arxiv: 2505.01747 · v2 · submitted 2025-05-03 · 📡 eess.AS · cs.SD

Low-Complexity Acoustic Scene Classification with Device Information in the DCASE 2025 Challenge

Pith reviewed 2026-05-22 16:38 UTC · model grok-4.3

classification 📡 eess.AS cs.SD
keywords acoustic scene classification · device information · low-complexity models · device mismatch · DCASE challenge · transfer learning · baseline system · inference-time adaptation

The pith

Providing device information at inference enables device-specific fine-tuning that lifts baseline accuracy from 50.72% to 51.89% in low-complexity acoustic scene classification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets up a DCASE 2025 task for classifying acoustic scenes under low-complexity constraints while addressing device mismatch. Device identity is now supplied at inference time so that models can adapt to the specific recording hardware rather than remaining device-agnostic. The provided baseline reaches 50.72 percent accuracy without device information and improves to 51.89 percent once device-specific fine-tuning is applied. Training data is restricted to the same 25 percent subset used the previous year, making transfer learning from external sources a central strategy. Multiple teams submitted systems that exceeded the baseline, confirming that the new information can be leveraged effectively.

Core claim

The paper establishes a baseline system for acoustic scene classification that achieves 50.72 percent accuracy when operating without knowledge of the recording device. When device identity is supplied at inference time and the model is fine-tuned in a device-specific manner, accuracy rises to 51.89 percent. The task re-uses the limited training subset from 2024 with unrestricted external data allowed, and evaluation on the held-out set shows that eleven of twelve participating teams surpass the baseline while the strongest entry exceeds it by more than eight percentage points.

What carries the argument

Device-specific fine-tuning: a general model is first trained, then adapted to individual recording devices, and the device ID supplied at inference time determines which adapted parameters are used.
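
As a rough illustration of how a device ID could steer inference, the sketch below routes each clip to a device-specific model when one is registered. It is not the baseline's code: the class name `DeviceAwareClassifier`, the toy lambda "models", and the fallback to the general model for unseen devices are all assumptions made for illustration.

```python
from typing import Callable, Dict, Optional, Sequence

# A "model" here is any callable mapping a feature vector to a scene label.
Model = Callable[[Sequence[float]], str]


class DeviceAwareClassifier:
    """Routes each clip to a device-specific model when one is registered."""

    def __init__(self, general_model: Model) -> None:
        self.general_model = general_model
        self.device_models: Dict[str, Model] = {}

    def register(self, device_id: str, model: Model) -> None:
        # Second training stage: a copy of the general model adapted to one device.
        self.device_models[device_id] = model

    def predict(self, features: Sequence[float], device_id: Optional[str]) -> str:
        # Assumed fallback: devices without a registered model use the general model.
        model = self.device_models.get(device_id, self.general_model)
        return model(features)


# Toy stand-ins for trained networks (hypothetical, for illustration only).
general = lambda x: "urban_park" if sum(x) > 0 else "metro_station"
tuned_for_a = lambda x: "urban_park" if sum(x) > 0.5 else "metro_station"

clf = DeviceAwareClassifier(general)
clf.register("a", tuned_for_a)
```

Keeping the routing separate from the models means the per-device variants can share one architecture and differ only in weights, which fits a low-complexity budget that applies at inference time.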

If this is right

  • Real-world systems can be deployed with prior knowledge of the microphone hardware and still maintain low computational cost.
  • Transfer learning from external data becomes the primary route to performance when labeled training material is limited to the same 25 percent subset used the previous year.
  • Low-complexity architectures must incorporate lightweight adaptation mechanisms rather than relying solely on device-invariant features.
  • Future challenges can test whether similar metadata at inference improves other audio classification tasks under hardware variation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same device-aware approach could be tested in sound event detection or speaker verification to measure cross-task gains from metadata.
  • If device identity proves useful here, comparable sensor-type labels might benefit image or video classification under varying capture hardware.
  • Larger gains may appear once adaptation methods move beyond simple fine-tuning to more parameter-efficient techniques.

Load-bearing premise

Supplying device identity at inference time permits adaptation that meaningfully reflects real-world hardware mismatch.

What would settle it

An experiment that supplies device labels at test time yet records no accuracy gain over the device-agnostic baseline on the official evaluation set.

Figures

Figures reproduced from arXiv: 2505.01747 by Annamaria Mesaros, Florian Schmid, Gerhard Widmer, Irene Martín-Morató, Paul Primus, Toni Heittola.

Figure 1: Overview of Low-Complexity Acoustic Scene Classification with Device Information. At inference time, models must operate under low-complexity constraints and handle both known (seen during training) and unknown (unseen during training) recording devices, with the device ID provided. The baseline follows a two-stage training process: first, learning a general model, then adapting it to device-specific ch…
Original abstract

This paper presents the Low-Complexity Acoustic Scene Classification with Device Information Task of the DCASE 2025 Challenge, along with its baseline system. Continuing the focus on low-complexity models, data efficiency, and device mismatch from previous editions (2022-2024), this year's task introduces a key change: recording device information is now provided at inference time. This enables the development of device-specific models that leverage device characteristics, reflecting real-world deployment scenarios in which a model is designed with awareness of the underlying hardware. The training set matches the 25% subset used in the corresponding DCASE 2024 challenge, with no restrictions on external data use, highlighting transfer learning as a central topic. The baseline achieves 50.72% accuracy with a device-agnostic model, improving to 51.89% when incorporating device-specific fine-tuning. The task attracted 31 submissions from 12 teams, with 11 teams outperforming the baseline. The top-performing submission achieved an accuracy gain of more than 8 percentage points over the baseline on the evaluation set.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces the Low-Complexity Acoustic Scene Classification with Device Information task for the DCASE 2025 Challenge. It describes the task setup continuing prior editions' focus on low-complexity models and device mismatch, with the key change that device identity is supplied at inference time to enable device-specific adaptation. The training data is the 25% subset from DCASE 2024 with no external data restrictions. The baseline system is reported to achieve 50.72% accuracy in the device-agnostic case and 51.89% after device-specific fine-tuning. Participation details note 31 submissions from 12 teams, with the top entry exceeding the baseline by more than 8 percentage points on the evaluation set.

Significance. If the reported 1.17 pp gain from device-specific fine-tuning holds under statistical scrutiny, the work would provide a useful reference point for how explicit device information can support adaptation to hardware mismatch in real-world ASC deployments. The emphasis on low-complexity models and transfer learning continues a practically relevant thread in the DCASE series, and the observed participation indicates community interest. The top submission's larger gain also highlights the headroom for further progress.
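
To make "holds under statistical scrutiny" concrete, the sketch below computes the reported gain from the two abstract figures and shows one way to set a mean gain against its seed-to-seed spread. The per-seed accuracies are invented placeholders, not results from the paper, and the helper name `gain_with_spread` is hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def gain_with_spread(agnostic: Sequence[float],
                     specific: Sequence[float]) -> Tuple[float, float]:
    """Mean paired gain in percentage points and its sample std across seeds."""
    gains = [s - a for a, s in zip(agnostic, specific)]
    return mean(gains), stdev(gains)


# Single-run figures reported in the abstract.
reported_gain = round(51.89 - 50.72, 2)  # 1.17 percentage points

# HYPOTHETICAL per-seed accuracies (placeholders, not from the paper).
agnostic_runs = [50.0, 51.0, 50.5]
specific_runs = [51.0, 52.5, 51.5]

mean_gain, gain_std = gain_with_spread(agnostic_runs, specific_runs)
```

Pairing the runs by seed removes variation that both settings share, so the spread of the paired gains is usually the more informative quantity than the two standard deviations taken separately.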

major comments (1)
  1. [Abstract] Abstract (baseline accuracies): The central empirical claim is that device information at inference enables effective adaptation, evidenced by the rise from 50.72% (device-agnostic) to 51.89% (device-specific fine-tuning). This 1.17 pp difference is presented without standard deviations, results from multiple random seeds, or any statistical significance test, so it is impossible to determine whether the gain exceeds typical run-to-run variability in ASC models and therefore whether the task modification produces a reliable signal.
minor comments (2)
  1. [Abstract] Abstract: The description of the baseline provides only high-level accuracy figures; a brief statement of the model architecture (e.g., CNN variant, parameter count) and evaluation protocol (e.g., cross-validation folds, exact fine-tuning procedure) would improve immediate readability.
  2. [Task description] Task description: The manuscript references the 25% training subset from DCASE 2024 but does not explicitly state whether the evaluation set composition or scene/device distribution matches prior years; a short comparison table would clarify continuity.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript describing the DCASE 2025 Low-Complexity Acoustic Scene Classification with Device Information task. The single major comment concerns the statistical robustness of the reported baseline improvement; we address this point directly below and will update the manuscript accordingly.

Point-by-point responses
  1. Referee: [Abstract] Abstract (baseline accuracies): The central empirical claim is that device information at inference enables effective adaptation, evidenced by the rise from 50.72% (device-agnostic) to 51.89% (device-specific fine-tuning). This 1.17 pp difference is presented without standard deviations, results from multiple random seeds, or any statistical significance test, so it is impossible to determine whether the gain exceeds typical run-to-run variability in ASC models and therefore whether the task modification produces a reliable signal.

    Authors: We agree that the 1.17 pp improvement should be accompanied by measures of variability and a statistical test to allow readers to judge whether it exceeds typical run-to-run fluctuation. In the revised manuscript we will report baseline accuracies averaged over five independent random seeds together with standard deviations for both the device-agnostic and device-specific fine-tuning settings. We will also add a paired statistical significance test (McNemar’s test on the per-sample predictions) and state the resulting p-value. These additions will be placed in the abstract and in the experimental section describing the baseline. revision: yes
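
For readers unfamiliar with the test named in the response, here is a minimal standard-library sketch of an exact two-sided McNemar test on per-sample predictions. It is an illustration under stated assumptions, not the authors' evaluation code: the function names are hypothetical, and the exact binomial form is one common variant (a chi-square approximation is another).

```python
from math import comb
from typing import Sequence, Tuple


def discordant_counts(y_true: Sequence[str],
                      pred_a: Sequence[str],
                      pred_b: Sequence[str]) -> Tuple[int, int]:
    """Count clips where exactly one of the two systems is correct."""
    only_a = only_b = 0
    for t, a, b in zip(y_true, pred_a, pred_b):
        if a == t and b != t:
            only_a += 1
        elif b == t and a != t:
            only_b += 1
    return only_a, only_b


def mcnemar_exact(only_a: int, only_b: int) -> float:
    """Two-sided exact McNemar p-value (binomial test with p = 0.5)."""
    n = only_a + only_b
    if n == 0:
        return 1.0  # the systems never disagree on correctness
    k = min(only_a, only_b)
    # Probability of a split at least as lopsided as the observed one, one tail.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)
```

Only the clips on which the two systems disagree in correctness enter the test, so it applies to a single pair of trained models and complements, rather than replaces, the multi-seed averages.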

Circularity Check

0 steps flagged

Empirical baseline report with no derivations or self-referential fitting

Full rationale

The paper is an empirical description of a DCASE 2025 challenge task and its baseline system. It reports measured accuracies (50.72% device-agnostic and 51.89% with device-specific fine-tuning) on held-out evaluation data with no equations, first-principles derivations, fitted parameters renamed as predictions, or mathematical claims that could reduce to their own inputs by construction. References to prior DCASE editions are contextual background rather than load-bearing self-citations justifying a uniqueness theorem or ansatz. The central results are externally falsifiable experimental measurements, not derived outputs, making the paper self-contained with no circularity in any derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No theoretical derivations or new postulated entities; the paper is a challenge task definition and empirical baseline report.

pith-pipeline@v0.9.0 · 5743 in / 960 out tokens · 52850 ms · 2026-05-22T16:38:45.359572+00:00 · methodology


Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · 1 internal anchor

  1. [1]

    Low-Complexity Acoustic Scene Classification with Device Information in the DCASE 2025 Challenge

    INTRODUCTION Acoustic Scene Classification (ASC) aims to identify the type of environment in which an audio recording was made, based on a short excerpt [1]. Environments are defined as a set of real-world locations, such as Metro station, Urban park, or Public square. The ASC task has a long-standing presence in the DCASE Challenge, evolving through ...

  2. [2]

    The most commonly used methods in 2023 and 2024 were augmentation-based methods, such as Freq-MixStyle [7,8] and device impulse response augmentation [9]

    PREVIOUS EDITIONS In past editions of the task, various strategies have been proposed to improve generalization across different, and potentially unknown, recording devices. The most commonly used methods in 2023 and 2024 were augmentation-based methods, such as Freq-MixStyle [7,8] and device impulse response augmentation [9]. Other approaches aimed to ...

  3. [3]

    However, this year’s setup introduces key variations to the handling of device mismatch and transfer learning

    TASK SETUP As discussed in the previous section, device mismatch, low-complexity constraints, and transfer learning have been extensively studied in the context of the ASC task. However, this year’s setup introduces key variations to the handling of device mismatch and transfer learning. Regarding device mismatch, the recording device ID is now provide...

  4. [4]

    It employs a receptive-field-regularized, factorized CNN architecture

    BASELINE SYSTEM Following the 2024 edition [5], the baseline system builds on a simplified variant of the top-performing submission from the 2023 edition [25]. It employs a receptive-field-regularized, factorized CNN architecture. Audio recordings are first resampled to 32 kHz, then converted into mel spectrograms using a 4096-point FFT with a window ...

  5. [5]

    CHALLENGE RESULTS The challenge results will be added after the challenge has ended

  6. [6]

    Building on previous editions, we continue to address challenges such as low-complexity constraints, device mismatch, and data scarcity

    CONCLUSION This paper presented the setup and baseline system for Task 1 of the DCASE 2025 Challenge. Building on previous editions, we continue to address challenges such as low-complexity constraints, device mismatch, and data scarcity. A key refinement is the provision of device information at inference time, enabling device-specific modeling. The ...

  7. [7]

    ACKNOWLEDGMENT The LIT AI Lab is supported by the Federal State of Upper Austria. Gerhard Widmer’s work is supported by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme, grant agreement No 101019375 (Whither Music?)

  8. [8]

    Approaches to complex sound scene analysis,

    E. Benetos, D. Stowell, and M. D. Plumbley, “Approaches to complex sound scene analysis,” in Cham: Springer International Publishing, 2018. [Footnote 2 of the paper: Source code: https://github.com/CPJKU/dcase2025_task1_baseline/tree/main. Page header: Detection and Classification of Acoustic Scenes and Events 2025]

  9. [9]

    Acoustic scene classification in DCASE 2020 challenge: Generalization across devices and low complexity solutions,

    T. Heittola, A. Mesaros, and T. Virtanen, “Acoustic scene classification in DCASE 2020 challenge: Generalization across devices and low complexity solutions,” in DCASE Workshop, 2020

  10. [10]

    Low-complexity acoustic scene classification for multi-device audio: Analysis of DCASE 2021 challenge systems,

    I. Martín-Morató, T. Heittola, A. Mesaros, and T. Virtanen, “Low-complexity acoustic scene classification for multi-device audio: Analysis of DCASE 2021 challenge systems,” in DCASE Workshop, 2021

  11. [11]

    Low-complexity acoustic scene classification in DCASE 2022 challenge,

    I. Martín-Morató, F. Paissan, A. Ancilotto, T. Heittola, A. Mesaros, E. Farella, A. Brutti, and T. Virtanen, “Low-complexity acoustic scene classification in DCASE 2022 challenge,” in DCASE Workshop, 2022

  12. [12]

    Data-efficient low-complexity acoustic scene classification in the DCASE 2024 challenge,

    F. Schmid, P. Primus, T. Heittola, A. Mesaros, I. Martín-Morató, K. Koutini, and G. Widmer, “Data-efficient low-complexity acoustic scene classification in the DCASE 2024 challenge,” in DCASE Workshop, 2024

  13. [13]

    A multi-device dataset for urban acoustic scene classification,

    A. Mesaros, T. Heittola, and T. Virtanen, “A multi-device dataset for urban acoustic scene classification,” in DCASE Workshop, 2018

  14. [14]

    Domain generalization with relaxed instance frequency-wise normalization for multi-device acoustic scene classification,

    B. Kim, S. Yang, J. Kim, H. Park, J. Lee, and S. Chang, “Domain generalization with relaxed instance frequency-wise normalization for multi-device acoustic scene classification,” in Interspeech, 2022

  15. [15]

    CP-JKU submission to DCASE22: Distilling knowledge for low-complexity convolutional neural networks from a patchout audio transformer,

    F. Schmid, S. Masoudian, K. Koutini, and G. Widmer, “CP-JKU submission to DCASE22: Distilling knowledge for low-complexity convolutional neural networks from a patchout audio transformer,” DCASE Challenge, Tech. Rep., 2022

  16. [16]

    Device-robust acoustic scene classification via impulse response augmentation,

    T. Morocutti, F. Schmid, K. Koutini, and G. Widmer, “Device-robust acoustic scene classification via impulse response augmentation,” in EUSIPCO, 2023

  17. [17]

    Ascdomain: Domain invariant device-adversarial isotropic knowledge distillation convolutional neural architecture,

    H. Truchan, T. H. Ngo, and Z. Ahmadi, “Ascdomain: Domain invariant device-adversarial isotropic knowledge distillation convolutional neural architecture,” in ICASSP, 2025

  18. [18]

    CP-JKU submissions to DCASE’20: Low-complexity cross-device acoustic scene classification with RF-regularized CNNs,

    K. Koutini, F. Henkel, H. Eghbal-zadeh, and G. Widmer, “CP-JKU submissions to DCASE’20: Low-complexity cross-device acoustic scene classification with RF-regularized CNNs,” DCASE Challenge, Tech. Rep., 2020

  19. [19]

    QTI submission to DCASE 2021: Residual normalization for device-imbalanced acoustic scene classification with efficient design,

    B. Kim, S. Yang, J. Kim, and S. Chang, “QTI submission to DCASE 2021: Residual normalization for device-imbalanced acoustic scene classification with efficient design,” DCASE Challenge, Tech. Rep., 2021

  20. [20]

    Hyu submission for the DCASE 2022: Efficient fine-tuning method using device-aware data-random-drop for device-imbalanced acoustic scene classification,

    J.-H. Lee, J.-H. Choi, P. M. Byun, and J.-H. Chang, “Hyu submission for the DCASE 2022: Efficient fine-tuning method using device-aware data-random-drop for device-imbalanced acoustic scene classification,” DCASE Challenge, Tech. Rep., 2022

  21. [21]

    CPJKU submission to DCASE21: Cross-device audio scene classification with wide sparse frequency-damped CNNs,

    K. Koutini, J. Schlüter, and G. Widmer, “CPJKU submission to DCASE21: Cross-device audio scene classification with wide sparse frequency-damped CNNs,” DCASE Challenge, Tech. Rep., 2021

  22. [22]

    Data-efficient acoustic scene classification via ensemble teachers distillation and pruning,

    H. Bing, H. Wen, C. Zhengyang, J. Anbai, C. Xie, F. Pingyi, L. Cheng, L. Zhiqiang, L. Jia, Z. Wei-Qiang, and Q. Yanmin, “Data-efficient acoustic scene classification via ensemble teachers distillation and pruning,” DCASE Challenge, Tech. Rep., 2024

  23. [23]

    A lottery ticket hypothesis framework for low-complexity device-robust neural acoustic scene classification,

    C.-H. H. Yang, H. Hu, S. M. Siniscalchi, Q. Wang, W. Yuyang, X. Xia, Y. Zhao, Y. Wu, Y. Wang, J. Du, and C.-H. Lee, “A lottery ticket hypothesis framework for low-complexity device-robust neural acoustic scene classification,” DCASE Challenge, Tech. Rep., 2021

  24. [24]

    Low-complexity acoustic scene classification using blueprint separable convolution and knowledge distillation,

    J. Tan and Y. Li, “Low-complexity acoustic scene classification using blueprint separable convolution and knowledge distillation,” DCASE Challenge, Tech. Rep., 2023

  25. [25]

    DCASE2023 task1 submission: Device simulation and time-frequency separable convolution for acoustic scene classification,

    Y. Cai, M. Lin, C. Zhu, S. Li, and X. Shao, “DCASE2023 task1 submission: Device simulation and time-frequency separable convolution for acoustic scene classification,” DCASE Challenge, Tech. Rep., 2023

  26. [26]

    CP-JKU submission to DCASE23: Efficient acoustic scene classification with cp-mobile,

    F. Schmid, T. Morocutti, S. Masoudian, K. Koutini, and G. Widmer, “CP-JKU submission to DCASE23: Efficient acoustic scene classification with cp-mobile,” DCASE Challenge, Tech. Rep., 2023

  27. [27]

    Low-complexity acoustic scene classification with limited training data,

    Y.-F. Shao, P. Jiang, and W. Li, “Low-complexity acoustic scene classification with limited training data,” DCASE Challenge, Tech. Rep., 2024

  28. [28]

    Audio set: An ontology and human-labeled dataset for audio events,

    J. F. Gemmeke, D. P. W. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio set: An ontology and human-labeled dataset for audio events,” in ICASSP, 2017

  29. [29]

    DCASE2024 task1 submission: Data-efficient acoustic scene classification with self-supervised teachers,

    Y. Cai, M. Lin, S. Li, and X. Shao, “DCASE2024 task1 submission: Data-efficient acoustic scene classification with self-supervised teachers,” DCASE Challenge, Tech. Rep., 2024

  30. [30]

    Data-efficient acoustic scene classification with pre-trained CP-Mobile,

    N. David, R. Aida, and S. Patrick, “Data-efficient acoustic scene classification with pre-trained CP-Mobile,” DCASE Challenge, Tech. Rep., 2024

  31. [31]

    Upb-nt submission to DCASE24: Dataset pruning for targeted knowledge distillation,

    A. Werning and R. Haeb-Umbach, “Upb-nt submission to DCASE24: Dataset pruning for targeted knowledge distillation,” DCASE Challenge, Tech. Rep., 2024

  32. [32]

    Distilling the knowledge of transformers and CNNs with CP-mobile,

    F. Schmid, T. Morocutti, S. Masoudian, K. Koutini, and G. Widmer, “Distilling the knowledge of transformers and CNNs with CP-mobile,” in DCASE Workshop, 2023