XLSR-MamBo: Scaling the Hybrid Mamba-Attention Backbone for Audio Deepfake Detection
Pith reviewed 2026-05-16 16:59 UTC · model grok-4.3
The pith
Scaling a hybrid Mamba-Attention backbone built on Hydra yields competitive audio deepfake detection, which the paper attributes to Hydra capturing bidirectional temporal dependencies more efficiently than heuristic dual-branch strategies.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose XLSR-MamBo, a framework integrating an XLSR front-end with synergistic Mamba-Attention backbones. Systematic evaluation of four topological designs using Mamba, Mamba2, Hydra, and Gated DeltaNet shows that the MamBo-3-Hydra-N3 configuration reaches competitive performance on the ASVspoof 2021 LA, DF, and In-the-Wild benchmarks. This stems from Hydra's native bidirectional modeling, which captures holistic temporal dependencies more efficiently than heuristic dual-branch strategies in prior work. Scaling backbone depth further reduces performance variance and instability, while DFADD results confirm robust generalization to unseen synthesis methods.
What carries the argument
The XLSR-MamBo modular framework that combines an XLSR front-end with hybrid Mamba-Attention backbones in evaluated topologies, particularly the MamBo-3-Hydra-N3 design.
If this is right
- Hydra's native bidirectional modeling captures holistic temporal dependencies more efficiently than heuristic dual-branch strategies.
- Scaling backbone depth mitigates the performance variance and instability observed in shallower models.
- The hybrid framework generalizes robustly to unseen diffusion- and flow-matching-based synthesis methods on DFADD.
- Hybrid architectures can effectively capture artifacts in spoofed speech signals for audio deepfake detection.
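To make the "heuristic dual-branch" baseline in the first bullet concrete, here is a minimal sketch under stated assumptions. It uses a toy scalar recurrence rather than Mamba or Hydra, and the names `causal_scan`, `dual_branch_scan`, and `decay` are hypothetical. It only illustrates the general pattern of running a causal scan once per direction and summing the two outputs. Hydra's native bidirectional mechanism is not reproduced here.

```python
def causal_scan(x, decay):
    """Toy causal recurrence h[t] = decay * h[t-1] + x[t], linear in len(x)."""
    h, out = 0.0, []
    for value in x:
        h = decay * h + value
        out.append(h)
    return out


def dual_branch_scan(x, decay):
    """Heuristic bidirectional mixing: one causal scan per direction, summed."""
    forward = causal_scan(x, decay)
    # Scan the reversed sequence, then flip the result back into time order.
    backward = causal_scan(x[::-1], decay)[::-1]
    return [f + b for f, b in zip(forward, backward)]


# Illustrative input only (not data from the paper).
signal = [1.0, 0.0, 0.0, 2.0]
causal = causal_scan(signal, decay=0.5)
bidirectional = dual_branch_scan(signal, decay=0.5)
print(causal)         # [1.0, 0.5, 0.25, 2.125]
print(bidirectional)  # [2.25, 1.0, 1.25, 4.125]
```

In the causal output the first frame never sees the later event at index 3, while the dual-branch output does, at the cost of two full scans over the sequence.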
Where Pith is reading between the lines
- The same depth-scaling pattern could be tested on other audio classification tasks that rely on long-range temporal structure.
- If bidirectional SSMs prove more stable at depth, the approach might extend to multimodal detection settings where frequency and time artifacts interact.
- Further increases in backbone depth could be explored to determine whether performance continues to improve or saturates on current benchmarks.
Load-bearing premise
That Hydra's native bidirectional modeling captures holistic temporal dependencies more efficiently than prior heuristic dual-branch strategies and that increasing backbone depth generally reduces performance variance and instability.
What would settle it
A direct comparison showing that the MamBo-3-Hydra-N3 configuration fails to reach competitive performance against state-of-the-art systems on the ASVspoof 2021 LA, DF, or In-the-Wild benchmarks.
Original abstract
Advanced speech synthesis technologies have enabled highly realistic speech generation, posing security risks that motivate research into audio deepfake detection (ADD). While state space models (SSMs) offer linear complexity, pure causal SSM architectures often struggle with the content-based retrieval required to capture global frequency-domain artifacts. To address this, we explore the scaling properties of hybrid architectures by proposing XLSR-MamBo, a modular framework integrating an XLSR front-end with synergistic Mamba-Attention backbones. We systematically evaluate four topological designs using advanced SSM variants: Mamba, Mamba2, Hydra, and Gated DeltaNet. Experimental results demonstrate that the MamBo-3-Hydra-N3 configuration achieves competitive performance compared to other state-of-the-art systems on the ASVspoof 2021 LA, DF, and In-the-Wild benchmarks. This performance benefits from Hydra's native bidirectional modeling, which captures holistic temporal dependencies more efficiently than the heuristic dual-branch strategies employed in prior work. Furthermore, evaluations on the DFADD dataset demonstrate robust generalization to unseen diffusion- and flow-matching-based synthesis methods. Crucially, our analysis reveals that scaling backbone depth effectively mitigates the performance variance and instability observed in shallower models. These results demonstrate the hybrid framework's ability to capture artifacts in spoofed speech signals, providing an effective method for ADD.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes XLSR-MamBo, a modular hybrid framework pairing an XLSR front-end with Mamba-Attention backbones that incorporate Mamba, Mamba2, Hydra, and Gated DeltaNet variants. It evaluates four topological designs and reports that the MamBo-3-Hydra-N3 configuration attains competitive performance on ASVspoof 2021 LA, DF, and In-the-Wild benchmarks, attributes gains to Hydra's native bidirectional modeling, shows generalization on DFADD, and concludes that scaling backbone depth mitigates variance and instability seen in shallower models.
Significance. If the empirical results and the depth-scaling claim are substantiated with statistical rigor, the work would strengthen the case for hybrid SSM-attention architectures in audio deepfake detection by demonstrating efficient capture of global artifacts and improved stability, offering a practical alternative to prior dual-branch heuristics.
major comments (2)
- [Abstract] The load-bearing claim that 'scaling backbone depth effectively mitigates the performance variance and instability observed in shallower models' is unsupported by any reported per-run standard deviations, multi-seed averages, or hypothesis tests comparing the N1/N2/N3 configurations; without these the assertion cannot be distinguished from single-run effects.
- [Experimental results] The abstract asserts competitive EER on ASVspoof 2021 LA/DF/In-the-Wild and DFADD generalization but supplies no concrete metrics, baseline comparisons, ablation tables, or error bars, preventing verification of the data-to-claim linkage for the MamBo-3-Hydra-N3 configuration.
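For readers unfamiliar with the metric these comments turn on, the equal error rate (EER) is the operating point where the false acceptance rate and false rejection rate coincide. The following is a minimal sketch, assuming higher scores mean "more likely bona fide". The function name `equal_error_rate` and the scores are hypothetical, and this simple threshold sweep is not the scoring tool used by the paper or by the ASVspoof challenge.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate EER via a sweep over all observed scores as thresholds."""
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), None
    for threshold in thresholds:
        # False rejection: bona fide trials scored below the threshold.
        frr = sum(s < threshold for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: spoofed trials scored at or above the threshold.
        far = sum(s >= threshold for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Illustrative scores only (not data from the paper).
print(equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1]))  # 0.25
```

With one bona fide trial scored below one spoofed trial, a quarter of each class is misclassified at the best threshold, giving an EER of 25%.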
minor comments (2)
- Define acronyms (XLSR, EER, SSM, ADD) on first use and ensure consistent notation for model variants (e.g., MamBo-3-Hydra-N3) throughout.
- Add error bars or variance indicators to all performance tables and figures to improve clarity of the stability claims.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which highlights important areas for strengthening the statistical rigor and clarity of our claims. We address each major comment point by point below and commit to revisions that improve the manuscript without misrepresenting our results.
point-by-point responses
- Referee: [Abstract] The load-bearing claim that 'scaling backbone depth effectively mitigates the performance variance and instability observed in shallower models' is unsupported by any reported per-run standard deviations, multi-seed averages, or hypothesis tests comparing the N1/N2/N3 configurations; without these the assertion cannot be distinguished from single-run effects.
  Authors: We acknowledge that the current manuscript reports only single-run results for the N1/N2/N3 depth variants and does not include per-run standard deviations, multi-seed averages, or formal hypothesis tests. The depth-scaling observation was derived from consistent trends across our experiments, but we agree this falls short of statistical substantiation. In the revision we will rerun the N1, N2, and N3 configurations with at least five random seeds, report mean EER and standard deviations, and add paired statistical tests comparing the configurations. The abstract will be updated to reflect these new results.
  revision: yes
- Referee: [Experimental results] The abstract asserts competitive EER on ASVspoof 2021 LA/DF/In-the-Wild and DFADD generalization but supplies no concrete metrics, baseline comparisons, ablation tables, or error bars, preventing verification of the data-to-claim linkage for the MamBo-3-Hydra-N3 configuration.
  Authors: The full manuscript contains tables reporting concrete EER values for MamBo-3-Hydra-N3 on ASVspoof 2021 LA/DF/In-the-Wild, direct comparisons against baselines including AASIST and other hybrid models, ablation results across the four topological designs, and DFADD generalization numbers. However, the abstract is indeed too high-level and omits these specifics. We will revise the abstract to include the key EER figures for the MamBo-3-Hydra-N3 model together with the strongest baselines, explicitly reference the relevant tables and figures, and incorporate error bars once the multi-seed experiments are completed.
  revision: yes
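The multi-seed protocol promised in the first response could be summarized along the following lines. This is a minimal sketch: the per-seed EER lists are made-up placeholders, not results from the paper, and the function names are hypothetical. The rebuttal does not name a paired test, so an exact sign-flip permutation test is used here as one possible choice. The sketch also assumes the same seeds are used for both configurations so that runs can be paired.

```python
from itertools import product
from statistics import mean, stdev


def summarize(eers):
    """Mean and sample standard deviation of per-seed EERs."""
    return mean(eers), stdev(eers)


def paired_sign_flip_p_value(eers_a, eers_b):
    """Exact two-sided sign-flip permutation test on paired differences."""
    diffs = [a - b for a, b in zip(eers_a, eers_b)]
    observed = abs(sum(diffs))
    count = total = 0
    # Enumerate every assignment of +/- signs to the paired differences.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # small tolerance for float rounding
            count += 1
    return count / total


# PLACEHOLDER per-seed EERs in percent (hypothetical, NOT from the paper).
shallow_eers = [4.0, 6.0, 5.0, 7.0, 3.0]
deep_eers = [3.0, 4.0, 3.5, 4.5, 2.5]

print(summarize(shallow_eers))
print(summarize(deep_eers))
print(paired_sign_flip_p_value(shallow_eers, deep_eers))  # 0.0625
```

One design consequence: with exactly five seeds there are only 32 sign assignments, so the smallest two-sided p-value this test can produce is 2/32 = 0.0625. Reaching a conventional 0.05 threshold with this particular test would need at least six paired runs.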
Circularity Check
No circularity; claims rest on empirical benchmark results with no self-referential reductions
full rationale
The paper proposes XLSR-MamBo, a hybrid Mamba-Attention backbone, and reports competitive EER on ASVspoof 2021 LA/DF/In-the-Wild plus DFADD generalization for the MamBo-3-Hydra-N3 configuration. Central statements attribute performance to Hydra's bidirectional modeling and depth scaling mitigating variance, but these are presented as observations from systematic evaluations of four topological designs using Mamba, Mamba2, Hydra, and Gated DeltaNet variants. No equations, parameter fits, or derivations appear that reduce by construction to their own inputs. No self-citations are invoked as load-bearing uniqueness theorems, and no ansatzes or renamings of known results are smuggled in. The derivation chain is therefore self-contained against external benchmarks rather than circular.
Axiom & Free-Parameter Ledger
axioms (2)
- standard math: State space models offer linear complexity in sequence length
- domain assumption: Hybrid Mamba-Attention backbones can capture global frequency-domain artifacts better than pure causal SSMs
Forward citations
Cited by 1 Pith paper
- A General Model for Deepfake Speech Detection: Diverse Bonafide Resources or Diverse AI-Based Generators
  Balancing diverse bonafide resources and AI generators in training data is the key to building general deepfake speech detection models.