arxiv: 2601.02944 · v3 · submitted 2026-01-06 · 📡 eess.AS

XLSR-MamBo: Scaling the Hybrid Mamba-Attention Backbone for Audio Deepfake Detection

Kwok-Ho Ng, Tingting Song, Yongdong Wu, Zhihua Xia

Pith reviewed 2026-05-16 16:59 UTC · model grok-4.3

classification 📡 eess.AS

keywords audio deepfake detection · Mamba · state space models · hybrid architectures · ASVspoof 2021 · XLSR

The pith

Scaling hybrid Mamba-Attention backbones with Hydra achieves competitive audio deepfake detection by capturing bidirectional temporal dependencies more efficiently.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces XLSR-MamBo as a modular framework that pairs an XLSR front-end with hybrid Mamba-Attention backbones for audio deepfake detection. It tests four topological designs built from Mamba, Mamba2, Hydra, and Gated DeltaNet variants. The central result is that the MamBo-3-Hydra-N3 configuration matches state-of-the-art systems on ASVspoof 2021 LA, DF, and In-the-Wild while generalizing to unseen diffusion- and flow-matching methods on DFADD. Hydra's native bidirectional modeling replaces earlier heuristic dual-branch designs, and increasing backbone depth reduces the variance and instability seen in shallower versions.

Core claim

We propose XLSR-MamBo, a framework integrating an XLSR front-end with synergistic Mamba-Attention backbones. Systematic evaluation of four topological designs using Mamba, Mamba2, Hydra, and Gated DeltaNet shows that the MamBo-3-Hydra-N3 configuration reaches competitive performance on the ASVspoof 2021 LA, DF, and In-the-Wild benchmarks. This stems from Hydra's native bidirectional modeling, which captures holistic temporal dependencies more efficiently than heuristic dual-branch strategies in prior work. Scaling backbone depth further reduces performance variance and instability, while DFADD results confirm robust generalization to unseen synthesis methods.

What carries the argument

The XLSR-MamBo modular framework that combines an XLSR front-end with hybrid Mamba-Attention backbones in evaluated topologies, particularly the MamBo-3-Hydra-N3 design.

If this is right

Hydra's native bidirectional modeling captures holistic temporal dependencies more efficiently than heuristic dual-branch strategies.
Scaling backbone depth mitigates the performance variance and instability observed in shallower models.
The hybrid framework generalizes robustly to unseen diffusion- and flow-matching-based synthesis methods on DFADD.
Hybrid architectures can effectively capture artifacts in spoofed speech signals for audio deepfake detection.
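
To make the first point concrete, here is a minimal Python sketch of the kind of heuristic dual-branch strategy the paper contrasts Hydra against: a causal linear recurrence is run once forward and once over the reversed sequence, and the two outputs are summed. The function names, the scalar parameters a and b, and the summation rule are illustrative assumptions, not the paper's implementation, and the sketch does not reproduce Hydra's own bidirectional formulation.

```python
from typing import List


def causal_scan(x: List[float], a: float = 0.9, b: float = 1.0) -> List[float]:
    """Scalar linear recurrence h_t = a * h_{t-1} + b * x_t, one pass over x."""
    h = 0.0
    out = []
    for x_t in x:
        h = a * h + b * x_t
        out.append(h)
    return out


def dual_branch_scan(x: List[float], a: float = 0.9, b: float = 1.0) -> List[float]:
    """Heuristic bidirectional mixing (assumed form): forward scan plus reversed scan."""
    forward = causal_scan(x, a, b)
    # Scan the reversed input, then flip back so positions line up again.
    backward = causal_scan(x[::-1], a, b)[::-1]
    return [f + g for f, g in zip(forward, backward)]
```

With an impulse at the last position, causal_scan leaves every earlier position at zero, whereas dual_branch_scan lets earlier positions see the future sample. In this sketch that comes at the cost of two separate passes and of counting the current sample twice.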

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same depth-scaling pattern could be tested on other audio classification tasks that rely on long-range temporal structure.
If bidirectional SSMs prove more stable at depth, the approach might extend to multimodal detection settings where frequency and time artifacts interact.
Further increases in backbone depth could be explored to determine whether performance continues to improve or saturates on current benchmarks.

Load-bearing premise

That Hydra's native bidirectional modeling captures holistic temporal dependencies more efficiently than prior heuristic dual-branch strategies and that increasing backbone depth generally reduces performance variance and instability.

What would settle it

A direct comparison showing that the MamBo-3-Hydra-N3 configuration fails to reach competitive performance against state-of-the-art systems on the ASVspoof 2021 LA, DF, or In-the-Wild benchmarks.

Original abstract

Advanced speech synthesis technologies have enabled highly realistic speech generation, posing security risks that motivate research into audio deepfake detection (ADD). While state space models (SSMs) offer linear complexity, pure causal SSM architectures often struggle with the content-based retrieval required to capture global frequency-domain artifacts. To address this, we explore the scaling properties of hybrid architectures by proposing XLSR-MamBo, a modular framework integrating an XLSR front-end with synergistic Mamba-Attention backbones. We systematically evaluate four topological designs using advanced SSM variants: Mamba, Mamba2, Hydra, and Gated DeltaNet. Experimental results demonstrate that the MamBo-3-Hydra-N3 configuration achieves competitive performance compared to other state-of-the-art systems on the ASVspoof 2021 LA, DF, and In-the-Wild benchmarks. This performance benefits from Hydra's native bidirectional modeling, which captures holistic temporal dependencies more efficiently than the heuristic dual-branch strategies employed in prior work. Furthermore, evaluations on the DFADD dataset demonstrate robust generalization to unseen diffusion- and flow-matching-based synthesis methods. Crucially, our analysis reveals that scaling backbone depth effectively mitigates the performance variance and instability observed in shallower models. These results demonstrate the hybrid framework's ability to capture artifacts in spoofed speech signals, providing an effective method for ADD.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper applies hybrid Mamba-attention backbones to audio deepfake detection and gets competitive benchmark numbers, but the key claim on depth scaling reducing variance has no supporting stats. The main new element is the MamBo-3-Hydra-N3 configuration and its results on ASVspoof 2021 LA, DF, and In-the-Wild, plus generalization to DFADD for unseen diffusion and flow-matching methods. They run a clean comparison across four topologies using Mamba, Mamba2, Hydra, and Gated DeltaNet on an XLSR front-end, which is a reasonable extension of existing hybrid SSM work rather than a new framework.

The bidirectional modeling in Hydra is presented as more efficient than prior heuristic dual-branch designs for capturing global frequency artifacts, and the modular setup keeps complexity linear. That part is straightforward and useful for anyone already working with sequence models on audio.

The soft spot is the assertion that scaling backbone depth mitigates variance and instability. The text supplies no per-run standard deviations, no multi-seed averages, and no statistical tests comparing the N1/N2/N3 variants, so the stability benefit cannot be separated from single-run effects. The abstract also gives no concrete EER numbers, which makes it hard to judge exactly how competitive the results are. If the full paper includes detailed ablations and error bars, that would fix the gap; otherwise the central advantage over earlier designs rests on unverified claims.

This is for researchers building practical audio deepfake detectors who want to try efficient SSM hybrids on real benchmarks. A reader focused on audio security or efficient sequence modeling would get concrete topology ideas and the DFADD generalization test. I would send it to peer review. The experiments target a timely problem with reproducible benchmarks, and a referee could push for the missing statistics without major rework.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes XLSR-MamBo, a modular hybrid framework pairing an XLSR front-end with Mamba-Attention backbones that incorporate Mamba, Mamba2, Hydra, and Gated DeltaNet variants. It evaluates four topological designs and reports that the MamBo-3-Hydra-N3 configuration attains competitive performance on ASVspoof 2021 LA, DF, and In-the-Wild benchmarks, attributes gains to Hydra's native bidirectional modeling, shows generalization on DFADD, and concludes that scaling backbone depth mitigates variance and instability seen in shallower models.

Significance. If the empirical results and the depth-scaling claim are substantiated with statistical rigor, the work would strengthen the case for hybrid SSM-attention architectures in audio deepfake detection by demonstrating efficient capture of global artifacts and improved stability, offering a practical alternative to prior dual-branch heuristics.

major comments (2)

[Abstract] The load-bearing claim that 'scaling backbone depth effectively mitigates the performance variance and instability observed in shallower models' is unsupported by any reported per-run standard deviations, multi-seed averages, or hypothesis tests comparing the N1/N2/N3 configurations; without these the assertion cannot be distinguished from single-run effects.
[Experimental results] The abstract asserts competitive EER on ASVspoof 2021 LA/DF/In-the-Wild and DFADD generalization but supplies no concrete metrics, baseline comparisons, ablation tables, or error bars, preventing verification of the data-to-claim linkage for the MamBo-3-Hydra-N3 configuration.

minor comments (2)

Define acronyms (XLSR, EER, SSM, ADD) on first use and ensure consistent notation for model variants (e.g., MamBo-3-Hydra-N3) throughout.
Add error bars or variance indicators to all performance tables and figures to improve clarity of the stability claims.
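
Because both major comments turn on EER, a minimal Python sketch of how an equal error rate can be estimated from detector scores may help. It assumes higher scores mean "more likely bona fide" and uses a plain threshold sweep over the observed scores. The function name compute_eer and the score convention are assumptions for illustration; benchmark evaluation tooling may interpolate between thresholds and give slightly different values.

```python
from typing import List, Tuple


def compute_eer(bonafide: List[float], spoof: List[float]) -> Tuple[float, float]:
    """Return (eer, threshold), assuming higher scores mean 'more likely bona fide'."""
    best_gap = float("inf")
    eer, best_thr = 1.0, float("nan")
    for thr in sorted(set(bonafide) | set(spoof)):
        # False rejection rate: bona fide trials scored below the threshold.
        frr = sum(s < thr for s in bonafide) / len(bonafide)
        # False acceptance rate: spoofed trials scored at or above the threshold.
        far = sum(s >= thr for s in spoof) / len(spoof)
        # Keep the threshold where the two error rates are closest.
        if abs(far - frr) < best_gap:
            best_gap = abs(far - frr)
            eer, best_thr = (far + frr) / 2, thr
    return eer, best_thr
```

The sweep is quadratic in the number of trials, which is acceptable for a sketch but would normally be replaced by a sorted cumulative count on full evaluation sets.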

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important areas for strengthening the statistical rigor and clarity of our claims. We address each major comment point by point below and commit to revisions that improve the manuscript without misrepresenting our results.

Referee: [Abstract] The load-bearing claim that 'scaling backbone depth effectively mitigates the performance variance and instability observed in shallower models' is unsupported by any reported per-run standard deviations, multi-seed averages, or hypothesis tests comparing the N1/N2/N3 configurations; without these the assertion cannot be distinguished from single-run effects.

Authors: We acknowledge that the current manuscript reports only single-run results for the N1/N2/N3 depth variants and does not include per-run standard deviations, multi-seed averages, or formal hypothesis tests. The depth-scaling observation was derived from consistent trends across our experiments, but we agree this falls short of statistical substantiation. In the revision we will rerun the N1, N2, and N3 configurations with at least five random seeds, report mean EER and standard deviations, and add paired statistical tests comparing the configurations. The abstract will be updated to reflect these new results.
revision: yes

Referee: [Experimental results] The abstract asserts competitive EER on ASVspoof 2021 LA/DF/In-the-Wild and DFADD generalization but supplies no concrete metrics, baseline comparisons, ablation tables, or error bars, preventing verification of the data-to-claim linkage for the MamBo-3-Hydra-N3 configuration.

Authors: The full manuscript contains tables reporting concrete EER values for MamBo-3-Hydra-N3 on ASVspoof 2021 LA/DF/In-the-Wild, direct comparisons against baselines including AASIST and other hybrid models, ablation results across the four topological designs, and DFADD generalization numbers. However, the abstract is indeed too high-level and omits these specifics. We will revise the abstract to include the key EER figures for the MamBo-3-Hydra-N3 model together with the strongest baselines, explicitly reference the relevant tables and figures, and incorporate error bars once the multi-seed experiments are completed.
revision: yes
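
The multi-seed protocol promised in the first response could look roughly like the following Python sketch: per-configuration mean and sample standard deviation of EER, plus a paired t statistic over matched seeds. The function names are hypothetical, and the numbers are placeholders chosen only to exercise the code. They are not results from the paper.

```python
import math
import statistics
from typing import Dict, List


def summarize(eers: List[float]) -> Dict[str, float]:
    """Mean and sample standard deviation of per-seed EER values."""
    return {"mean": statistics.mean(eers), "std": statistics.stdev(eers)}


def paired_t_statistic(a: List[float], b: List[float]) -> float:
    """t statistic for paired samples (same seeds, two configurations)."""
    diffs = [x - y for x, y in zip(a, b)]
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / std_err


# PLACEHOLDER per-seed EERs (percent) for illustration only; NOT paper results.
shallow_placeholder = [5.0, 3.0, 6.0, 2.0, 4.0]
deep_placeholder = [3.0, 2.5, 3.5, 2.0, 3.0]

shallow_stats = summarize(shallow_placeholder)
deep_stats = summarize(deep_placeholder)
t_stat = paired_t_statistic(shallow_placeholder, deep_placeholder)
```

The t statistic would then be compared against a t distribution with (number of seeds - 1) degrees of freedom. With only five seeds the test has limited power, so reporting the raw per-seed values alongside the summary is a reasonable complement.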

Circularity Check

0 steps flagged

No circularity; claims rest on empirical benchmark results with no self-referential reductions

The paper proposes XLSR-MamBo, a hybrid Mamba-Attention backbone, and reports competitive EER on ASVspoof 2021 LA/DF/In-the-Wild plus DFADD generalization for the MamBo-3-Hydra-N3 configuration. Central statements attribute performance to Hydra's bidirectional modeling and depth scaling mitigating variance, but these are presented as observations from systematic evaluations of four topological designs using Mamba, Mamba2, Hydra, and Gated DeltaNet variants. No equations, parameter fits, or derivations appear that reduce by construction to their own inputs. No self-citations are invoked as load-bearing uniqueness theorems, and no ansatzes or renamings of known results are smuggled in. The derivation chain is therefore self-contained against external benchmarks rather than circular.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on two standard assumptions: that state-space models provide linear complexity, and that hybrid architectures can capture both local and global audio artifacts. No free parameters or invented entities are introduced beyond standard deep-learning components.

axioms (2)

standard math: State space models offer linear complexity
Invoked in the abstract as background motivation for using SSMs.
domain assumption: Hybrid Mamba-Attention backbones can capture global frequency-domain artifacts better than pure causal SSMs
Core premise stated in the abstract to justify the hybrid design.
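
For readers who want the first axiom spelled out, a generic discretized linear state space recurrence is shown below. The symbols are introduced here for illustration and are not taken from the paper, whose SSM variants use more elaborate, input-dependent parameterizations.

```latex
\begin{aligned}
h_t &= \bar{A}\, h_{t-1} + \bar{B}\, x_t, \\
y_t &= C\, h_t, \qquad t = 1, \dots, L .
\end{aligned}
```

For a fixed state size, each step costs the same amount of work regardless of the sequence length L, so the whole sequence is processed in O(L) time. Self-attention instead compares every pair of positions, which costs O(L^2).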

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

A General Model for Deepfake Speech Detection: Diverse Bonafide Resources or Diverse AI-Based Generators
cs.SD · 2026-03 · unverdicted · novelty 4.0

Balancing diverse bonafide resources and AI generators in training data is the key to building general deepfake speech detection models.