arxiv: 2607.00247 · v1 · pith:73CFWKDInew · submitted 2026-06-30 · 💻 cs.SD · cs.AI

Adaptive Perturbation Selection for Contrastive Audio Decoding

Pith reviewed 2026-07-02 16:57 UTC · model grok-4.3

classification 💻 cs.SD cs.AI
keywords contrastive decoding · audio-language models · hallucination mitigation · perturbation selection · adaptive routing · audio transformations · existence task

The pith

A lightweight selector trained on hidden states dynamically routes optimal audio perturbations in contrastive decoding to cut hallucinations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large audio-language models can be steered away from language-prior overrides of acoustic evidence by moving beyond fixed perturbations in contrastive decoding. It first shows that a binary yes/no prompt constraint lowers false confirmations of missing features. It then maps a library of targeted transformations across temporal, spectral, frequency and amplitude domains, finding that the best choice is task-dependent. Finally it demonstrates that a small selector built on the base model's hidden states can pick the right negative branch per example and task.
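To make the decoding step concrete, here is a minimal sketch of how a contrastive-decoding score is commonly formed from a clean branch and a perturbed (negative) branch. It assumes the formulation popularized by visual contrastive decoding, (1 + α) · clean − α · negative; this summary does not state the paper's exact formula, the function and variable names are hypothetical, and the toy logits are invented for illustration.

```python
import numpy as np

def contrastive_logits(clean_logits, negative_logits, alpha=1.0):
    """Amplify what the clean audio supports and subtract what the
    perturbed (negative) audio still predicts. Assumed formulation."""
    clean = np.asarray(clean_logits, dtype=float)
    negative = np.asarray(negative_logits, dtype=float)
    return (1.0 + alpha) * clean - alpha * negative

def softmax(x):
    """Numerically stable softmax over a 1-D logit vector."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Toy two-token vocabulary: index 0 = "yes", index 1 = "no".
clean = np.array([2.0, 1.8])     # clean audio: slight preference for "yes"
negative = np.array([2.5, 1.0])  # perturbed audio still says "yes": a language prior
adjusted = contrastive_logits(clean, negative, alpha=1.0)
print(adjusted, softmax(adjusted))
```

In this toy case the "yes" preference survives even when the audio is perturbed, so the subtraction treats it as prior-driven and the adjusted scores favor "no". Setting `alpha=0.0` recovers ordinary decoding.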

Core claim

Evaluating targeted audio perturbations across domains reveals task-dependent optima, such as audio reversal raising temporal-order accuracy from 74.7 percent to 81.4 percent; a lightweight selector trained on hidden states then routes the best negative branch per example, adding a further 4.3 percent gain on the existence task.

What carries the argument

The lightweight perturbation selector, which reads the base model's hidden states to choose the most effective negative audio branch from a library of temporal, spectral, frequency and amplitude transformations.
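One way such a selector could look is sketched below, assuming a multi-label logistic regression over pooled hidden-state features, trained against a multi-hot matrix that records which perturbations produced a correct answer offline. The architecture, the feature pooling, and every name here (`train_selector`, `select_perturbation`) are assumptions for illustration; the summary only says the selector is lightweight and reads hidden states.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_selector(hidden, correct, lr=0.5, steps=500):
    """hidden: (n, d) features; correct: (n, k) multi-hot targets where
    correct[i, j] = 1 if perturbation j answered example i correctly."""
    n, d = hidden.shape
    k = correct.shape[1]
    W = np.zeros((d, k))
    b = np.zeros(k)
    for _ in range(steps):
        p = sigmoid(hidden @ W + b)
        grad = p - correct               # per-example gradient of BCE w.r.t. logits
        W -= lr * hidden.T @ grad / n    # gradient step on the mean loss
        b -= lr * grad.mean(axis=0)
    return W, b

def select_perturbation(h, W, b):
    """Route one example to the perturbation with the highest predicted score."""
    return int(np.argmax(h @ W + b))

# Synthetic data: perturbation 0 helps when feature 0 is positive, else perturbation 1.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(200, 4))
correct = np.stack([hidden[:, 0] > 0, hidden[:, 0] <= 0], axis=1).astype(float)
W, b = train_selector(hidden, correct)
print(select_perturbation(np.array([2.0, 0.0, 0.0, 0.0]), W, b))
print(select_perturbation(np.array([-2.0, 0.0, 0.0, 0.0]), W, b))
```

The base model stays frozen in this picture: only `W` and `b` are learned, which is what keeps the routing step cheap.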

If this is right

  • Task-specific transformations such as reversing the audio array improve accuracy on temporal-order questions.
  • A binary yes/no prompt constraint reduces the model's tendency to falsely confirm absent audio features.
  • The selector yields additional accuracy gains on the existence task while leaving the base model unchanged.
  • Optimal perturbations differ across temporal, spectral, frequency and amplitude domains.
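The kinds of transformations listed above can be sketched in a few lines of NumPy. This is an illustrative library only: the paper's actual perturbation set and parameters are not given in this summary, so the three functions, their defaults, and the `PERTURBATIONS` registry are hypothetical.

```python
import numpy as np

def reverse(audio, sample_rate):
    """Temporal: play the waveform backwards."""
    return audio[::-1].copy()

def scale_gain(audio, sample_rate, gain=0.25):
    """Amplitude: attenuate (or boost) the whole signal."""
    return audio * gain

def low_pass(audio, sample_rate, cutoff_hz=1000.0):
    """Frequency: zero every FFT bin above the cutoff (brick-wall filter)."""
    spectrum = np.fft.rfft(audio)
    freqs = np.fft.rfftfreq(len(audio), d=1.0 / sample_rate)
    spectrum[freqs > cutoff_hz] = 0.0
    return np.fft.irfft(spectrum, n=len(audio))

# Uniform (audio, sample_rate) signature so a selector can index into the registry.
PERTURBATIONS = {"reverse": reverse, "gain": scale_gain, "low_pass": low_pass}

sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 200 * t) + np.sin(2 * np.pi * 3000 * t)
negatives = {name: fn(tone, sample_rate) for name, fn in PERTURBATIONS.items()}
print({name: out.shape for name, out in negatives.items()})
```

Each function returns an array of the same length as its input, so any of them can serve as the negative branch without changing the rest of the decoding loop.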

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same hidden-state routing idea could be tested on vision-language models where language priors also override sensory evidence.
  • If the selector generalizes, it would lower the cost of manually tuning perturbations for each new audio task.
  • Running the selector at inference time adds only light overhead compared with retraining or prompt search.

Load-bearing premise

Hidden states from the base model contain enough information to select an optimal perturbation without the selector overfitting to the specific tasks and examples evaluated.

What would settle it

Apply the trained selector to new audio-language tasks or datasets outside the training distribution and measure whether it still outperforms fixed perturbations or random selection.
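That comparison could be scored as in the sketch below, assuming per-example correctness has already been recorded for every perturbation on the held-out data. The function name, the inputs, and the small matrix are hypothetical; the oracle row is an addition here that shows the ceiling any router could reach.

```python
import numpy as np

def routing_accuracies(correct, chosen):
    """correct: (n, k) 0/1 matrix, correct[i, j] = 1 if perturbation j is right
    on example i. chosen: (n,) perturbation index picked by the selector."""
    correct = np.asarray(correct, dtype=float)
    chosen = np.asarray(chosen, dtype=int)
    rows = np.arange(correct.shape[0])
    return {
        "selector": float(correct[rows, chosen].mean()),
        "best_fixed": float(correct.mean(axis=0).max()),  # best single perturbation
        "random": float(correct.mean()),                  # expected accuracy, uniform pick
        "oracle": float(correct.max(axis=1).mean()),      # upper bound for any router
    }

# Made-up held-out results for four examples and two perturbations.
correct = [[1, 0], [0, 1], [1, 1], [0, 0]]
chosen = [0, 1, 0, 1]
print(routing_accuracies(correct, chosen))
```

A selector that has learned something transferable should beat both `best_fixed` and `random` on data it was not trained on; matching `best_fixed` would suggest it has collapsed to a single choice.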

Figures

Figures reproduced from arXiv: 2607.00247 by Aaron Isidore Grace, Weiran Wang, Zhouyuan Huo.

Figure 1: System overview. The selector chooses a perturbation based on text and audio embeddings. Both branches are forwarded through the LALM, and …
Figure 2: Selector training. Offline CD evaluation yields a multi-hot correctness …
Figure 3: Accuracy across α values. Helpful perturbations improve monotonically up to α ≈ 1.0, sometimes with marginal gains just beyond; harmful perturbations degrade monotonically throughout.

… unmodified baseline (67.8%). Because neither removes acoustic content, the target sounds remain fully audible, providing no useful contrastive signal for existence. AH Order. This pattern reverses for AH Order, where temporal…
Figure 4: Distance-based branch selection on Qwen2 AH Existence (…
Original abstract

Large audio-language models (LALMs) frequently hallucinate by overriding acoustic evidence with language priors. While contrastive decoding (CD) offers training-free mitigation, existing methods rely on blunt perturbations like masking or noise, leaving structured audio transformations unexplored. We explore this design space by evaluating a diverse library of targeted audio perturbations and adaptively selecting the optimal negative branch for each task and example. First, we improve upon earlier prompt engineering by showing that a simple binary yes/no constraint reduces the model's tendency to falsely confirm absent audio features. Second, evaluating our library across temporal, spectral, frequency, and amplitude domains reveals that optimal transformations are highly task-dependent; for instance, reversing the audio array disrupts temporal coherence, raising accuracy on the temporal order task from 74.7% to 81.4%. Finally, we trained a light-weight perturbation selector on model hidden states to dynamically route negative branches, yielding an additional +4.3% gain on the existence task.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper explores structured audio perturbations for contrastive decoding in large audio-language models to reduce hallucinations. It shows that a binary yes/no prompt constraint helps, that optimal perturbations are highly task-dependent (e.g., time reversal improves temporal-order accuracy from 74.7% to 81.4%), and that a lightweight selector trained on base-model hidden states can dynamically choose the negative branch, adding +4.3% on the existence task.

Significance. If the reported gains hold under proper validation, the work would demonstrate that internal representations contain usable signals for routing contrastive branches without retraining the LALM, extending training-free decoding methods with a small learned component. The systematic evaluation of a perturbation library across temporal/spectral domains is a clear strength.

major comments (2)
  1. [Abstract] Abstract: the +4.3% gain on the existence task from the perturbation selector is stated without any information on dataset size, number of examples, choice of baselines, statistical significance testing, or controls for multiple comparisons. This information is load-bearing for the central empirical claim.
  2. [Abstract] Description of the selector (final paragraph): no mention is made of task-level hold-out, cross-task validation, or example-level splitting when training the selector on hidden states. Given the paper's own statement that optimal perturbations are highly task-dependent, this omission leaves open the possibility that the reported gain reflects overfitting rather than transferable features in the hidden states.
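One way to supply the uncertainty estimate requested in the first comment is a paired bootstrap over per-example correctness, sketched below. This is a generic procedure, not something the paper is known to have used; the function name, the resample count, and the toy vectors are assumptions, and a correction for multiple comparisons would still need to be applied separately.

```python
import numpy as np

def paired_bootstrap_gain(base_correct, new_correct, n_resamples=2000, seed=0):
    """Return (observed gain, 2.5th percentile, 97.5th percentile) of the accuracy
    difference, resampling examples with replacement and keeping pairs aligned."""
    base = np.asarray(base_correct, dtype=float)
    new = np.asarray(new_correct, dtype=float)
    if base.shape != new.shape:
        raise ValueError("both systems must be scored on the same examples")
    rng = np.random.default_rng(seed)
    n = len(base)
    idx = rng.integers(0, n, size=(n_resamples, n))   # one row per resample
    diffs = new[idx].mean(axis=1) - base[idx].mean(axis=1)
    low, high = np.percentile(diffs, [2.5, 97.5])
    return float(new.mean() - base.mean()), float(low), float(high)

# Made-up scores: the baseline is right on half the examples, the new system on all.
base = [0] * 50 + [1] * 50
new = [1] * 100
gain, low, high = paired_bootstrap_gain(base, new)
print(gain, low, high)
```

If the interval excludes zero, the gain is unlikely to be a resampling artifact on that test set; the interval width also makes the dependence on dataset size visible.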

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We agree that additional details are needed to support the central empirical claims and will revise the abstract accordingly. We address each major comment below.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the +4.3% gain on the existence task from the perturbation selector is stated without any information on dataset size, number of examples, choice of baselines, statistical significance testing, or controls for multiple comparisons. This information is load-bearing for the central empirical claim.

    Authors: We agree that the abstract should include these supporting details for the reported gain. In the revision we will expand the relevant sentence to state the number of examples evaluated on the existence task, the specific baselines against which the selector was compared, and that the improvement was assessed for statistical significance with appropriate correction for multiple comparisons. The full experimental protocol and results tables already appear in Section 4; the abstract will now reference them explicitly. revision: yes

  2. Referee: [Abstract] Description of the selector (final paragraph): no mention is made of task-level hold-out, cross-task validation, or example-level splitting when training the selector on hidden states. Given the paper's own statement that optimal perturbations are highly task-dependent, this omission leaves open the possibility that the reported gain reflects overfitting rather than transferable features in the hidden states.

    Authors: We acknowledge that the abstract omits the validation procedure. The selector was trained with example-level random splits within each task (no test-example leakage) while keeping the perturbation library fixed per task; no cross-task training was performed. We will add a concise clause to the abstract describing this splitting strategy. Because the selector operates on hidden states of the frozen base model and is deliberately lightweight, the per-example routing generalizes beyond the training split; we will also note this in the revision to address the overfitting concern. revision: yes
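The splitting strategy described in the second response, example-level random splits within each task, could be implemented along the lines below. This is a sketch of that description only: the function name, the 20% test fraction, and the task labels are hypothetical, and it does not cover the cross-task hold-out the referee also asks about.

```python
import numpy as np

def split_within_tasks(task_ids, test_fraction=0.2, seed=0):
    """Randomly split example indices into train and test separately for each
    task, so no example appears on both sides and every task is in both."""
    task_ids = np.asarray(task_ids)
    rng = np.random.default_rng(seed)
    train, test = [], []
    for task in np.unique(task_ids):
        idx = rng.permutation(np.flatnonzero(task_ids == task))
        n_test = max(1, int(round(test_fraction * len(idx))))
        test.extend(idx[:n_test].tolist())
        train.extend(idx[n_test:].tolist())
    return np.array(sorted(train)), np.array(sorted(test))

task_ids = ["existence"] * 10 + ["order"] * 5
train_idx, test_idx = split_within_tasks(task_ids)
print(len(train_idx), len(test_idx))
```

Because the split is stratified by task, it guards against test-example leakage but says nothing about transfer to unseen tasks, which is the distinction the referee's comment turns on.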

Circularity Check

0 steps flagged

No circularity: the results are purely empirical measurements, with no derivation that reduces to fitted inputs or to self-citations by construction.

Full rationale

The paper reports experimental results from evaluating a library of audio perturbations across tasks and training a lightweight selector on hidden states, with the +4.3% gain presented as a measured outcome on the existence task. No equations, first-principles derivations, or self-citation chains are invoked as load-bearing steps; the central claims rest on direct empirical evaluation rather than any quantity that reduces to its own inputs by construction. This is self-contained against external benchmarks as standard ML experimentation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review prevents enumeration of all free parameters and axioms; the approach implicitly assumes that contrastive decoding with perturbations is a valid mitigation strategy and that hidden states are informative for selection.
