Whisper Hallucination Detection and Mitigation via Hidden Representation Steering and Sparse AutoEncoders

2026 · cs.SD · arXiv 2606.07473

2 Pith papers cite this work. Polarity classification is still indexing.


abstract

Whisper, a widely adopted ASR model, is known to suffer from hallucinations: coherent transcriptions, generated for non-speech audio, that are entirely disconnected from the input. We investigate whether hallucinations can be detected and mitigated through Whisper's internal representations. We extract audio encoder activations and evaluate two representation spaces: raw Whisper activations and Sparse AutoEncoder (SAE) latents. We show that both spaces encode linearly separable hallucination-related information, with discriminative power concentrated in a sparse feature subset and increasing toward deeper encoder layers. We propose two steering strategies: activation-space steering and SAE latent-space steering. SAE-based steering reduces the hallucination rate from 72.63% to 14.11% for Whisper small and from 86.88% to 27.33% for Whisper large-v3 on the full non-speech test set, with small WER degradation on speech data, approaching the performance of fine-tuning-based methods.
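To make the idea of linear separability and activation-space steering concrete, here is a minimal sketch. It is not the paper's method: the abstract does not say how its steering vectors are computed, so this uses a generic difference-of-means direction as a stand-in. The data is synthetic (random vectors standing in for encoder activations), and the names `mean_difference_direction`, `steer`, and `detect` are hypothetical.

```python
import numpy as np


def mean_difference_direction(halluc_acts, clean_acts):
    """Unit vector pointing from the clean-class mean to the hallucination-class mean."""
    direction = halluc_acts.mean(axis=0) - clean_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)


def steer(acts, direction, target):
    """Shift each activation along `direction` so its projection equals `target`."""
    proj = acts @ direction
    return acts + np.outer(target - proj, direction)


# Synthetic stand-in for encoder activations (assumption: NOT real Whisper data).
# The two classes differ along a single feature, mimicking a sparse signal.
rng = np.random.default_rng(0)
dim = 16
offset = np.zeros(dim)
offset[3] = 4.0
clean = rng.normal(size=(200, dim))
halluc = rng.normal(size=(200, dim)) + offset

direction = mean_difference_direction(halluc, clean)

# Linear detector: threshold the projection midway between the class means.
threshold = 0.5 * ((halluc @ direction).mean() + (clean @ direction).mean())


def detect(acts):
    """Flag activations whose projection onto `direction` exceeds the threshold."""
    return acts @ direction > threshold


# Balanced accuracy of the detector on the synthetic data.
accuracy = 0.5 * (detect(halluc).mean() + (~detect(clean)).mean())

# Steering: move hallucination-class activations to the clean-class mean projection.
clean_target = (clean @ direction).mean()
steered = steer(halluc, direction, clean_target)
```

Because `direction` has unit norm, every steered activation ends up with a projection exactly equal to `clean_target`, so the detector no longer flags it. In a real setting the activations would come from a chosen encoder layer, and the steering strength would have to be tuned against WER on speech data.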

representative citing papers

Interpreting and Steering a Text-to-Speech Language Model with Sparse Autoencoders

cs.LG · 2026-06-08 · unverdicted · novelty 6.0

Sparse autoencoders on a TTS language model yield interpretable features that causally control attributes such as laughter, gender, and speech rate via targeted interventions.

Whisper Hallucination Detection and Mitigation via Hidden Representation Steering and Sparse AutoEncoders

cs.SD · 2026-06-05 · unverdicted · novelty 6.0

Hallucination information is linearly separable in Whisper activations and SAE latents; SAE steering reduces hallucination rates from 72.63% to 14.11% (small) and 86.88% to 27.33% (large-v3) on non-speech audio with small WER impact.

citing papers explorer

Showing 2 of 2 citing papers after filters.

Interpreting and Steering a Text-to-Speech Language Model with Sparse Autoencoders
cs.LG · 2026-06-08 · unverdicted · none · ref 15 · internal anchor

Whisper Hallucination Detection and Mitigation via Hidden Representation Steering and Sparse AutoEncoders
cs.SD · 2026-06-05 · unverdicted · none · ref 2 · internal anchor
