pith. machine review for the scientific record.

arxiv: 2605.08186 · v1 · submitted 2026-05-05 · 📡 eess.AS · cs.AI · cs.LG

Recognition: no theorem link

Rethinking Entropy Minimization in Test-Time Adaptation for Autoregressive Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 00:45 UTC · model grok-4.3

classification 📡 eess.AS · cs.AI · cs.LG
keywords test-time adaptation · entropy minimization · autoregressive models · policy gradient · Whisper · speech recognition · adaptation

The pith

Entropy minimization for test-time adaptation in autoregressive models decomposes into token-level policy gradient and entropy losses.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the lack of a unified theory for using entropy minimization to adapt autoregressive models at test time, where previous work used separate heuristics like teacher forcing or reinforcement learning. It derives the exact objective for these models and shows it breaks down into a token-by-token policy gradient loss plus a token entropy loss. This allows reinterpreting earlier techniques as incomplete parts of the same framework. When tested on the Whisper automatic speech recognition model, the approach improves results in over twenty different settings involving noise, accents, and multiple languages. A clear understanding here matters because it provides a mathematical basis for making generative models more robust without needing new training data.
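To make the mechanism concrete, the sketch below shows what one adaptation step under such a decomposed objective could look like. This is a minimal illustration, not the authors' implementation: the `sample_transcription` helper, the model interface, the equal weighting of the two terms, and the sample size are all assumptions for illustration.

```python
import torch

def em_tta_step(model, optimizer, audio_features, sample_transcription, num_samples=16):
    """Sketch of one test-time adaptation step under a token-level EM objective.

    `sample_transcription` is an assumed helper (not from the paper) that
    autoregressively samples a transcription and returns (token_ids, per_step_logits),
    standing in for a Whisper-style decoding loop that keeps the computation graph.
    """
    model.train()
    losses = []
    for _ in range(num_samples):
        tokens, logits = sample_transcription(model, audio_features)    # (T,), (T, V)
        log_probs = torch.log_softmax(logits, dim=-1)
        probs = log_probs.exp()
        # Token-level entropy H(y_t | y_<t, x) at every decoding step.
        token_entropy = -(probs * log_probs).sum(dim=-1)                 # (T,)
        chosen_logp = log_probs[torch.arange(tokens.numel()), tokens]    # log p(y_t | y_<t, x)
        # REINFORCE-style weight: entropy accumulated at strictly later steps,
        # treated as a fixed "return" (stop-gradient on the weights).
        H = token_entropy.detach()
        entropy_to_go = H.flip(0).cumsum(0).flip(0) - H
        pg_loss = (chosen_logp * entropy_to_go).sum()     # token-level policy gradient term
        ent_loss = token_entropy.sum()                    # token-level entropy term
        losses.append(pg_loss + ent_loss)
    loss = torch.stack(losses).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Minimizing this surrogate by gradient descent drives the sequence-level entropy down while keeping both components of the decomposition; dropping either term recovers the kind of partial heuristic the paper argues against.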

Core claim

When entropy minimization is formulated rigorously for autoregressive models, the exact objective decomposes naturally into a token-level policy gradient loss and a token-level entropy loss. Prior methods are then reinterpreted as partial realizations of this unified formulation.

What carries the argument

The exact decomposition of the entropy minimization objective into a token-level policy gradient loss and a token-level entropy loss.
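Written out under the standard autoregressive factorization p_θ(y|x) = ∏_t p_θ(y_t | y_<t, x), the decomposition referred to here takes the following form; this is a reconstruction from the abstract and the chain rule for entropy, not a quotation of the paper's equations.

```latex
H_\theta(y \mid x)
  \;=\; \mathbb{E}_{y \sim p_\theta(\cdot \mid x)}\!\left[ -\log p_\theta(y \mid x) \right]
  \;=\; \sum_{t} \mathbb{E}_{y_{<t} \sim p_\theta(\cdot \mid x)}\!\left[ H_\theta\!\left( y_t \mid y_{<t}, x \right) \right]
```

Differentiating the outer expectation, which depends on θ through the sampled prefixes, produces the policy-gradient term; differentiating the inner conditional entropy produces the entropy term.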

If this is right

  • Previous test-time adaptation methods for autoregressive models are partial realizations of the unified entropy minimization objective.
  • The full formulation can be applied to improve performance in automatic speech recognition across diverse conditions.
  • Consistent gains are observed in more than 20 domains including acoustic noise, accents, and multilingual settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The token-level decomposition could be used to design more efficient adaptation algorithms for other sequence generation tasks like language modeling.
  • This framework might help combine entropy minimization with other test-time techniques in a principled way.
  • Similar derivations could apply to different autoregressive architectures beyond the Whisper model used in the experiments.

Load-bearing premise

The entropy minimization objective for autoregressive models admits an exact decomposition into token-level policy gradient and entropy losses without additional approximations.

What would settle it

Computing the entropy minimization objective directly on a small autoregressive model, checking that it equals the token-level entropy decomposition, and checking that its gradient equals the gradient of the proposed token-level policy gradient loss plus token-level entropy loss.
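A toy version of that check, on a two-step tabular autoregressive model small enough to enumerate exactly, might look like the sketch below. The tiny model, the surrogate construction, and the tolerances are illustrative assumptions, not the paper's experiment.

```python
import torch

torch.manual_seed(0)
V = 3  # vocabulary size; the sequence length is fixed at 2 steps

# Tiny tabular autoregressive model: theta1 gives step-1 logits,
# theta2[y1] gives step-2 logits conditioned on the step-1 token.
theta1 = torch.randn(V, requires_grad=True)
theta2 = torch.randn(V, V, requires_grad=True)

# (a) Exact sequence-level entropy by brute-force enumeration.
H_exact = torch.zeros(())
for y1 in range(V):
    for y2 in range(V):
        lp = torch.log_softmax(theta1, -1)[y1] + torch.log_softmax(theta2[y1], -1)[y2]
        H_exact = H_exact - lp.exp() * lp

# (b) Token-level decomposition: H = H(y1 | x) + E_{y1}[ H(y2 | y1, x) ].
p1, lp1 = torch.softmax(theta1, -1), torch.log_softmax(theta1, -1)
p2, lp2 = torch.softmax(theta2, -1), torch.log_softmax(theta2, -1)
H1 = -(p1 * lp1).sum()
H2_given = -(p2 * lp2).sum(-1)          # H(y2 | y1) for every y1
H_decomp = H1 + (p1 * H2_given).sum()
print("objective matches:", torch.allclose(H_exact, H_decomp, atol=1e-6))

# (c) Gradient check: exact gradient vs. a surrogate whose gradient is the sum of
# a policy-gradient term (prefix log-prob weighted by detached downstream entropy)
# and a token-level entropy term.
g_exact = torch.autograd.grad(H_exact, (theta1, theta2), retain_graph=True)
surrogate = (H1 + (p1.detach() * H2_given).sum()                 # entropy terms
             + (p1.detach() * lp1 * H2_given.detach()).sum())    # policy-gradient term
g_surr = torch.autograd.grad(surrogate, (theta1, theta2))
print("gradients match:",
      all(torch.allclose(a, b, atol=1e-6) for a, b in zip(g_exact, g_surr)))
```

Both prints should report True if the decomposition is exact; a failure at either point would localize where the claimed equivalence breaks.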

Figures

Figures reproduced from arXiv: 2605.08186 by Chee-En Yu, Guan-Ting Lin, Hung-yi Lee, Wei-Ping Huang.

Figure 1
Figure 1: The correct EM objective for the autoregressive model decomposes into a token-level policy gradient loss and a token-level entropy loss. See Section 3 for the details.
Figure 2
Figure 2: WER (%) across varying sample sizes for Gaussian Noise (left), Spanish Accent (middle), and Polish (right) domains. Compares the full token-level objective (EM-tok) against its two components (PG-tok, ENT-tok).
Figure 3
Figure 3: WER (%) across varying sample sizes for Gaussian Noise (left), Spanish Accent (middle), and Polish (right) domains. Compares EM-tok and the variants utilizing beam search transcriptions (EM-tok-b, PG-tok-b, and ENT-tok-b).
Figure 5
Figure 5: Compares the WER against the average runtime (in seconds) for a 1-second utterance across different methods on the LS-GS-10 dataset. We evaluate the proposed TTA methods with sample sizes G ∈ {4, 16, 64} and Greedy-EM. Our results demonstrate that EM-tok-b with G = 16 provides a good balance between adaptation performance and computational overhead. While Greedy-EM is computation…
read the original abstract

Test-Time Adaptation (TTA) via entropy minimization (EM) has proven effective for classification tasks, yet its application to generative autoregressive models remains theoretically fragmented. Existing approaches typically rely on distinct heuristics, such as teacher forcing with pseudo labels or policy-gradient-based reinforcement learning, without a unified mathematical foundation. In this work, we resolve this discrepancy by deriving a rigorous formulation of EM tailored to autoregressive models. We show that the exact objective naturally decomposes into a token-level policy gradient loss and a token-level entropy loss, and we reinterpret prior methods as partial realizations of this unified formulation. Using Whisper ASR as a testbed, we demonstrate that our approach consistently improves performance across more than 20 diverse domains, including acoustic noise, accents, and multilingual settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper derives a rigorous formulation of entropy minimization for test-time adaptation in autoregressive models. Applying the chain rule to the joint sequence entropy H(y|x) yields an exact decomposition into a token-level policy-gradient loss (arising from the parameter-dependent distribution over sampled prefixes) and a token-level entropy loss. Prior heuristics are recovered exactly as special cases by dropping terms or fixing conditioning. Experiments on Whisper ASR demonstrate consistent gains across more than 20 domains including noise, accents, and multilingual settings.

Significance. If the derivation holds, the work supplies a unified, exact mathematical foundation for entropy-minimization TTA in generative autoregressive models, replacing the previously fragmented heuristics. The parameter-free character of the decomposition and the recovery of earlier methods as partial realizations are clear strengths. The broad empirical validation on Whisper across diverse domains further supports practical relevance.

minor comments (3)
  1. §3.2, Eq. (7): the transition from the joint entropy to the token-wise expectation is stated without an explicit statement of the measure under which the outer expectation is taken; adding one sentence would remove any residual ambiguity.
  2. Table 2: the baseline columns do not report the number of adaptation steps or the learning-rate schedule used for the compared methods; this makes direct comparison of computational cost difficult.
  3. §4.3: the multilingual results would benefit from a short discussion of whether the token-level entropy term interacts with language-specific tokenizers.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive evaluation of our work, the clear summary of our contributions, and the recommendation for minor revision. No major comments were provided in the report.

Circularity Check

0 steps flagged

Derivation is self-contained via standard chain rule on autoregressive entropy

full rationale

The central derivation applies the chain rule to obtain H(y|x) = sum_t E_{y_<t ~ p}[H(y_t | y_<t, x)], then differentiates the resulting objective with respect to model parameters. This produces an exact token-level entropy term plus a policy-gradient term arising from the parameter-dependent sampling distribution; both follow directly from the definition of entropy and the autoregressive factorization without invoking prior TTA heuristics. Prior methods are recovered post-derivation by dropping terms or fixing conditioning, which is a consequence rather than an input to the math. No self-citation load-bearing step, fitted-input prediction, or ansatz smuggling is present in the described chain.
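In symbols, and using the score-function identity ∇_θ p_θ = p_θ ∇_θ log p_θ, the gradient sketched in this rationale would read as follows; this is a reconstruction from the chain-rule argument above, not a quotation of the paper's numbered equations.

```latex
\nabla_\theta H_\theta(y \mid x)
  \;=\; \sum_{t} \mathbb{E}_{y_{<t} \sim p_\theta(\cdot \mid x)}\!\Big[
      \underbrace{H_\theta(y_t \mid y_{<t}, x)\,\nabla_\theta \log p_\theta(y_{<t} \mid x)}_{\text{token-level policy gradient}}
      \;+\;
      \underbrace{\nabla_\theta H_\theta(y_t \mid y_{<t}, x)}_{\text{token-level entropy}}
  \Big]
```

Teacher-forcing-style heuristics keep only the second term with fixed pseudo-label prefixes, while pure reinforcement-learning-style heuristics keep only the first, which is exactly the "partial realization" reading given above.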

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on an exact mathematical decomposition of entropy minimization under the autoregressive factorization; the abstract provides no explicit free parameters, axioms, or invented entities, but the derivation necessarily assumes standard properties of entropy and conditional probability that are not detailed here.

pith-pipeline@v0.9.0 · 5437 in / 1089 out tokens · 48349 ms · 2026-05-12T00:45:00.209736+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 2 internal anchors

  1. [1]

    Rethinking Entropy Minimization in Test-Time Adaptation for Autoregressive Models

Introduction Test-Time Adaptation (TTA) has emerged as a promising paradigm for addressing the distribution shifts of real-world data at inference time. By adapting the source model using only test data, TTA enhances robustness without requiring access to the original training distribution. Among the various TTA types, episodic TTA [1] focuses on the one-s...

  2. [2]

The foundational success of EM-based TTA is largely demonstrated in tasks where the model’s predictive distribution is treated as a set of independent categorical variables

Related Work Entropy minimization has established itself as a fundamental building block for TTA across diverse modalities. The foundational success of EM-based TTA is largely demonstrated in tasks where the model’s predictive distribution is treated as a set of independent categorical variables. In computer vision, TENT [2] pioneers optimizing batch-no...

  3. [3]

provides the first systematic comparison between these two heuristics under a post-training setting, highlighting the empirical differences. The divide between these two strategies in prior work creates ambiguity and leaves the theoretical connection between the heuristics and the true entropy minimization objective unclear. We resolve this ambiguit...

  4. [4]

    In Section 3.2 and 3.3, we present our primary theoretical contribution of the derivation of the mathematically complete gradient expression of EM for autoregressive models

Method We first establish the formal notation for autoregressive generation and the entropy estimation in Section 3.1. In Section 3.2 and 3.3, we present our primary theoretical contribution of the derivation of the mathematically complete gradient expression of EM for autoregressive models. Section 3.4 elucidates the theoretical linkage between our uni...

  5. [5]

    Datasets We focus our empirical validation on the ASR task

Experiments 4.1. Datasets We focus our empirical validation on the ASR task. We experiment on three datasets: Corrupted Librispeech (LS-C) [24]: The dataset is constructed by adding noises from MS-SNSD into Librispeech test-other set. The noises include air conditioner (AC), airport announcement (AA), babble (BA), copy machine (CM), munching (MU), neigh...

  6. [6]

    athlete” as “a fleet

Discussion 5.1. Components in Token-level Objective To investigate the interaction between the token-level policy gradient loss and the token-level entropy loss, we evaluate them across varying sample sizes (G ∈ {1, 4, 16, 64}) under three distinct domains: Gaussian Noise (LS-GS-10), Spanish accents (L2-Spanish), and Polish (MLS-Polish). We compare EM-tok aga...

  7. [7]

    We introduce the complete EM objective and unify previously fragmented heuristics within a principled framework

Conclusion In this work, we resolve the theoretical ambiguity surrounding EM in TTA for autoregressive models by deriving a mathematically correct gradient expression. We introduce the complete EM objective and unify previously fragmented heuristics within a principled framework. Through extensive evaluation on TTA for Whisper ASR across more than 20 di...

  8. [8]

Acknowledgement This work was supported by the Ministry of Education (MOE) of Taiwan under the project Taiwan Centers of Excellence in Artificial Intelligence, through the NTU Artificial Intelligence Center of Research Excellence (NTU AI-CoRE). We thank the National Center for High-performance Computing (NCHC) of the National Applied Research Laborator...

  9. [9]

The authors remain solely responsible for the research design, experiments, analysis, and reported results

Generative AI Use Disclosure Generative AI tools assisted in the linguistic polishing of the manuscript. The authors remain solely responsible for the research design, experiments, analysis, and reported results. AI tools did not contribute to the substantive scientific content

  10. [10]

    Memo: test time robustness via adaptation and augmentation,

M. Zhang, S. Levine, and C. Finn, “Memo: test time robustness via adaptation and augmentation,” ser. NIPS ’22. Red Hook, NY, USA: Curran Associates Inc., 2022

  11. [11]

    Tent: Fully test-time adaptation by entropy minimization,

D. Wang, E. Shelhamer, S. Liu, B. Olshausen, and T. Darrell, “Tent: Fully test-time adaptation by entropy minimization,” in International Conference on Learning Representations, 2020

  12. [12]

    Efficient test-time model adaptation without forgetting,

S. Niu, J. Wu, Y. Zhang, Y. Chen, S. Zheng, P. Zhao, and M. Tan, “Efficient test-time model adaptation without forgetting,” in International Conference on Machine Learning. PMLR, 2022, pp. 16888–16905

  13. [13]

    Continual test-time domain adaptation,

Q. Wang, O. Fink, L. Van Gool, and D. Dai, “Continual test-time domain adaptation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 7201–7211

  14. [14]

The unreasonable effectiveness of entropy minimization in LLM reasoning. arXiv preprint arXiv:2505.15134, 2025

S. Agarwal, Z. Zhang, L. Yuan, J. Han, and H. Peng, “The unreasonable effectiveness of entropy minimization in LLM reasoning,” arXiv preprint arXiv:2505.15134, 2025

  15. [15]

One-shot entropy minimization,

Z. Gao, L. Chen, H. Luo, J. Zhou, and B. Dai, “One-shot entropy minimization,” 2025. [Online]. Available: https://arxiv.org/abs/2505.20282

  16. [16]

    Slot: Sample-specific language model optimization at test-time,

Y. Hu, X. Zhang, X. Fang, Z. Chen, X. Wang, H. Zhang, and G. Qi, “Slot: Sample-specific language model optimization at test-time,” CoRR, vol. abs/2505.12392, May 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2505.12392

  17. [17]

    Listen, Adapt, Better WER: Source-free Single-utterance Test-time Adaptation for Automatic Speech Recognition,

G.-T. Lin, S.-W. Li, and H.-y. Lee, “Listen, Adapt, Better WER: Source-free Single-utterance Test-time Adaptation for Automatic Speech Recognition,” in Proc. Interspeech 2022, 2022, pp. 2198–2202

  18. [18]

SGEM: Test-Time Adaptation for Automatic Speech Recognition via Sequential-Level Generalized Entropy Minimization,

C. Kim, J. Park, H. Shim, and E. Yang, “SGEM: Test-Time Adaptation for Automatic Speech Recognition via Sequential-Level Generalized Entropy Minimization,” in Proc. INTERSPEECH 2023, 2023, pp. 3367–3371

  19. [19]

    Advancing test-time adaptation in wild acoustic test settings,

H. Liu, H. Huang, and Y. Wang, “Advancing test-time adaptation in wild acoustic test settings,” in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Y. Al-Onaizan, M. Bansal, and Y.-N. Chen, Eds. Miami, Florida, USA: Association for Computational Linguistics, Nov. 2024, pp. 7138–7155. [Online]. Available: https://ac...

  20. [20]

    Slm-tta: A framework for test-time adaptation of generative spoken language models,

Y.-K. Wu, Y. Liu, Y. Huang, Z. Yang, H. Wu, R. Huang, Yi-Te Hsu, S. Kong, M. Sun, F. Metze, and L. Wan, “Slm-tta: A framework for test-time adaptation of generative spoken language models,” 2025. [Online]. Available: https://arxiv.org/abs/2512.24739

  21. [21]

    Robust Speech Recognition via Large-Scale Weak Supervision

A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” 2022. [Online]. Available: https://arxiv.org/abs/2212.04356

  22. [22]

    Towards stable test-time adaptation in dynamic wild world,

S. Niu, J. Wu, Y. Zhang, Z. Wen, Y. Chen, P. Zhao, and M. Tan, “Towards stable test-time adaptation in dynamic wild world,” in The Eleventh International Conference on Learning Representations, 2023. [Online]. Available: https://openreview.net/forum?id=g2YraF75Tj

  23. [23]

Li-tta: Language informed test-time adaptation for automatic speech recognition,

E. Yoon, H. S. Yoon, J. Harvill, M. Hasegawa-Johnson, and C. D. Yoo, “Li-tta: Language informed test-time adaptation for automatic speech recognition,” in Interspeech 2024, 2024, pp. 3490–3494

  24. [24]

    Suta-lm: Bridging test-time adaptation and language model rescoring for robust asr,

W.-P. Huang, G.-T. Lin, and H.-y. Lee, “Suta-lm: Bridging test-time adaptation and language model rescoring for robust asr,”

  25. [25]

    Available: https://arxiv.org/abs/2506.11121

    [Online]. Available: https://arxiv.org/abs/2506.11121

  26. [26]

wav2vec 2.0: A framework for self-supervised learning of speech representations,

A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” in Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds., vol. 33. Curran Associates, Inc., 2020, pp. 12449–12460. [Online]. Available: https://proceedings.neur...

  27. [27]

    Efficient conformer: Progressive downsampling and grouped attention for automatic speech recognition,

M. Burchi and V. Vielzeuf, “Efficient conformer: Progressive downsampling and grouped attention for automatic speech recognition,” 2021. [Online]. Available: https://arxiv.org/abs/2109.01163

  28. [28]

Right question is already half the answer: Fully unsupervised LLM reasoning incentivization,

Q. Zhang, H. Wu, C. Zhang, P. Zhao, and Y. Bian, “Right question is already half the answer: Fully unsupervised LLM reasoning incentivization,” Advances in Neural Information Processing Systems, 2025

  29. [29]

Seed-grpo: Semantic entropy enhanced grpo for uncertainty-aware policy optimization. arXiv preprint arXiv:2505.12346, 2025

M. Chen, G. Chen, W. Wang, and Y. Yang, “Seed-grpo: Semantic entropy enhanced grpo for uncertainty-aware policy optimization,” arXiv preprint arXiv:2505.12346, 2025

  30. [30]

    Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation,

L. Kuhn, Y. Gal, and S. Farquhar, “Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation,” in The Eleventh International Conference on Learning Representations, 2023. [Online]. Available: https://openreview.net/forum?id=VD-AYtP0dve

  31. [31]

Simple statistical gradient-following algorithms for connectionist reinforcement learning,

R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Mach. Learn., vol. 8, no. 3–4, pp. 229–256, May 1992. [Online]. Available: https://doi.org/10.1007/BF00992696

  32. [32]

    Back to basics: Revisiting REINFORCE-style optimization for learning from human feedback in LLMs,

A. Ahmadian, C. Cremer, M. Gallé, M. Fadaee, J. Kreutzer, O. Pietquin, A. Üstün, and S. Hooker, “Back to basics: Revisiting REINFORCE-style optimization for learning from human feedback in LLMs,” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), L.-W. Ku, A. Martins, and V. Srikumar, Ed...

  33. [33]

DAPO: Improving multi-step reasoning abilities of large language models with direct advantage-based policy optimization,

J. Liu, C. Wang, C. Y. Liu, L. Zeng, R. Yan, Y. Sun, and Y. Liu, “DAPO: Improving multi-step reasoning abilities of large language models with direct advantage-based policy optimization,” in The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. [Online]. Available: https://openreview.net/forum?id=77eEDRhPkQ

  34. [34]

    Continual test-time adaptation for end-to-end speech recognition on noisy speech,

G.-T. Lin, W. P. Huang, and H.-y. Lee, “Continual test-time adaptation for end-to-end speech recognition on noisy speech,” in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Y. Al-Onaizan, M. Bansal, and Y.-N. Chen, Eds. Miami, Florida, USA: Association for Computational Linguistics, Nov. 2024, pp. 20003–20015. [...

  35. [35]

L2-arctic: A non-native English speech corpus,

G. Zhao, S. Sonsaat, A. Silpachai, I. Lucic, E. Chukharev-Hudilainen, J. Levis, and R. Gutierrez-Osuna, “L2-arctic: A non-native English speech corpus,” in Interspeech 2018, 2018, pp. 2783–2787

  36. [36]

Mls: A large-scale multilingual dataset for speech research,

V. Pratap, Q. Xu, A. Sriram, G. Synnaeve, and R. Collobert, “Mls: A large-scale multilingual dataset for speech research,” ArXiv, vol. abs/2012.03411, 2020

  37. [37]

    Common voice: A massively-multilingual speech corpus,

    R. Ardila, M. Branson, K. Davis, M. Henretty, M. Kohler, J. Meyer, R. Morais, L. Saunders, F. M. Tyers, and G. Weber, “Common voice: A massively-multilingual speech corpus,” in Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), 2020, pp. 4211–4215

  38. [38]

The Special Case of Non-Autoregressive ASR Our framework also provides a theoretical perspective on earlier TTA methods for non-autoregressive ASR [8]

Appendix 10.1. The Special Case of Non-Autoregressive ASR Our framework also provides a theoretical perspective on earlier TTA methods for non-autoregressive ASR [8]. In non-autoregressive models, the output at each frame is independent of other frames. In this setting, the token-level entropy estimator, Ĥ_tok(y), does not depend on a particular outp...

  39. [39]

First, the requirement for multiple samples to ensure accurate gradient estimation increases the computational overhead during the adaptation phase

Limitation and Future Work Despite the theoretical and empirical advantages of our unified EM objective, several limitations remain. First, the requirement for multiple samples to ensure accurate gradient estimation increases the computational overhead during the adaptation phase. Compared to the teacher-forcing heuristic, which only requires a single...