LogitDynamics: Reliable ViT Error Detection from Layerwise Logit Trajectories
Pith reviewed 2026-05-10 16:21 UTC · model grok-4.3
The pith
A linear probe on how class logits evolve across the final layers of a Vision Transformer can predict when its classification is wrong.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The trajectories of class logits and the instability of top-ranked classes across the last L layers of a ViT supply a usable signal for whether the final prediction is an error, which can be extracted via simple auxiliary heads and fed to a linear classifier.
What carries the argument
LogitDynamics features: the logits of the predicted class and top-K competitors plus instability statistics of top-ranked classes, collected from the last L layers through attached linear heads and used as input to an error-predicting linear probe.
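The feature construction described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `logit_dynamics_features`, the argument names, and the three instability statistics (top-1 flip count, agreement with the final prediction, number of distinct top-1 classes) are assumptions chosen for the example, since the summary does not define the statistics explicitly.

```python
import numpy as np

def logit_dynamics_features(layer_logits: np.ndarray, top_k: int = 3) -> np.ndarray:
    """Build one feature vector from per-layer logits (hypothetical helper).

    layer_logits has shape (L, C): row l holds the class logits read out by a
    linear head at the l-th of the last L layers, with the final layer last.
    """
    final = layer_logits[-1]
    order = np.argsort(-final)            # classes sorted by final-layer logit
    pred = order[0]                       # the model's predicted class
    competitors = order[1:1 + top_k]      # its top-K competitors

    # Logit trajectories across depth.
    pred_traj = layer_logits[:, pred]                  # shape (L,)
    comp_traj = layer_logits[:, competitors].ravel()   # shape (L * K,)

    # Illustrative instability statistics (assumed, not taken from the paper).
    top1 = layer_logits.argmax(axis=1)                 # top-1 class at each layer
    flips = np.count_nonzero(top1[1:] != top1[:-1])    # top-1 changes between layers
    agree = np.mean(top1 == pred)                      # share of layers matching the final prediction
    distinct = len(np.unique(top1))                    # distinct top-1 classes seen

    stats = np.array([flips, agree, distinct], dtype=np.float64)
    return np.concatenate([pred_traj, comp_traj, stats])

# Toy example: L = 3 layers, C = 4 classes, K = 2 competitors.
logits = np.array([[2.0, 1.0, 0.0, -1.0],
                   [0.5, 1.5, 0.0, -1.0],
                   [0.2, 2.5, 0.1, -1.0]])
feats = logit_dynamics_features(logits, top_k=2)
print(feats.shape)  # (12,) = L + L*K + 3
```

The resulting vector has fixed length L + L·K + 3 regardless of the number of classes, which is one plausible reason such features could be reused across datasets with different label sets.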
If this is right
- Error detection requires only a single forward pass plus lightweight heads and a probe.
- AUCPR matches or exceeds that of standard confidence baselines on the evaluated datasets.
- Cross-dataset transfer of the error predictor is stronger than for competing methods.
- The approach exploits internal depth-wise signals in a manner parallel to hallucination detection in language models.
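To make the AUCPR comparison concrete, the sketch below computes an average-precision estimate of AUCPR with errors treated as the positive class, together with a maximum softmax probability (MSP) baseline score. Treating errors as positives and using average precision as the AUCPR estimator are assumptions; the summary does not state which convention the paper uses. The helper names are hypothetical.

```python
import numpy as np

def average_precision(is_error: np.ndarray, score: np.ndarray) -> float:
    """Average precision with errors as the positive class.

    Higher score means "more likely an error". Tied scores are ranked in
    input order, which is a simplification.
    """
    order = np.argsort(-score, kind="stable")
    hits = is_error[order].astype(np.float64)
    n_pos = hits.sum()
    if n_pos == 0:
        raise ValueError("average precision is undefined without any errors")
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float((precision_at_k * hits).sum() / n_pos)

def msp_error_score(final_logits: np.ndarray) -> np.ndarray:
    """MSP baseline: 1 - max softmax probability, computed per row."""
    z = final_logits - final_logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return 1.0 - probs.max(axis=1)

# Toy usage: the second sample has a flat distribution, so it scores higher.
scores = msp_error_score(np.array([[5.0, 0.0, 0.0], [1.0, 1.0, 1.0]]))
print(scores)
```

A random scorer's expected average precision is roughly the error rate of the classifier, so AUCPR values should be read against that prevalence rather than against 0.5.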
Where Pith is reading between the lines
- ViTs appear to accumulate and sometimes revise class evidence progressively through later layers instead of committing to a decision only at the output.
- The same layerwise logit tracking could be tested on convolutional networks or other sequential architectures to check whether comparable instability signals appear outside transformers.
- In settings with distribution shift, combining these dynamics with final-layer confidence might yield more robust uncertainty estimates than either alone.
- Replacing the linear probe with a small nonlinear model or selecting layers adaptively could be explored to see whether further gains are available without losing the method's simplicity.
Load-bearing premise
The logit values and instability statistics from the last L layers encode a signal about misclassifications that generalizes across datasets and is a property of the ViT itself rather than an artifact of the auxiliary heads.
What would settle it
Train the error probe on features from one dataset, such as ImageNet, and evaluate it on a disjoint domain, such as medical images. If its AUCPR there falls to the level of a random guess or below the strongest baseline, the claim of a generalizable logit-trajectory signal is falsified.
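The check described above could be run along the following lines. This is a sketch under stated assumptions: the probe is a logistic regression fitted by plain gradient descent, normalization statistics come from the source data only, the data are synthetic stand-ins for real features, and the names `fit_linear_probe`, `transfer_check`, and `claim_survives` are hypothetical.

```python
import numpy as np

def fit_linear_probe(x, y, steps=500, lr=0.1, l2=1e-3):
    """Logistic regression via full-batch gradient descent on source data only."""
    mean, std = x.mean(axis=0), x.std(axis=0) + 1e-8
    xn = (x - mean) / std
    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(xn @ w + b)))
        grad = p - y                               # gradient of the mean BCE w.r.t. the logit
        w -= lr * (xn.T @ grad / len(y) + l2 * w)
        b -= lr * grad.mean()
    return w, b, mean, std

def probe_scores(x, probe):
    """Apply the frozen probe; source normalization is reused unchanged."""
    w, b, mean, std = probe
    return ((x - mean) / std) @ w + b

def average_precision(is_error, score):
    order = np.argsort(-score, kind="stable")
    hits = is_error[order].astype(float)
    return float((np.cumsum(hits) / np.arange(1, len(hits) + 1) * hits).sum() / hits.sum())

def transfer_check(probe, x_target, err_target, baseline_score):
    ap_probe = average_precision(err_target, probe_scores(x_target, probe))
    ap_base = average_precision(err_target, baseline_score)
    chance = float(err_target.mean())              # expected AP of a random scorer
    return {"probe": ap_probe, "baseline": ap_base, "chance": chance,
            "claim_survives": ap_probe > chance and ap_probe >= ap_base}

# Synthetic stand-in data: feature 0 carries the error signal in both domains.
rng = np.random.default_rng(0)

def make_split(n, shift):
    err = rng.random(n) < 0.3
    x = rng.normal(size=(n, 4)) + shift
    x[:, 0] += 2.0 * err
    return x, err

x_src, err_src = make_split(400, shift=0.0)
x_tgt, err_tgt = make_split(400, shift=0.5)
probe = fit_linear_probe(x_src, err_src.astype(float))
report = transfer_check(probe, x_tgt, err_tgt, baseline_score=rng.normal(size=400))
print(report)
```

Nothing fitted on the target split enters the probe, which is the property the falsification test depends on.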
Original abstract
Reliable confidence estimation is critical when deploying vision models. We study error prediction: determining whether an image classifier's output is correct using only signals from a single forward pass. Motivated by internal-signal hallucination detection in large language models, we investigate whether similar depth-wise signals exist in Vision Transformers (ViTs). We propose a simple method that models how class evidence evolves across layers. By attaching lightweight linear heads to intermediate layers, we extract features from the last L layers that capture both the logits of the predicted class and its top-K competitors, as well as statistics describing instability of top-ranked classes across depth. A linear probe trained on these features predicts the error indicator. Across datasets, our method improves or matches AUCPR over baselines and shows stronger cross-dataset generalization while requiring minimal additional computation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes LogitDynamics for error detection in Vision Transformers: lightweight linear heads are attached to intermediate layers to extract features from the last L layers, including logits of the predicted class and its top-K competitors plus depth-wise instability statistics of top-ranked classes. A linear probe is then trained on these features to predict whether the ViT's classification is erroneous. The method is claimed to improve or match AUCPR over baselines across datasets while showing stronger cross-dataset generalization and requiring minimal extra computation, motivated by internal-signal approaches in LLMs.
Significance. If the central claims hold with proper controls, the work could offer a practical, low-overhead technique for post-hoc error detection in ViT classifiers by leveraging layerwise logit trajectories. This would be valuable for reliable deployment in safety-critical vision applications and could bridge ideas from LLM hallucination detection to vision models. The emphasis on cross-dataset transfer and minimal computation is a potential strength if the features prove intrinsic rather than artifactual.
major comments (3)
- [Abstract] Abstract: the claim of 'stronger cross-dataset generalization' is load-bearing for the paper's contribution, yet the description does not specify whether the auxiliary linear heads are frozen from the source training distribution or retrained on target data when evaluating transfer. If retrained, the extracted top-K logits and instability statistics (e.g., rank changes across depth) necessarily encode target-specific class boundaries, making apparent AUCPR gains and generalization an artifact of head adaptation rather than a property of ViT logit trajectories.
- [Method] Method (feature extraction paragraph): the instability statistics are described only at a high level ('statistics describing instability of top-ranked classes across depth') with no explicit equations or definitions (e.g., whether variance of logit values, rank-flip counts, or normalized entropy). Without these, it is impossible to determine if the features are reproducible or if they reduce to quantities already captured by standard baselines such as maximum softmax probability.
- [Results] Results (cross-dataset experiments): the abstract asserts AUCPR improvements and stronger generalization, but the manuscript must report the exact numerical deltas, the precise baseline definitions (e.g., MSP, temperature scaling, or other logit-based detectors), the value of L and K, and whether any statistical tests (e.g., paired t-tests or bootstrap confidence intervals) support the 'improves or matches' statement. Absent these, the central empirical claim cannot be assessed.
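One way to supply the bootstrap confidence intervals requested above is a paired bootstrap over test samples, sketched here. The resampling scheme, the percentile interval, and the names `paired_bootstrap_delta` and `average_precision` are assumptions for illustration; the paper may use a different test.

```python
import numpy as np

def average_precision(is_error, score):
    """Average precision with errors as the positive class."""
    order = np.argsort(-score, kind="stable")
    hits = is_error[order].astype(float)
    return float((np.cumsum(hits) / np.arange(1, len(hits) + 1) * hits).sum() / hits.sum())

def paired_bootstrap_delta(is_error, score_a, score_b, metric,
                           n_boot=1000, alpha=0.05, seed=0):
    """Percentile CI for metric(A) - metric(B), resampling the same indices for both."""
    rng = np.random.default_rng(seed)
    n = len(is_error)
    deltas = []
    while len(deltas) < n_boot:
        idx = rng.integers(0, n, size=n)
        if not is_error[idx].any():      # skip resamples with no errors (metric undefined)
            continue
        deltas.append(metric(is_error[idx], score_a[idx])
                      - metric(is_error[idx], score_b[idx]))
    lo, hi = np.quantile(deltas, [alpha / 2, 1 - alpha / 2])
    point = metric(is_error, score_a) - metric(is_error, score_b)
    return point, float(lo), float(hi)

# Synthetic example: detector A is informative, detector B is pure noise.
rng = np.random.default_rng(1)
err = rng.random(300) < 0.3
score_a = err + 0.5 * rng.normal(size=300)
score_b = rng.normal(size=300)
point, lo, hi = paired_bootstrap_delta(err, score_a, score_b, average_precision, n_boot=500)
print(f"delta AUCPR = {point:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

An interval that excludes zero supports "improves"; an interval that straddles zero is consistent only with "matches".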
minor comments (2)
- [Abstract] The abstract and introduction should explicitly cite the specific LLM hallucination-detection papers that motivate the layerwise approach, including any quantitative parallels drawn.
- [Method] Notation for the linear probe and the error indicator variable should be introduced with equations rather than prose to avoid ambiguity when describing the training objective.
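As an illustration of the notation the last comment asks for, one possible formalization is given below. The symbols and the binary cross-entropy objective are assumptions; the abstract states only that a linear probe is trained on the features to predict the error indicator.

```latex
% Assumed notation: x_i input image, y_i true label, \hat{y}_i ViT prediction,
% \phi_i \in \mathbb{R}^d the LogitDynamics feature vector, N training samples.
\begin{align}
  e_i &= \mathbb{1}\left[\hat{y}_i \neq y_i\right]
      && \text{(error indicator)} \\
  p_i &= \sigma\!\left(w^\top \phi_i + b\right)
      && \text{(linear probe, } \sigma \text{ the logistic function)} \\
  \mathcal{L}(w, b) &= -\frac{1}{N} \sum_{i=1}^{N}
      \left[ e_i \log p_i + (1 - e_i) \log (1 - p_i) \right]
      && \text{(assumed training objective)}
\end{align}
```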
Simulated Author's Rebuttal
We thank the referee for their thorough review and valuable suggestions. We have carefully considered each comment and provide point-by-point responses below. Where appropriate, we have revised the manuscript to address the concerns raised.
- Referee: [Abstract] Abstract: the claim of 'stronger cross-dataset generalization' is load-bearing for the paper's contribution, yet the description does not specify whether the auxiliary linear heads are frozen from the source training distribution or retrained on target data when evaluating transfer. If retrained, the extracted top-K logits and instability statistics (e.g., rank changes across depth) necessarily encode target-specific class boundaries, making apparent AUCPR gains and generalization an artifact of head adaptation rather than a property of ViT logit trajectories.
Authors: We agree that this clarification is essential. In the cross-dataset experiments, the auxiliary linear heads are trained on the source dataset and kept frozen for the target dataset. The linear probe is likewise trained using source features only. This design ensures the extracted features capture intrinsic ViT logit trajectories rather than target-specific adaptations. We have revised the abstract and the method section to explicitly describe this experimental protocol.
Revision: yes
- Referee: [Method] Method (feature extraction paragraph): the instability statistics are described only at a high level ('statistics describing instability of top-ranked classes across depth') with no explicit equations or definitions (e.g., whether variance of logit values, rank-flip counts, or normalized entropy). Without these, it is impossible to determine if the features are reproducible or if they reduce to quantities already captured by standard baselines such as maximum softmax probability.
Authors: We acknowledge the need for greater precision in describing the instability statistics. In the revised version, we will include explicit equations defining these features. These will be accompanied by pseudocode to facilitate reproducibility and to clarify how they differ from single-layer baselines like maximum softmax probability.
Revision: yes
- Referee: [Results] Results (cross-dataset experiments): the abstract asserts AUCPR improvements and stronger generalization, but the manuscript must report the exact numerical deltas, the precise baseline definitions (e.g., MSP, temperature scaling, or other logit-based detectors), the value of L and K, and whether any statistical tests (e.g., paired t-tests or bootstrap confidence intervals) support the 'improves or matches' statement. Absent these, the central empirical claim cannot be assessed.
Authors: We will expand the results section to provide a table with exact AUCPR values for LogitDynamics and all compared baselines, including clear definitions of MSP and other methods. The values of L and K used in the experiments will be stated, along with any statistical tests such as bootstrap confidence intervals to substantiate the performance claims.
Revision: yes
Circularity Check
No circularity: independent linear probe trained on extracted logit features
The paper's core method attaches auxiliary linear heads to intermediate ViT layers, extracts logit values for the predicted class plus top-K competitors along with depth-wise instability statistics from the last L layers, and trains a separate linear probe on these features to predict the binary error indicator. This is a standard supervised feature-based classifier with no equations or definitions that reduce the error prediction to the input features or fitted quantities by construction. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing steps in the provided derivation. Cross-dataset generalization is presented as an empirical outcome rather than a definitional necessity, and the approach remains self-contained against external benchmarks without renaming known results or smuggling assumptions.