When Labels Have Structure: Improving Image Classification with Hierarchy-Aware Cross-Entropy
Pith reviewed 2026-05-08 13:00 UTC · model grok-4.3
The pith
Incorporating a class hierarchy into the loss function improves image classification accuracy over standard cross-entropy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In end-to-end training, HACE improves accuracy over standard cross-entropy in 15 out of 18 architecture-dataset pairs, with a mean gain of 4.66%. In linear probing on frozen DINOv2-Large features, HACE outperforms all competing methods on all three datasets, with a mean improvement of 2.18% over the next best baseline. The method combines prediction aggregation, which propagates probability mass upward through the class hierarchy, and ancestral label smoothing, which distributes the ground-truth signal along ancestry paths.
What carries the argument
Hierarchy-Aware Cross-Entropy (HACE), a loss that integrates two components: prediction aggregation, which accumulates parent-node confidence from the node's children, and ancestral label smoothing, which distributes ground-truth probability along the path from the true class to the root.
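The two components can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the toy hierarchy and the names `PARENT`, `LEAVES`, and `hace_like_loss` are hypothetical, and the exact form of the loss (cross-entropy between uniformly weighted path targets and aggregated node probabilities) is an assumption, since the source describes the components only in words.

```python
import math

# Hypothetical toy hierarchy (child -> parent); "root" has no parent.
PARENT = {
    "animal": "root", "vehicle": "root",
    "cat": "animal", "dog": "animal",
    "car": "vehicle", "truck": "vehicle",
}
LEAVES = ["cat", "dog", "car", "truck"]  # order matches the logit vector


def path_to_root(node):
    """Return [node, parent, ..., root]."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(leaf_probs):
    """Prediction aggregation: each node receives the mass of its descendant leaves."""
    node_probs = {}
    for leaf, p in zip(LEAVES, leaf_probs):
        for node in path_to_root(leaf):
            node_probs[node] = node_probs.get(node, 0.0) + p
    return node_probs


def hace_like_loss(logits, true_leaf):
    """Assumed loss: uniform target weights along the true path (ancestral
    label smoothing) scored against the aggregated node probabilities."""
    node_probs = aggregate(softmax(logits))
    path = path_to_root(true_leaf)
    weight = 1.0 / len(path)
    return -sum(weight * math.log(node_probs[n]) for n in path)
```

Under these assumptions, a confident wrong prediction of a sibling class (`dog` for a true `cat`) is penalized less than a confident wrong prediction in another branch (`car`), because the shared parent `animal` still receives most of the probability mass.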
If this is right
- HACE functions as a drop-in replacement for cross-entropy and requires no change to model architecture.
- Accuracy gains appear in 15 of 18 architecture-dataset pairs, spanning convolutional and attention-based networks on CIFAR-100, FGVC Aircraft, and NABirds.
- The same loss also improves linear probes on frozen pre-trained features from DINOv2-Large.
- By respecting semantic distances, the trained models make fewer errors between unrelated classes.
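The last point, fewer errors between unrelated classes, is something one could measure directly. The sketch below shows one common way to do so, assuming a tree-shaped hierarchy: score each mistake by how far up the tree one must climb from the true class to reach an ancestor shared with the prediction. The toy hierarchy and the function names are hypothetical, and the source does not say which severity measure, if any, the paper reports.

```python
# Hypothetical toy hierarchy (child -> parent); not taken from the paper.
PARENT = {
    "animal": "root", "vehicle": "root",
    "cat": "animal", "dog": "animal",
    "car": "vehicle", "truck": "vehicle",
}


def path_to_root(node):
    """Return [node, parent, ..., root]."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def lca_height(pred, true):
    """Steps from `true` up to the lowest common ancestor of `pred` and `true`."""
    pred_ancestors = set(path_to_root(pred))
    for steps, node in enumerate(path_to_root(true)):
        if node in pred_ancestors:
            return steps
    raise ValueError("nodes do not share a root")


def mean_mistake_severity(preds, labels):
    """Average LCA height over misclassified examples only (0.0 if none)."""
    heights = [lca_height(p, t) for p, t in zip(preds, labels) if p != t]
    return sum(heights) / len(heights) if heights else 0.0
```

A model trained with a hierarchy-aware loss would be expected to show a lower mean severity than a cross-entropy baseline at similar accuracy, if the claim holds.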
Where Pith is reading between the lines
- The same loss formulation could be applied to hierarchical label sets outside vision, such as product taxonomies or medical diagnosis codes.
- Automatically inferring or refining the hierarchy from data might extend the benefits to datasets that lack an explicit tree.
- Combining HACE with existing regularization methods could produce additive gains in generalization.
- Scaling the approach to ImageNet-scale hierarchies would test whether the observed improvements hold when the tree becomes deeper and wider.
Load-bearing premise
The supplied class hierarchy accurately encodes the semantic distances that matter for distinguishing the classes in the task.
What would settle it
Training the same architectures on the same datasets with an independently verified accurate hierarchy and finding that HACE produces equal or lower accuracy than standard cross-entropy on average would falsify the central claim.
Original abstract
Standard cross-entropy is the default classification loss across virtually all of machine learning, yet it treats all misclassifications equally, ignoring the semantic distances that a class hierarchy encodes. We propose Hierarchy-Aware Cross-Entropy (HACE), a drop-in replacement for standard cross-entropy that incorporates a known class hierarchy directly into the loss. HACE combines two components: prediction aggregation, which propagates the model's probability mass upward through the class hierarchy to ensure that parent nodes accumulate the confidence of their children; and ancestral label smoothing, which distributes the ground-truth signal along the path from the true class to the root. We evaluate HACE on CIFAR-100, FGVC Aircraft, and NABirds in two regimes: end-to-end training across six architectures spanning convolutional and attention-based designs, and linear probing on frozen DINOv2-Large features. In end-to-end training, HACE improves accuracy over standard cross-entropy in 15 out of 18 architecture-dataset pairs, with a mean gain of 4.66%. In linear probing on frozen DINOv2-Large features, HACE outperforms all competing methods on all three datasets, with a mean improvement of 2.18% over the next best baseline.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Hierarchy-Aware Cross-Entropy (HACE) as a drop-in replacement for standard cross-entropy that incorporates a known class hierarchy via two components: prediction aggregation (upward propagation of model probabilities to parent nodes) and ancestral label smoothing (distributing ground-truth probability mass along the ancestry path to the root). It reports consistent accuracy gains over vanilla cross-entropy in end-to-end training on CIFAR-100, FGVC Aircraft, and NABirds across six architectures (15/18 pairs, mean +4.66%), and superior results in linear probing on frozen DINOv2-Large features (mean +2.18% over the next-best baseline).
Significance. If the empirical results hold under full scrutiny, HACE provides a simple, hierarchy-aware loss that could be adopted as a default when class taxonomies are available, particularly for fine-grained datasets. The gains in both full training and linear-probing regimes, plus the method's parameter-free nature relative to the hierarchy, represent a modest but practical advance over treating all misclassifications equally.
Major comments (2)
- [§4] Experiments, Table 1: the reported mean gain of 4.66% aggregates across 18 pairs without per-pair standard deviations or paired statistical tests; several individual improvements appear small enough that run-to-run variance could alter the 15/18 count.
- [§3.2] Ancestral label smoothing: the smoothing distributes mass uniformly along the path to the root, but no ablation is shown on alternative weightings (e.g., exponential decay by depth) or on the sensitivity of final accuracy to the smoothing coefficient; this choice is load-bearing for the claimed generalization benefit.
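The alternative weighting raised in the second comment can be made concrete with a small sketch. The function below is hypothetical and is not part of the paper: it assigns a weight to each node on the path from the true class to the root, where `decay=1.0` reproduces the uniform scheme the referee attributes to HACE and `decay<1.0` gives the exponential decay by depth suggested as an ablation.

```python
def path_weights(path_length, decay=1.0):
    """Target weights for nodes ordered from the true class (index 0) to the root.

    decay=1.0 -> uniform weights along the path.
    decay<1.0 -> geometrically smaller weights for more distant ancestors.
    """
    if path_length < 1:
        raise ValueError("path must contain at least the true class")
    if decay <= 0.0:
        raise ValueError("decay must be positive")
    raw = [decay ** k for k in range(path_length)]
    total = sum(raw)
    return [w / total for w in raw]  # normalized so the weights sum to 1
```

Sweeping `decay` over a small grid would be one simple way to run the requested sensitivity analysis, since the uniform scheme is recovered as a special case.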
Minor comments (2)
- The abstract and §4 claim outperformance over 'all competing methods' in linear probing, but the exact list of baselines and their hyper-parameter tuning protocols should be stated explicitly for reproducibility.
- [§2] Notation for the hierarchy (parent/child relations, depth) is introduced in §2 but used without a small illustrative diagram; adding one would clarify the upward aggregation step.
Simulated Author's Rebuttal
We thank the referee for the thorough review and the recommendation for minor revision. We address each of the major comments below and outline the changes we will make to the manuscript.
- Referee: [§4] Experiments, Table 1: the reported mean gain of 4.66% aggregates across 18 pairs without per-pair standard deviations or paired statistical tests; several individual improvements appear small enough that run-to-run variance could alter the 15/18 count.
  Authors: We agree that including standard deviations and statistical tests would enhance the rigor of our empirical evaluation. Due to the significant computational resources required to train six different architectures on three datasets, we performed single runs for each experiment. Nevertheless, the improvements are consistent across 15 out of 18 diverse settings, with several gains being substantial (over 5% in multiple cases). In the revised manuscript, we will expand Table 1 to list the per-pair accuracy differences explicitly and add a paragraph in §4 discussing the single-run limitation and the robustness suggested by the breadth of our experiments. We will also note that future work could include multi-seed evaluations for formal statistical testing.
  Revision: partial
- Referee: [§3.2] Ancestral label smoothing: the smoothing distributes mass uniformly along the path to the root, but no ablation is shown on alternative weightings (e.g., exponential decay by depth) or on the sensitivity of final accuracy to the smoothing coefficient; this choice is load-bearing for the claimed generalization benefit.
  Authors: The uniform distribution was deliberately chosen to ensure the method remains simple, hyperparameter-free, and a true drop-in replacement for cross-entropy. We will revise the text in §3.2 to provide a clearer justification for this design decision, emphasizing its alignment with the goal of incorporating hierarchy without additional complexity. To address the sensitivity concern, we will include in the supplementary material a plot or table showing accuracy as a function of the smoothing coefficient on CIFAR-100 for one architecture. Regarding alternative weightings, we will add a discussion acknowledging that non-uniform schemes (such as depth-based decay) could be explored in future work and may yield further improvements, but that the uniform approach already delivers consistent gains.
  Revision: partial
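The paired test discussed in the first exchange can be illustrated with an exact sign test on the reported win count. This is a rough sketch under strong assumptions, not an analysis from the paper: it treats the three non-improving pairs as losses (a standard sign test would drop ties), treats the 18 pairs as independent even though they share datasets and architectures, and says nothing about run-to-run variance, which only multi-seed runs can address. The function name is hypothetical.

```python
from math import comb


def sign_test_p_value(wins, n):
    """Exact two-sided sign test under the null that a win has probability 0.5."""
    tail = min(wins, n - wins)
    # Probability of a result at least as lopsided in one direction.
    one_sided = sum(comb(n, k) for k in range(tail + 1)) / 2 ** n
    return min(1.0, 2 * one_sided)


# Reported count: HACE improves on 15 of 18 architecture-dataset pairs.
p = sign_test_p_value(15, 18)
print(f"two-sided sign-test p-value: {p:.4f}")
```

A small p-value here would only indicate that the direction of the effect is unlikely to be a coin flip across pairs; it would not show that any individual gain exceeds seed-to-seed noise.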
Circularity Check
No significant circularity in HACE derivation
The paper defines Hierarchy-Aware Cross-Entropy directly as a combination of prediction aggregation (upward probability propagation through the given hierarchy) and ancestral label smoothing (distributing ground-truth along ancestry paths), both constructed from the supplied class hierarchy and standard cross-entropy. No load-bearing equation reduces to a fitted parameter renamed as a prediction, no self-citation chain justifies a uniqueness claim, and no ansatz is smuggled in. The reported accuracy gains (15/18 pairs, mean +4.66%) are presented as empirical outcomes rather than mathematical derivations that collapse to the inputs by construction. The method is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: a class hierarchy is available and correctly represents semantic relationships between classes.