
arxiv: 2605.06274 · v1 · submitted 2026-05-07 · 💻 cs.LG · cs.CV

When Labels Have Structure: Improving Image Classification with Hierarchy-Aware Cross-Entropy

Pith reviewed 2026-05-08 13:00 UTC · model grok-4.3

keywords hierarchy-aware cross-entropy · image classification · class hierarchy · label smoothing · prediction aggregation · CIFAR-100 · FGVC Aircraft · NABirds

The pith

Incorporating a class hierarchy into the loss function improves image classification accuracy over standard cross-entropy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Hierarchy-Aware Cross-Entropy (HACE) to replace standard cross-entropy by directly using a known class hierarchy. HACE aggregates model predictions upward so that parent classes receive probability mass from their children and applies ancestral label smoothing to spread the true label signal along the path from the correct class to the root. This approach yields higher accuracy than standard cross-entropy in 15 of 18 architecture-dataset combinations during end-to-end training and beats all tested baselines during linear probing on frozen features. A sympathetic reader would care because everyday classification problems involve classes that share semantic structure, and ignoring that structure forces models to treat every error as equally bad.

Core claim

HACE improves accuracy over standard cross-entropy in 15 out of 18 architecture-dataset pairs, with a mean gain of 4.66%. In linear probing on frozen DINOv2-Large features, HACE outperforms all competing methods on all three datasets, with a mean improvement of 2.18% over the next best baseline. The method combines prediction aggregation, which propagates probability mass upward through the class hierarchy, and ancestral label smoothing, which distributes the ground-truth signal along ancestry paths.

What carries the argument

Hierarchy-Aware Cross-Entropy (HACE), which combines prediction aggregation, where each parent node accumulates the confidence of its children, with ancestral label smoothing, where ground-truth probability is distributed along the path from the true class to the root.
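
The two components can be made concrete with a small sketch. This is an illustration under stated assumptions, not the authors' implementation: the toy hierarchy, the function names, the choice to include the root in the path, and the uniform weighting over path nodes are all assumptions made here, since this page does not reproduce the paper's equations.

```python
import math

# Hypothetical toy hierarchy: child -> parent, with "root" at the top.
PARENT = {
    "animal": "root",
    "vehicle": "root",
    "dog": "animal",
    "cat": "animal",
    "car": "vehicle",
}
LEAVES = ["dog", "cat", "car"]  # order matches the model's logits


def path_to_root(node):
    """Return [node, parent, ..., 'root']."""
    path = [node]
    while path[-1] != "root":
        path.append(PARENT[path[-1]])
    return path


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(leaf_probs):
    """Prediction aggregation: every node receives the mass of the leaves below it."""
    node_probs = {}
    for leaf, p in zip(LEAVES, leaf_probs):
        for node in path_to_root(leaf):
            node_probs[node] = node_probs.get(node, 0.0) + p
    return node_probs


def ancestral_targets(true_leaf):
    """Ancestral label smoothing, assumed uniform over the path to the root."""
    path = path_to_root(true_leaf)
    return {node: 1.0 / len(path) for node in path}


def hace_sketch(logits, true_leaf):
    """Cross-entropy between smoothed ancestral targets and aggregated predictions."""
    node_probs = aggregate(softmax(logits))
    targets = ancestral_targets(true_leaf)
    return -sum(w * math.log(node_probs[n]) for n, w in targets.items())


# True class is "dog"; the model is confidently wrong in two different ways.
near_miss = hace_sketch([0.0, 3.0, 0.0], "dog")  # predicts "cat" (same parent)
far_miss = hace_sketch([0.0, 0.0, 3.0], "dog")   # predicts "car" (different parent)
print(near_miss < far_miss)  # True
```

Standard cross-entropy would assign both mistakes the same loss, because the probability on "dog" is identical in the two cases. In this sketch the sibling confusion is penalized less because the shared parent "animal" still receives most of the mass. The root always aggregates to probability 1, so its term contributes zero loss here.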

If this is right

  • HACE functions as a drop-in replacement for cross-entropy and requires no change to model architecture.
  • Accuracy gains appear in 15 of 18 architecture-dataset pairs, spanning convolutional and attention-based networks on CIFAR-100, FGVC Aircraft, and NABirds.
  • The same loss also improves linear probes on frozen pre-trained features from DINOv2-Large.
  • By respecting semantic distances, the trained models make fewer errors between unrelated classes.
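
The last point, fewer errors between unrelated classes, can be checked by measuring how far apart the true and predicted classes sit in the tree. The sketch below is one possible way to do that, not a metric this page attributes to the paper: the toy hierarchy and the names `tree_distance` and `mean_mistake_distance` are hypothetical.

```python
# Hypothetical toy hierarchy: child -> parent; "root" has no parent entry.
PARENT = {
    "animal": "root",
    "vehicle": "root",
    "dog": "animal",
    "cat": "animal",
    "car": "vehicle",
}


def ancestors(node):
    """Return the path [node, ..., 'root']."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def tree_distance(a, b):
    """Number of edges between a and b via their lowest common ancestor."""
    steps_from_b = {node: i for i, node in enumerate(ancestors(b))}
    for steps_from_a, node in enumerate(ancestors(a)):
        if node in steps_from_b:
            return steps_from_a + steps_from_b[node]
    raise ValueError("nodes do not share a root")


def mean_mistake_distance(true_labels, predicted_labels):
    """Average tree distance over misclassified examples only."""
    dists = [
        tree_distance(t, p)
        for t, p in zip(true_labels, predicted_labels)
        if t != p
    ]
    return sum(dists) / len(dists) if dists else 0.0


print(tree_distance("dog", "cat"))  # 2
print(tree_distance("dog", "car"))  # 4
```

A lower mean mistake distance at equal accuracy would indicate that the remaining errors fall between semantically closer classes.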

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same loss formulation could be applied to hierarchical label sets outside vision, such as product taxonomies or medical diagnosis codes.
  • Automatically inferring or refining the hierarchy from data might extend the benefits to datasets that lack an explicit tree.
  • Combining HACE with existing regularization methods could produce additive gains in generalization.
  • Scaling the approach to ImageNet-scale hierarchies would test whether the observed improvements hold when the tree becomes deeper and wider.

Load-bearing premise

The supplied class hierarchy accurately encodes the semantic distances that matter for distinguishing the classes in the task.

What would settle it

Training the same architectures on the same datasets with an independently verified accurate hierarchy and finding that HACE produces equal or lower accuracy than standard cross-entropy on average would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.06274 by April Chan, Davide D'Ascenzo, Sebastiano Cultrera di Montesano.

Figure 1: Illustration of the two components of HACE applied to a toy animal hierarchy.
Figure 2: Per-class accuracy at the family level of the FGVC Aircraft hierarchy, comparing HACE and ...
Figure 3: Per-class accuracy at the manufacturer level of the FGVC Aircraft hierarchy, comparing ...

Original abstract

Standard cross-entropy is the default classification loss across virtually all of machine learning, yet it treats all misclassifications equally, ignoring the semantic distances that a class hierarchy encodes. We propose Hierarchy-Aware Cross-Entropy (HACE), a drop-in replacement for standard cross-entropy that incorporates a known class hierarchy directly into the loss. HACE combines two components: prediction aggregation, which propagates the model's probability mass upward through the class hierarchy to ensure that parent nodes accumulate the confidence of their children; and ancestral label smoothing, which distributes the ground-truth signal along the path from the true class to the root. We evaluate HACE on CIFAR-100, FGVC Aircraft, and NABirds in two regimes: end-to-end training across six architectures spanning convolutional and attention-based designs, and linear probing on frozen DINOv2-Large features. In end-to-end training, HACE improves accuracy over standard cross-entropy in 15 out of 18 architecture-dataset pairs, with a mean gain of 4.66%. In linear probing on frozen DINOv2-Large features, HACE outperforms all competing methods on all three datasets, with a mean improvement of 2.18% over the next best baseline.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Hierarchy-Aware Cross-Entropy (HACE) as a drop-in replacement for standard cross-entropy that incorporates a known class hierarchy via two components: prediction aggregation (upward propagation of model probabilities to parent nodes) and ancestral label smoothing (distributing ground-truth probability mass along the ancestry path to the root). It reports consistent accuracy gains over vanilla cross-entropy in end-to-end training on CIFAR-100, FGVC Aircraft, and NABirds across six architectures (15/18 pairs, mean +4.66%), and superior results in linear probing on frozen DINOv2-Large features (mean +2.18% over the next-best baseline).

Significance. If the empirical results hold under full scrutiny, HACE provides a simple, hierarchy-aware loss that could be adopted as a default when class taxonomies are available, particularly for fine-grained datasets. The gains in both full training and linear-probing regimes, plus the method's parameter-free nature relative to the hierarchy, represent a modest but practical advance over treating all misclassifications equally.

major comments (2)
  1. [§4] §4 (Experiments), Table 1: the reported mean gain of 4.66% aggregates across 18 pairs without per-pair standard deviations or paired statistical tests; several individual improvements appear small enough that run-to-run variance could alter the 15/18 count.
  2. [§3.2] §3.2 (Ancestral label smoothing): the smoothing distributes mass uniformly along the path to the root, but no ablation is shown on alternative weightings (e.g., exponential decay by depth) or on the sensitivity of final accuracy to the smoothing coefficient; this choice is load-bearing for the claimed generalization benefit.
minor comments (2)
  1. The abstract and §4 claim outperformance over 'all competing methods' in linear probing, but the exact list of baselines and their hyper-parameter tuning protocols should be stated explicitly for reproducibility.
  2. [§2] Notation for the hierarchy (parent/child relations, depth) is introduced in §2 but used without a small illustrative diagram; adding one would clarify the upward aggregation step.
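
Major comment 1 asks whether the 15/18 count would survive run-to-run variance. As a rough calibration only, the count can be put through a one-sided sign test. The sketch assumes the three remaining pairs count as losses and that the 18 pairs are independent, which they are not (they share datasets and architectures), and it does not address the variance within each pair that the referee raises. The function name is hypothetical.

```python
from math import comb


def sign_test_p_value(wins, n):
    """One-sided P(X >= wins) for X ~ Binomial(n, 0.5)."""
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n


# 15 improvements out of 18 architecture-dataset pairs, as reported.
p = sign_test_p_value(15, 18)
print(f"{p:.4f}")  # 0.0038
```

Under those assumptions, a 15/18 split is unlikely to come from a loss that helps and hurts equally often. That says nothing about the size of the mean gain, and multi-seed runs with a paired test would still be needed to settle it.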

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thorough review and the recommendation for minor revision. We address each of the major comments below and outline the changes we will make to the manuscript.

  1. Referee: [§4] §4 (Experiments), Table 1: the reported mean gain of 4.66% aggregates across 18 pairs without per-pair standard deviations or paired statistical tests; several individual improvements appear small enough that run-to-run variance could alter the 15/18 count.

    Authors: We agree that including standard deviations and statistical tests would enhance the rigor of our empirical evaluation. Due to the significant computational resources required to train six different architectures on three datasets, we performed single runs for each experiment. Nevertheless, the improvements are consistent across 15 out of 18 diverse settings, with several gains being substantial (over 5% in multiple cases). In the revised manuscript, we will expand Table 1 to list the per-pair accuracy differences explicitly and add a paragraph in §4 discussing the single-run limitation and the robustness suggested by the breadth of our experiments. We will also note that future work could include multi-seed evaluations for formal statistical testing.

    revision: partial

  2. Referee: [§3.2] §3.2 (Ancestral label smoothing): the smoothing distributes mass uniformly along the path to the root, but no ablation is shown on alternative weightings (e.g., exponential decay by depth) or on the sensitivity of final accuracy to the smoothing coefficient; this choice is load-bearing for the claimed generalization benefit.

    Authors: The uniform distribution was deliberately chosen to ensure the method remains simple, hyperparameter-free, and a true drop-in replacement for cross-entropy. We will revise the text in §3.2 to provide a clearer justification for this design decision, emphasizing its alignment with the goal of incorporating hierarchy without additional complexity. To address the sensitivity concern, we will include in the supplementary material a plot or table showing accuracy as a function of the smoothing coefficient on CIFAR-100 for one architecture. Regarding alternative weightings, we will add a discussion acknowledging that non-uniform schemes (such as depth-based decay) could be explored in future work and may yield further improvements, but that the uniform approach already delivers consistent gains.

    revision: partial
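
The difference between the uniform weighting and the depth-based decay raised in major comment 2 can be sketched as follows. This is a hypothetical illustration: the `decay` parameter and the reading of "decay by depth" as decay with distance from the true class are assumptions, not a scheme taken from the paper.

```python
def path_weights(path_length, decay=None):
    """Target weights for the path nodes, ordered from true class (index 0) to root.

    decay=None gives the uniform scheme; a value in (0, 1) gives weights that
    shrink geometrically with distance from the true class (hypothetical variant).
    """
    if decay is None:
        raw = [1.0] * path_length
    else:
        raw = [decay ** i for i in range(path_length)]
    total = sum(raw)
    return [w / total for w in raw]


uniform = path_weights(3)           # equal weight on class, parent, root
decayed = path_weights(3, decay=0.5)  # most weight on the true class
print(uniform)
print(decayed)
```

With a decay of 0.5 on a three-node path, the unnormalized weights are 1, 0.5, and 0.25, so the true class keeps 4/7 of the target mass instead of 1/3. An ablation of the kind the referee requests would sweep this parameter and report accuracy for each setting.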

Circularity Check

0 steps flagged

No significant circularity in HACE derivation


The paper defines Hierarchy-Aware Cross-Entropy directly as a combination of prediction aggregation (upward probability propagation through the given hierarchy) and ancestral label smoothing (distributing ground-truth along ancestry paths), both constructed from the supplied class hierarchy and standard cross-entropy. No load-bearing equation reduces to a fitted parameter renamed as a prediction, no self-citation chain justifies a uniqueness claim, and no ansatz is smuggled in. The reported accuracy gains (15/18 pairs, mean +4.66%) are presented as empirical outcomes rather than mathematical derivations that collapse to the inputs by construction. The method is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that a known class hierarchy exists and meaningfully captures semantic relationships; no free parameters or invented entities are introduced beyond the standard cross-entropy formulation.

axioms (1)
  • domain assumption: A class hierarchy is available and correctly represents semantic relationships between classes.
    HACE requires this hierarchy to perform prediction aggregation and ancestral label smoothing.
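
Part of this assumption can be checked mechanically before training. The sketch below verifies only that a child-to-parent map forms a rooted tree; it cannot check whether the hierarchy is semantically correct, which is the substantive part of the premise. The function name and the dictionary representation are assumptions made for illustration, and a hierarchy that is a directed acyclic graph rather than a tree would need a different representation.

```python
def is_rooted_tree(parent, root="root"):
    """True if every node in the child -> parent map reaches `root` without a cycle."""
    for start in parent:
        seen = set()
        node = start
        while node != root:
            # A repeated node means a cycle; a missing parent means a dangling branch.
            if node in seen or node not in parent:
                return False
            seen.add(node)
            node = parent[node]
    return True


valid = {"animal": "root", "dog": "animal", "cat": "animal"}
cyclic = {"a": "b", "b": "a"}
dangling = {"dog": "animal"}  # "animal" never connects to the root

print(is_rooted_tree(valid), is_rooted_tree(cyclic), is_rooted_tree(dangling))
# True False False
```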

pith-pipeline@v0.9.0 · 5529 in / 1199 out tokens · 48449 ms · 2026-05-08T13:00:45.963374+00:00 · methodology


A Appendix

A.1 Extension to directed acyclic graphs

The description of HACE in Section 3 assumes that the class hierarchy ...