Beyond Flat Labels: Level-Restricted Contrastive Learning for Hierarchical Fine-Grained Vision Classification

Elizabeth G Campolongo; Hilmar Lapp; Jianyang Gu; Matthew J Thompson; Nathan Jacobs; Net Zhang; Srikumar Sastry; Tanya Berger-Wolf; Wei-Lun Chao; Yu Su

arxiv: 2606.21838 · v1 · pith:IC3RJRNAnew · submitted 2026-06-20 · 💻 cs.CV · cs.LG


Zhiyuan Tao, Srikumar Sastry, Matthew J Thompson, Elizabeth G Campolongo, Net Zhang, Ziheng Zhang, Hilmar Lapp, Yu Su, Tanya Berger-Wolf, Nathan Jacobs, Wei-Lun Chao, Jianyang Gu



keywords: classification, hierarchical, contrastive, taxonomic, across, levels, model, accuracy


Multimodal contrastive learning has enabled zero-shot visual classification by aligning images with textual categories. However, in hierarchically structured label spaces, existing methods often produce predictions that are inconsistent across taxonomic levels. For example, a model may predict a fine-grained category whose parent category contradicts the higher-level label it predicts for the same image. Our analysis traces the issue to false negative labels that arise when contrastive comparison involves multiple taxonomic levels. To address this, we propose restricting contrastive comparisons to categories within the same taxonomic level. In addition, we adopt a group-balanced design so that each taxonomic level receives adequate optimization. As a result, the proposed framework improves both hierarchical consistency and classification accuracy from coarse to fine granularity. We train our model, based on BioCLIP, on TreeOfLife-10M and evaluate it across multiple hierarchical classification benchmarks, where it demonstrates significantly improved hierarchical consistency in both Euclidean and hyperbolic spaces. Notably, on iNaturalist 2021 (iNat21), our method improves average accuracy across levels by 30.47% over the baseline, highlighting its effectiveness for hierarchical zero-shot classification.
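The two ideas named in the abstract, level-restricted comparison and group balancing, can be illustrated with a small sketch. This is not the authors' implementation: the function name `level_restricted_loss`, the image-to-text InfoNCE form, the cosine similarity, and the temperature of 0.07 are all assumptions made for illustration, and only the Euclidean case is shown. Each image is paired with one text label, and negatives for that image are drawn only from text labels at the same taxonomic level. Losses are averaged within each level first and then across levels, so a level with few samples in the batch carries the same weight as a crowded one.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def level_restricted_loss(image_embs, text_embs, levels, temperature=0.07):
    """Image-to-text contrastive loss with same-level negatives only.

    image_embs[i] and text_embs[i] form a positive pair, and levels[i] names
    the taxonomic level (for example "genus" or "species") of that text label.
    The temperature default is an illustrative assumption.
    """
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]

    losses_by_level = {}
    for i, level in enumerate(levels):
        # Restrict the softmax to text labels at the anchor's own level.
        candidates = [j for j, other in enumerate(levels) if other == level]
        logits = [dot(images[i], texts[j]) / temperature for j in candidates]

        # Numerically stable log-sum-exp over the same-level candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))

        positive_logit = dot(images[i], texts[i]) / temperature
        losses_by_level.setdefault(level, []).append(log_denominator - positive_logit)

    # Group balancing: mean within each level, then mean across levels.
    level_means = [sum(v) / len(v) for v in losses_by_level.values()]
    return sum(level_means) / len(level_means)
```

In this sketch a genus-level text label never appears in the denominator for a species-level pair, so a correct parent category cannot be pushed away as a negative. A fuller version would also need to handle repeated labels within a level and the text-to-image direction, which are omitted here.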


