
arxiv: 2505.05583 · v1 · pith:E4JOD2HCnew · submitted 2025-05-08 · 💻 cs.CL

KG-HTC: Integrating Knowledge Graphs into LLMs for Effective Zero-shot Hierarchical Text Classification

Pith reviewed 2026-05-22 15:32 UTC · model grok-4.3

classification 💻 cs.CL
keywords: hierarchical text classification, knowledge graphs, zero-shot learning, large language models, retrieval-augmented generation, taxonomy

The pith

Knowledge graphs retrieved via RAG supply structured context that lets LLMs classify documents into deep taxonomies in a strict zero-shot setting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces KG-HTC, which performs hierarchical text classification with LLMs without any training examples. It retrieves relevant subgraphs from knowledge graphs to give the model semantic information about labels at different levels. Experiments on three datasets show that it beats the baselines, with larger gains at deeper hierarchy levels. This matters because real-world classification often lacks annotated data and involves large, imbalanced label sets.

Core claim

KG-HTC integrates knowledge graphs into LLMs by retrieving subgraphs related to the input text using a Retrieval-Augmented Generation approach. This provides structured semantic context that enhances the LLM's ability to understand label semantics at various hierarchy levels. The method improves classification accuracy in strict zero-shot settings on the WoS, DBpedia, and Amazon datasets, particularly at deeper levels of the hierarchy.

What carries the argument

RAG-based retrieval of relevant subgraphs from knowledge graphs to augment LLM prompts with structured semantic context for label understanding.
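The retrieve-then-prompt idea can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the toy taxonomy `PARENT`, the bag-of-words `embed` function (a stand-in for a neural encoder), and the helper names `retrieve_labels`, `path_to_root`, and `build_context` are all assumptions made for the example.

```python
import math
from collections import Counter

# Hypothetical toy taxonomy stored as child -> parent edges.
# None marks a top-level label. Not taken from the paper's datasets.
PARENT = {
    "electronics": None,
    "headphones": "electronics",
    "wireless earbuds": "headphones",
    "pet supplies": None,
    "dog food": "pet supplies",
}

def embed(text):
    """Stand-in embedding: lowercase bag-of-words counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve_labels(document, k=2):
    """Return up to k labels with the highest non-zero similarity to the document."""
    doc_vec = embed(document)
    scored = [(cosine(doc_vec, embed(label)), label) for label in PARENT]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [label for score, label in scored[:k] if score > 0]

def path_to_root(label):
    """Walk parent edges upward to recover the label's branch of the hierarchy."""
    path = [label]
    while PARENT[path[-1]] is not None:
        path.append(PARENT[path[-1]])
    return path

def build_context(document, k=2):
    """Format the retrieved branches as text that could be prepended to an LLM prompt."""
    lines = [" -> ".join(path_to_root(label)) for label in retrieve_labels(document, k)]
    return "Candidate label paths:\n" + "\n".join(lines)
```

Calling `build_context("these wireless earbuds replaced my old headphones")` yields two candidate branches, each running from a retrieved label up to its top-level category. The point of the design is that the LLM sees a handful of hierarchy-consistent paths rather than the full label space.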

If this is right

  • LLMs can handle larger label spaces in HTC without supervision or fine-tuning.
  • Performance improves especially for labels at deeper levels of the taxonomy.
  • The approach mitigates challenges from long-tail label distributions.
  • Structured knowledge integration addresses real-world HTC applications lacking annotated data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar retrieval augmentation could extend to other zero-shot structured prediction tasks such as relation extraction.
  • Domain-specific knowledge graphs might yield further gains when the general-purpose graph coverage is sparse.
  • The method suggests a broader pattern where external structured data reduces the need for task-specific labeled examples in NLP.

Load-bearing premise

Subgraphs retrieved from general-purpose knowledge graphs via RAG supply sufficiently relevant and structured semantic context to improve an LLM's understanding of label semantics across hierarchy levels without any task-specific fine-tuning or labeled examples.

What would settle it

Running KG-HTC on a dataset where the knowledge graph has no or minimal coverage for the taxonomy labels and measuring whether it still outperforms plain LLM baselines.
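One way to make "no or minimal coverage" measurable is to compute the share of taxonomy labels that have a matching entity in the knowledge graph before running the comparison. The sketch below assumes exact matching after simple string normalization; `normalize` and `label_coverage` are hypothetical helper names, and a real study would likely need entity linking rather than string equality.

```python
def normalize(name):
    """Lowercase, turn underscores into spaces, and collapse whitespace."""
    return " ".join(name.lower().replace("_", " ").split())

def label_coverage(taxonomy_labels, kg_entities):
    """Return (fraction of labels found in the KG, list of labels that are missing)."""
    kg = {normalize(entity) for entity in kg_entities}
    missing = [label for label in taxonomy_labels if normalize(label) not in kg]
    if not taxonomy_labels:
        return 0.0, missing
    covered = len(taxonomy_labels) - len(missing)
    return covered / len(taxonomy_labels), missing

# Toy data, invented for illustration only.
ratio, missing = label_coverage(
    ["Dog Food", "Cat_Toys", "Aquarium Pumps", "Bird Cages"],
    ["dog food", "cat toys", "fish"],
)
```

A low ratio would identify the datasets on which the proposed stress test is informative.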

Figures

Figures reproduced from arXiv: 2505.05583 by Afshin Khadangi, Christophe Zgrzendek, Igor Tchappi, Johannes Sedlmeir, Qianbo Zang.

Figure 1: An example of HTC from the Amazon Product Review dataset.
Figure 2: The overview pipeline of KG-HTC.
Figure 3: Visualization of the knowledge graph (tree) constructed from the multi-level taxonomy in the Amazon Product Review dataset. The red nodes represent labels in the first hierarchical level. The green nodes denote subcategories (second level) interconnected through parent-child relationships. The yellow nodes correspond to the fine-grained leaf categories in the third level.
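A tree like the one in Figure 3 can be assembled directly from per-document label paths. The following is a sketch under the assumption that each path lists labels from the first level down to the leaf and that label names are unique across levels; `build_taxonomy_tree` and the example paths are illustrative, not taken from the paper's code.

```python
from collections import defaultdict

def build_taxonomy_tree(label_paths):
    """Return (parent -> sorted children, label -> 1-based hierarchy level)."""
    children = defaultdict(set)
    level = {}
    for path in label_paths:
        for depth, label in enumerate(path, start=1):
            level[label] = depth
            if depth > 1:
                # Link each label to the label one level above it in the same path.
                children[path[depth - 2]].add(label)
    return {parent: sorted(kids) for parent, kids in children.items()}, level

# Invented three-level paths in the style of a product taxonomy.
paths = [
    ("pet supplies", "dogs", "dog food"),
    ("pet supplies", "dogs", "dog toys"),
    ("pet supplies", "cats", "cat litter"),
]
tree, level = build_taxonomy_tree(paths)
```

The `level` mapping corresponds to the node colors in the figure: level 1 for red, level 2 for green, level 3 for yellow.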
Figure 4: As the taxonomy deepens, KG-HTC exhibits a slower performance degradation on the WoS and Amazon datasets.
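The degradation pattern in Figure 4 can be quantified by scoring each hierarchy level separately and taking the difference between consecutive levels. The sketch below uses plain accuracy as an assumed metric (the figure's exact metric is not stated on this page), and it assumes every path has the same depth; `per_level_accuracy` and `degradation` are hypothetical names.

```python
def per_level_accuracy(gold_paths, predicted_paths):
    """Accuracy at each hierarchy level, assuming all paths share the same depth."""
    depth = len(gold_paths[0])
    scores = []
    for lvl in range(depth):
        hits = sum(g[lvl] == p[lvl] for g, p in zip(gold_paths, predicted_paths))
        scores.append(hits / len(gold_paths))
    return scores

def degradation(scores):
    """Drop in score between each level and the next deeper one."""
    return [round(a - b, 4) for a, b in zip(scores, scores[1:])]

# Toy gold and predicted label paths, invented for illustration.
gold = [("a", "b", "c"), ("a", "d", "e"), ("f", "g", "h"), ("f", "g", "i")]
pred = [("a", "b", "c"), ("a", "b", "c"), ("f", "g", "i"), ("f", "g", "i")]
scores = per_level_accuracy(gold, pred)
drops = degradation(scores)
```

Comparing the `drops` of two methods level by level gives a direct reading of which one degrades more slowly.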
Original abstract

Hierarchical Text Classification (HTC) involves assigning documents to labels organized within a taxonomy. Most previous research on HTC has focused on supervised methods. However, in real-world scenarios, employing supervised HTC can be challenging due to a lack of annotated data. Moreover, HTC often faces issues with large label spaces and long-tail distributions. In this work, we present Knowledge Graphs for zero-shot Hierarchical Text Classification (KG-HTC), which aims to address these challenges of HTC in applications by integrating knowledge graphs with Large Language Models (LLMs) to provide structured semantic context during classification. Our method retrieves relevant subgraphs from knowledge graphs related to the input text using a Retrieval-Augmented Generation (RAG) approach. Our KG-HTC can enhance LLMs to understand label semantics at various hierarchy levels. We evaluate KG-HTC on three open-source HTC datasets: WoS, DBpedia, and Amazon. Our experimental results show that KG-HTC significantly outperforms three baselines in the strict zero-shot setting, particularly achieving substantial improvements at deeper levels of the hierarchy. This evaluation demonstrates the effectiveness of incorporating structured knowledge into LLMs to address HTC's challenges in large label spaces and long-tailed label distributions. Our code is available at: https://github.com/QianboZang/KG-HTC.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes KG-HTC, a zero-shot hierarchical text classification method that retrieves subgraphs from general-purpose knowledge graphs via RAG and injects them as structured semantic context into LLMs to improve label understanding across hierarchy levels. It reports significant outperformance over three baselines on the WoS, DBpedia, and Amazon datasets, with larger gains at deeper hierarchy levels, and releases code for reproducibility.

Significance. If the reported gains prove robust under strict zero-shot conditions with no information leakage, the approach could meaningfully advance zero-shot HTC by addressing large label spaces and long-tail distributions through external structured knowledge. The public code release at https://github.com/QianboZang/KG-HTC is a clear strength for reproducibility and follow-up work.

major comments (1)
  1. [Abstract and Experiments] Abstract and Experiments section: The headline claim of 'strict zero-shot setting' and substantial gains at deeper hierarchy levels rests on the assumption that RAG-retrieved subgraphs supply only external semantic context. For the DBpedia dataset, if the chosen KG (e.g., a DBpedia-style graph) contains the same category hierarchy or instance-level relations used as ground-truth labels, retrieval can surface direct label definitions or parent-child links, converting the method into implicit supervised lookup rather than zero-shot reasoning. No exclusion filters or disjoint-KG protocol is described to rule out this overlap.
minor comments (2)
  1. [Abstract] The abstract states outperformance on three datasets but provides no details on exact baselines, metrics, statistical significance tests, or error analysis; these should be summarized in the abstract or early in the experiments section for clarity.
  2. [Method] Notation for hierarchy levels and subgraph retrieval should be defined more explicitly (e.g., how depth is measured and how relevance scoring in RAG is performed) to aid replication.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on the zero-shot integrity of our evaluation. We address the concern point by point below.

Point-by-point responses
  1. Referee: [Abstract and Experiments] Abstract and Experiments section: The headline claim of 'strict zero-shot setting' and substantial gains at deeper hierarchy levels rests on the assumption that RAG-retrieved subgraphs supply only external semantic context. For the DBpedia dataset, if the chosen KG (e.g., a DBpedia-style graph) contains the same category hierarchy or instance-level relations used as ground-truth labels, retrieval can surface direct label definitions or parent-child links, converting the method into implicit supervised lookup rather than zero-shot reasoning. No exclusion filters or disjoint-KG protocol is described to rule out this overlap.

    Authors: We appreciate this observation, which correctly identifies a point that requires greater transparency. Our experiments used a general-purpose knowledge graph (Wikidata) whose entity and relation structure is not identical to the Wikipedia-derived category taxonomies in the DBpedia dataset or the label hierarchies in WoS and Amazon. Retrieval operates via embedding similarity between the input document and KG entities rather than direct lookup of label strings or parent-child edges. Nevertheless, the manuscript does not explicitly document exclusion filters or a formal disjoint-KG protocol. We will therefore revise the Experiments section to specify the exact KG source, the embedding-based retrieval procedure, and any post-retrieval checks confirming that retrieved subgraphs do not contain the ground-truth label definitions or hierarchy edges. This addition will be made in the next version. revision: yes
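A post-retrieval check of the kind discussed in this exchange could look like the following. It is one possible operationalization, not a protocol described in the paper: it assumes retrieved context is a list of `(head, relation, tail)` triples and that the gold labels are given as a root-to-leaf path, and `find_leaks` is a hypothetical helper.

```python
def find_leaks(retrieved_triples, gold_path):
    """Flag triples that mention a gold label or reproduce a gold parent-child edge."""
    gold_labels = {label.lower() for label in gold_path}
    # Consecutive labels in the path form the gold hierarchy edges (direction ignored).
    gold_edges = {frozenset((p.lower(), c.lower())) for p, c in zip(gold_path, gold_path[1:])}
    leaks = []
    for head, relation, tail in retrieved_triples:
        h, t = head.lower(), tail.lower()
        if frozenset((h, t)) in gold_edges:
            leaks.append((head, relation, tail, "gold hierarchy edge"))
        elif h in gold_labels or t in gold_labels:
            leaks.append((head, relation, tail, "gold label mention"))
    return leaks

# Toy triples and gold path, invented for illustration.
triples = [
    ("Dog Food", "subclass of", "Dogs"),
    ("Kibble", "instance of", "Dog Food"),
    ("Leash", "used for", "Walking"),
]
leaks = find_leaks(triples, ["Pet Supplies", "Dogs", "Dog Food"])
```

Reporting the fraction of test documents with at least one flagged triple, per dataset, would show how much of the retrieved context overlaps with the ground truth.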

Circularity Check

0 steps flagged

No circularity; method uses external KGs and public benchmarks without self-referential reduction

Full rationale

The paper describes a RAG-based retrieval of subgraphs from general-purpose knowledge graphs to augment LLMs for zero-shot HTC. No equations, parameters, or derivations are presented that reduce by construction to fitted inputs or self-citations. The central claim relies on external structured knowledge and standard evaluation on WoS, DBpedia, and Amazon datasets, remaining self-contained against external benchmarks with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that knowledge graphs contain label-relevant facts that RAG can surface usefully for LLMs; no free parameters or invented entities are mentioned in the abstract.

axioms (1)
  • Domain assumption: Knowledge graphs contain structured semantic information relevant to taxonomy labels that can be retrieved to augment LLM understanding in zero-shot settings.
    Invoked as the core mechanism enabling performance gains without labeled data.




Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Taxon: Hierarchical Tax Code Prediction with Semantically Aligned LLM Expert Guidance

    cs.LG 2026-01 unverdicted novelty 6.0

    Taxon uses a mixture-of-experts architecture and LLM-derived semantic verification to achieve state-of-the-art performance on hierarchical tax code prediction and has been deployed in production at Alibaba.

  2. SC-Taxo: Hierarchical Taxonomy Generation under Semantic Consistency Constraints using Large Language Models

    cs.CL 2026-05 unverdicted novelty 4.0

    SC-Taxo adds bidirectional heading generation and peer semantic dependency modeling to LLMs to produce taxonomies with improved hierarchy alignment and heading quality on scientific literature benchmarks.

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · cited by 2 Pith papers · 6 internal anchors

  1. S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. Dbpedia: A nucleus for a web of open data. In international semantic web conference, pages 722–735. Springer, 2007

  2. L. Bongiovanni, L. Bruno, F. Dominici, and G. Rizzo. Zero-shot taxonomy mapping for document classification. In Proceedings of the 38th ACM/SIGAPP Symposium on Applied Computing, pages 911–918, 2023

  3. T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

  4. S. Chatterjee, A. Maheshwari, G. Ramakrishnan, and S. N. Jagarlapudi. Joint learning of hyperbolic label embeddings for hierarchical multi-label classification. In P. Merlo, J. Tiedemann, and R. Tsarfaty, editors, Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 2829–2841, On...

  5. D. Chen, Z. Yu, and S. R. Bowman. Clean or annotate: How to spend a limited data collection budget. In C. Cherry, A. Fan, G. Foster, G. R. Haffari, S. Khadivi, N. V. Peng, X. Ren, E. Shareghi, and S. Swayamdipta, editors, Proceedings of the Third Workshop on Deep Learning for Low-Resource Natural Language Processing, pages 152–168, Hybrid, July 2022. A...

  6. H. Chen, Q. Ma, Z. Lin, and J. Yan. Hierarchy-aware label semantics matching network for hierarchical text classification. In C. Zong, F. Xia, W. Li, and R. Navigli, editors, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long P...

  7. J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), pages 4171–4186, 2019

  8. D. Edge, H. Trinh, N. Cheng, J. Bradley, A. Chao, A. Mody, S. Truitt, D. Metropolitansky, R. O. Ness, and J. Larson. From local to global: A graph rag approach to query-focused summarization. arXiv preprint arXiv:2404.16130, 2024

  9. W. Fan, Y. Ding, L. Ning, S. Wang, H. Li, D. Yin, T.-S. Chua, and Q. Li. A survey on rag meeting llms: Towards retrieval-augmented large language models. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 6491–6501, 2024

  10. Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, H. Wang, and H. Wang. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997, 2, 2023

  11. K. Halder, A. Akbik, J. Krapac, and R. Vollgraf. Task-aware representation of sentences for generic text classification. In D. Scott, N. Bel, and C. Zong, editors, Proceedings of the 28th International Conference on Computational Linguistics, pages 3202–3213, Barcelona, Spain (Online), Dec. 2020. International Committee on Computational Linguistics. ...

  12. D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding. In International Conference on Learning Representations, 2021

  13. W. Huang, E. Chen, Q. Liu, Y. Chen, Z. Huang, Y. Liu, Z. Zhao, D. Zhang, and S. Wang. Hierarchical multi-label text classification: An attention-based recurrent network approach. In Proceedings of the 28th ACM international conference on information and knowledge management, pages 1051–1060, 2019

  14. V. Karpukhin, B. Oguz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, and W.-t. Yih. Dense passage retrieval for open-domain question answering. In B. Webber, T. Cohn, Y. He, and Y. Liu, editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769–6781, Online, Nov. 2020. Association for Compu...

  15. Y. Kashnitsky. Hierarchical text classification, 2020. URL https://www.kaggle.com/dsv/1054619

  16. K. Kowsari, D. E. Brown, M. Heidarysafa, K. Jafari Meimandi, M. S. Gerber, and L. E. Barnes. Hdltex: Hierarchical deep learning for text classification. In 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), pages 364–371, 2017. doi: 10.1109/ICMLA.2017.0-134

  17. P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in neural information processing systems, 33:9459–9474, 2020

  18. T. Li, G. Zhang, Q. D. Do, X. Yue, and W. Chen. Long-context llms struggle with long in-context learning. arXiv preprint arXiv:2404.02060, 2024

  19. Z. Li, Q. Zang, D. Ma, J. Guo, T. Zheng, M. Liu, X. Niu, Y. Wang, J. Yang, J. Liu, et al. Autokaggle: A multi-agent framework for autonomous data science competitions. arXiv preprint arXiv:2410.20424, 2024

  20. A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024

  21. Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019

  22. Y. Liu, K. Zhang, Z. Huang, K. Wang, Y. Zhang, Q. Liu, and E. Chen. Enhancing hierarchical text classification through knowledge graph integration. In A. Rogers, J. Boyd-Graber, and N. Okazaki, editors, Findings of the Association for Computational Linguistics: ACL 2023, pages 5797–5810, Toronto, Canada, July 2023. Association for Computational Lin...

  23. Z. Luo, X. Song, H. Huang, J. Lian, C. Zhang, J. Jiang, and X. Xie. Graphinstruct: Empowering large language models with graph understanding and reasoning capability. arXiv preprint arXiv:2403.04483, 2024

  24. Y. Meng, J. Shen, C. Zhang, and J. Han. Weakly-supervised hierarchical text classification. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pages 6826–6833, 2019

  25. L. Paletto, V. Basile, and R. Esposito. Label augmentation for zero-shot hierarchical text classification. In L.-W. Ku, A. Martins, and V. Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7697–7706, Bangkok, Thailand, Aug. 2024. Association for Computational ...

  26. D. Patel, P. Dangati, J.-Y. Lee, M. Boratko, and A. McCallum. Modeling label space interactions in multi-label classification using box embeddings. ICLR 2022 Poster, 2022

  27. B. Peng, Y. Zhu, Y. Liu, X. Bo, H. Shi, C. Hong, Y. Zhang, and S. Tang. Graph retrieval-augmented generation: A survey. arXiv preprint arXiv:2408.08921, 2024

  28. A. Radford, K. Narasimhan, T. Salimans, I. Sutskever, et al. Improving language understanding by generative pre-training. OpenAI, 2018

  29. A. Simonofski, J. Fink, and C. Burnay. Supporting policy-making with social media and e-participation platforms data: A policy analytics framework. Government Information Quarterly, 38(3):101590, 2021

  30. A. Sun and E.-P. Lim. Hierarchical text classification and evaluation. In Proceedings 2001 IEEE International Conference on Data Mining, pages 521–528, 2001. doi: 10.1109/ICDM.2001.989560

  31. H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

  32. A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024

  33. Q. Zhang, J. Dong, H. Chen, D. Zha, Z. Yu, and X. Huang. Knowgpt: Knowledge graph based prompting for large language models. Advances in Neural Information Processing Systems, 37:6052–6080, 2024

  34. Y. Zhang, R. Yang, X. Xu, R. Li, J. Xiao, J. Shen, and J. Han. Teleclass: Taxonomy enrichment and llm-enhanced hierarchical text classification with minimal supervision. arXiv preprint arXiv:2403.00165, 2024

  35. K. Zhu, Q. Zang, S. Jia, S. Wu, F. Fang, Y. Li, S. Gavin, T. Zheng, J. Guo, B. Li, et al. Lime: Less is more for mllm evaluation. arXiv preprint arXiv:2409.06851, 2024