Building a Custom Taxonomy of AI Skills and Tasks from the Ground Up with Job Postings

Peter Norlander; Stephen Meisenbacher

arxiv: 2605.21029 · v1 · pith:NKABGPL6new · submitted 2026-05-20 · 💻 cs.CL

Pith reviewed 2026-05-21 05:20 UTC · model grok-4.3

classification 💻 cs.CL

keywords taxonomy construction · AI skills · job postings · LLM applications · data filtering · hierarchical taxonomy · workplace skills analysis

The pith

Filtering job postings data creates clearer AI skills taxonomies than using the full unfiltered corpus.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper examines design choices for building taxonomies of AI skills from job postings using LLMs. It introduces TaxonomyBuilder to test different configurations of data inclusion for creating custom and hierarchical taxonomies. The central result is that filtering the input data leads to taxonomies with superior domain-specific coverage compared to applying clustering and LLM labeling directly to unfiltered large datasets. A sympathetic reader would care because it shows how to efficiently map complex, growing domains like AI workplace skills without being overwhelmed by data volume.

Core claim

Utilizing LLMs for automated taxonomy construction presents an opportunity for mapping complex domains efficiently. Using two large-scale job postings corpora, the authors investigate how to best leverage data for optimal taxonomy construction in the case of AI skills. They propose TaxonomyBuilder as a blueprint for systematic study and evaluate configurations of custom, data-informed, and hierarchical taxonomies, demonstrating that filtering inputs provides better domain-specific coverage than unfiltered inputs to clustering and LLM-enhanced tools.
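To make the filtered-versus-unfiltered contrast concrete, here is a minimal Python sketch of one way a corpus could be narrowed before taxonomy construction. It is not the authors' implementation: the keyword list `AI_KEYWORDS`, the functions `mine_contexts`, `keyword_score`, and `percentile_filter`, the 60-character window, and the 75th-percentile cutoff are all assumptions chosen for illustration. The toy keyword count stands in for whatever relevance scoring the paper actually uses.

```python
import math
import re

# Hypothetical keyword list; the paper's actual keywords are not given on this page.
AI_KEYWORDS = ["machine learning", "neural network", "llm", "computer vision"]


def _pattern(keyword):
    """Whole-word, case-insensitive pattern for one keyword."""
    return re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)


def mine_contexts(posting, keywords=AI_KEYWORDS, window=60):
    """Return the text windows surrounding each keyword hit in a posting."""
    contexts = []
    for kw in keywords:
        for match in _pattern(kw).finditer(posting):
            start = max(0, match.start() - window)
            end = min(len(posting), match.end() + window)
            contexts.append(posting[start:end])
    return contexts


def keyword_score(text, keywords=AI_KEYWORDS):
    """Toy relevance score: the number of distinct keywords present in the text."""
    return sum(1 for kw in keywords if _pattern(kw).search(text))


def percentile_filter(items, scores, pct=75.0):
    """Keep items scoring at or above the pct-th percentile (nearest-rank method)."""
    if not items:
        return []
    ranked = sorted(scores)
    index = max(0, math.ceil(pct / 100 * len(ranked)) - 1)
    threshold = ranked[index]
    return [item for item, score in zip(items, scores) if score >= threshold]


postings = [
    "We need machine learning and LLM experience with neural network tuning.",
    "Cashier for retail store.",
    "Computer vision engineer.",
    "Forklift driver wanted.",
]
scores = [keyword_score(p) for p in postings]
filtered = percentile_filter(postings, scores, pct=75.0)
```

With these four toy postings, only the two that mention an AI keyword pass the cutoff. The unfiltered baseline would pass all four downstream.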

What carries the argument

TaxonomyBuilder, a proposed blueprint for systematically evaluating configurations of custom, data-informed, and hierarchical taxonomies built from job postings data.

If this is right

Taxonomies for AI skills can achieve better coverage by selectively filtering job postings rather than using all available data.
Data-informed approaches outperform standard clustering and LLM hierarchical labeling when inputs are filtered for relevance.
Systematic evaluation of data inclusion decisions improves the quality of automated taxonomies in high-volume domains.
The method can extend to systematizing skills in other rapidly growing fields using similar corpora.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Organizations building internal skill databases might reduce noise by pre-filtering sources before taxonomy generation.
Similar filtering principles could apply to other LLM uses on large text corpora to improve output specificity.
Future work could test TaxonomyBuilder on different domains to confirm whether less data generally provides more clarity.
Job platforms could integrate such filtered taxonomies for better matching of AI roles and candidates.

Load-bearing premise

The job postings corpora used are representative of actual AI skills and tasks in the workplace and have minimal bias or noise that would affect the taxonomy.

What would settle it

An independent evaluation where unfiltered data produces taxonomies that match or exceed the coverage of filtered ones when compared to a gold-standard set of AI skills derived from expert review or additional sources.

Figures

Figures reproduced from arXiv: 2605.21029 by Peter Norlander, Stephen Meisenbacher.

**Figure 1.** The TAXONOMYBUILDER method. In the top lane, we detail the setup method we follow as a precursor to taxonomy construction, which consists of keyword-based context mining and class-based scoring. The TAXONOMYBUILDER method, in turn, consists of two primary stages (depicted in the center and bottom lanes): (1) the construction of the foundation (leaf) level, followed by iterative vertical construction of fu…
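The two stages named in the Figure 1 caption (a leaf level, then iterative construction upward) can be sketched as below. This is an illustration under stated assumptions, not the paper's method. Token-overlap (Jaccard) grouping stands in for whatever clustering the authors use, and `name_group` is a placeholder for an LLM labeling call. All function names and the 0.3 threshold are hypothetical.

```python
def jaccard(a, b):
    """Token-level Jaccard similarity between two labels."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def group_level(labels, threshold=0.3):
    """Greedy grouping: attach each label to the first group whose seed is similar enough."""
    groups = []
    for label in labels:
        for group in groups:
            if jaccard(label, group[0]) >= threshold:
                group.append(label)
                break
        else:
            groups.append([label])
    return groups


def name_group(members):
    """Placeholder for an LLM labeling step: tokens shared by all members, else the first member."""
    shared = set(members[0].lower().split())
    for member in members[1:]:
        shared &= set(member.lower().split())
    ordered = [tok for tok in members[0].lower().split() if tok in shared]
    return " ".join(ordered) if ordered else members[0].lower()


def build_hierarchy(leaves, max_levels=3, threshold=0.3):
    """Build parent levels bottom-up; each level maps a parent label to its children."""
    levels = []
    current = list(leaves)
    for _ in range(max_levels):
        groups = group_level(current, threshold)
        if len(groups) == len(current):  # nothing merged, so stop climbing
            break
        level = {}
        for group in groups:
            level.setdefault(name_group(group), []).extend(group)
        levels.append(level)
        current = list(level)
        if len(current) <= 1:
            break
    return levels


leaves = [
    "deep learning models",
    "deep learning research",
    "prompt engineering",
    "prompt design",
]
levels = build_hierarchy(leaves)
```

The loop stops when a pass merges nothing. That is one simple stopping rule; the caption is truncated before it states the paper's own.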

**Figure 2.** Abridged example of the taxonomy structure produced by …

Original abstract

Utilizing LLMs for automated taxonomy construction presents a clear opportunity for the comprehensive, yet efficient mapping of potentially complex domains. When contending with high volumes of rapidly growing corpora, however, it becomes unclear how to best leverage such data for optimal taxonomy construction. Taking the case of systematizing AI skills in the workplace, we use two large-scale job postings corpora to investigate key design decisions for the inclusion (or exclusion) of data points for taxonomy construction. We propose TaxonomyBuilder as a blueprint for our systematic study, with which we evaluate various configurations of custom, data-informed, and hierarchical taxonomies. We demonstrate that less data can provide more clarity: filtering inputs to TaxonomyBuilder provides better domain-specific coverage than offering unfiltered inputs to clustering and LLM-enhanced hierarchical taxonomy labeling tools.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Filtering job postings before LLM taxonomy building gives better AI skill coverage than unfiltered clustering, but the coverage metric itself lacks independent validation.

Hi colleague, the main takeaway is that this paper finds filtering inputs to their TaxonomyBuilder produces taxonomies with stronger domain-specific coverage for AI skills and tasks than feeding raw job postings into clustering plus LLM labeling. They run the comparison on two large job posting corpora and test several configurations for building custom hierarchical taxonomies. The result that less data can yield clearer output is the practical hook.

What the work does reasonably well is stay close to real employment data rather than abstract skill lists. That choice makes the taxonomies more relevant for workforce or education uses, and the systematic look at data inclusion decisions is a straightforward way to handle noisy, high-volume text. The proposal of TaxonomyBuilder as a reusable blueprint is also a concrete step that others could try.

The soft spots sit mainly in the evaluation. The claim of better coverage is stated but the abstract and available details do not show a pre-registered quantitative metric, held-out test postings, or inter-annotator agreement on expert judgments. If the assessment leans on qualitative review or the same data used for construction, the advantage could partly reflect the filtering heuristic rather than genuine improvement in capturing workplace skills. The assumption that the two corpora are representative and low-bias also sits untested in the summary.

This paper is for readers who build or apply skill taxonomies in applied settings like HR analytics or AI education planning. Someone looking for a data-driven method to organize job text would find the configuration tests useful even if the numbers are light.

It deserves peer review. The empirical comparison is worth referee time, provided the authors add clearer validation steps and address how coverage was scored.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes TaxonomyBuilder as a blueprint for constructing custom, data-informed, hierarchical taxonomies of AI skills and tasks from job postings. Using two large-scale job postings corpora, the authors investigate design decisions around data inclusion/exclusion and claim that filtering inputs to TaxonomyBuilder provides better domain-specific coverage than offering unfiltered inputs to clustering and LLM-enhanced hierarchical taxonomy labeling tools.

Significance. If the central empirical comparison holds under independent validation, the result would be significant for practical taxonomy construction in high-volume, rapidly evolving domains such as AI workplace skills, by demonstrating that targeted data filtering can outperform unfiltered LLM-assisted clustering pipelines.

major comments (2)

[Abstract] Abstract: the main finding is stated without any quantitative metrics, validation procedures, or details on how 'domain-specific coverage' was measured or compared, preventing assessment of the claim.
[Evaluation] Evaluation section (or equivalent): the demonstration that filtered inputs outperform unfiltered clustering + LLM labeling requires a reproducible, pre-registered metric for coverage (e.g., held-out test set of postings, inter-annotator agreement on expert ratings, or disjoint validation corpus). Without this, the result risks being driven by alignment between the filtering heuristic and the chosen assessment rather than genuine improvement.
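Inter-annotator agreement, one of the validation options the referee lists, is commonly reported as Cohen's kappa for two raters. The sketch below shows the standard two-rater calculation in plain Python. The function name `cohens_kappa` and the example labels are illustrative, and nothing here reflects ratings actually collected for the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: share of items where the raters chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:  # both raters used a single identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical judgments: does the taxonomy cover this skill? ("y" or "n")
rater_1 = ["y", "y", "n", "n"]
rater_2 = ["y", "n", "n", "n"]
kappa = cohens_kappa(rater_1, rater_2)
```

Kappa corrects raw agreement for the agreement expected by chance, so it gives a more conservative picture than percent agreement when one label dominates.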

minor comments (1)

[Methods] Clarify the precise operational definition of 'domain-specific coverage' and the two corpora used (size, source, preprocessing) in the methods section for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We have carefully reviewed the major comments and provide point-by-point responses below, indicating where revisions will be made to improve clarity and rigor.

Referee: [Abstract] Abstract: the main finding is stated without any quantitative metrics, validation procedures, or details on how 'domain-specific coverage' was measured or compared, preventing assessment of the claim.

Authors: We agree that the abstract would benefit from greater specificity. In the revised version, we will update the abstract to include key quantitative metrics (such as coverage percentages on validation postings for filtered versus unfiltered pipelines) and a concise description of how domain-specific coverage was operationalized and compared. This will enable readers to assess the central claim more directly.

revision: yes

Referee: [Evaluation] Evaluation section (or equivalent): the demonstration that filtered inputs outperform unfiltered clustering + LLM labeling requires a reproducible, pre-registered metric for coverage (e.g., held-out test set of postings, inter-annotator agreement on expert ratings, or disjoint validation corpus). Without this, the result risks being driven by alignment between the filtering heuristic and the chosen assessment rather than genuine improvement.

Authors: We acknowledge the value of an explicitly reproducible metric. Our evaluation already relies on a disjoint validation corpus of job postings excluded from taxonomy construction, measuring the taxonomy's coverage of AI skills in these held-out postings. We will revise the Evaluation section to describe this procedure in greater detail, including any quantitative thresholds or agreement measures employed, to support independent reproduction and mitigate concerns about heuristic alignment. While the study was not pre-registered, the expanded description will address the core reproducibility issue.

revision: partial
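A held-out coverage check of the kind described in this response could look like the following sketch. The matching rule (exact match after lowercasing and whitespace normalization), the every-fifth-posting split, and the names `normalize`, `coverage`, and `split_holdout` are assumptions for illustration. The page does not state how the authors match skills to taxonomy entries.

```python
def normalize(term):
    """Lowercase and collapse whitespace so trivially different spellings match."""
    return " ".join(term.lower().split())


def split_holdout(postings, holdout_every=5):
    """Deterministic split: every k-th posting is held out of taxonomy construction."""
    build = [p for i, p in enumerate(postings) if i % holdout_every != 0]
    held_out = [p for i, p in enumerate(postings) if i % holdout_every == 0]
    return build, held_out


def coverage(taxonomy_labels, heldout_skills):
    """Fraction of held-out skill mentions that match some taxonomy label."""
    if not heldout_skills:
        return 0.0
    labels = {normalize(label) for label in taxonomy_labels}
    hits = sum(1 for skill in heldout_skills if normalize(skill) in labels)
    return hits / len(heldout_skills)


# Hypothetical taxonomy labels and skill mentions extracted from held-out postings.
taxonomy = ["Prompt Engineering", "model  evaluation"]
heldout_skills = ["prompt engineering", "Model Evaluation", "welding", "data labeling"]
score = coverage(taxonomy, heldout_skills)

build_ids, heldout_ids = split_holdout(list(range(10)), holdout_every=5)
```

Running the same `coverage` call on taxonomies built from filtered and unfiltered inputs, against one fixed held-out set, would give the paired numbers the referee asks for.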

Circularity Check

0 steps flagged

No circularity: empirical comparison of taxonomy construction methods

The paper conducts an empirical study comparing TaxonomyBuilder configurations on two job postings corpora, evaluating filtered versus unfiltered inputs for domain-specific coverage in AI skills taxonomies. No equations, derivations, or self-definitional reductions are present. The central claim rests on direct comparison of outputs from data-driven processes rather than any fitted parameter or self-citation chain that collapses back to the inputs by construction. The work is self-contained as a standard empirical evaluation of design choices in automated taxonomy building.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review based solely on abstract; full methods, metrics, and parameter choices are not available. The work rests on the representativeness of job postings and the reliability of LLM labeling.

axioms (1)

domain assumption: Job postings corpora accurately reflect current AI skills and tasks in the workplace.
The study uses these corpora as the primary data source for taxonomy construction.

pith-pipeline@v0.9.0 · 5657 in / 1185 out tokens · 32499 ms · 2026-05-21T05:20:15.601307+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "We propose TAXONOMYBUILDER as a blueprint... filtering inputs to TAXONOMYBUILDER provides better domain-specific coverage than offering unfiltered inputs to clustering and LLM-enhanced hierarchical taxonomy labeling tools."

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "less data can provide more clarity: filtering inputs... percentile filtering"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
