Building a Custom Taxonomy of AI Skills and Tasks from the Ground Up with Job Postings
Pith reviewed 2026-05-21 05:20 UTC · model grok-4.3
The pith
Filtering job postings data creates clearer AI skills taxonomies than using the full unfiltered corpus.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Utilizing LLMs for automated taxonomy construction presents an opportunity for mapping complex domains efficiently. Using two large-scale job postings corpora, the authors investigate how to best leverage data for optimal taxonomy construction in the case of AI skills. They propose TaxonomyBuilder as a blueprint for systematic study and evaluate configurations of custom, data-informed, and hierarchical taxonomies, demonstrating that filtering inputs provides better domain-specific coverage than unfiltered inputs to clustering and LLM-enhanced tools.
What carries the argument
TaxonomyBuilder, a proposed blueprint for systematically evaluating configurations of custom, data-informed, and hierarchical taxonomies built from job postings data.
If this is right
- Taxonomies for AI skills can achieve better coverage by selectively filtering job postings rather than using all available data.
- Data-informed approaches outperform standard clustering and LLM hierarchical labeling when inputs are filtered for relevance.
- Systematic evaluation of data inclusion decisions improves the quality of automated taxonomies in high-volume domains.
- The method can extend to systematizing skills in other rapidly growing fields using similar corpora.
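The filtering idea behind these points can be illustrated with a minimal sketch. It assumes a keyword-hit relevance score and a nearest-rank percentile cut-off; the seed terms, the function names `relevance_score` and `filter_by_percentile`, and the sample postings are all hypothetical and are not taken from the paper, which does not specify its filtering criteria here.

```python
import math

# Hypothetical seed terms; the paper's actual filtering criteria are not given here.
AI_SEED_TERMS = ("machine learning", "neural network", "nlp", "computer vision")


def relevance_score(posting: str) -> int:
    """Count how many seed terms occur in the posting (case-insensitive)."""
    text = posting.lower()
    return sum(term in text for term in AI_SEED_TERMS)


def filter_by_percentile(postings: list[str], percentile: float) -> list[str]:
    """Keep postings whose score is at or above the given percentile of all scores."""
    if not postings:
        return []
    scores = sorted(relevance_score(p) for p in postings)
    # Nearest-rank percentile: the score found at position ceil(p/100 * n).
    rank = max(1, math.ceil(percentile / 100 * len(scores)))
    threshold = scores[rank - 1]
    return [p for p in postings if relevance_score(p) >= threshold]


# Illustrative postings only.
postings = [
    "Train neural network models for computer vision and NLP products.",
    "Apply machine learning to forecast demand.",
    "Manage the warehouse loading dock.",
    "Front desk receptionist for a dental office.",
]
kept = filter_by_percentile(postings, percentile=75)
print(kept)
```

Only the postings that survive the cut-off would then be passed to clustering or LLM labeling. A real pipeline would likely use embeddings or a classifier rather than substring matching, so the score here is a stand-in.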
Where Pith is reading between the lines
- Organizations building internal skill databases might reduce noise by pre-filtering sources before taxonomy generation.
- Similar filtering principles could apply to other LLM uses on large text corpora to improve output specificity.
- Future work could test TaxonomyBuilder on different domains to confirm whether less data generally provides more clarity.
- Job platforms could integrate such filtered taxonomies for better matching of AI roles and candidates.
Load-bearing premise
The job postings corpora used are representative of actual AI skills and tasks in the workplace, and any bias or noise they contain is too small to distort the resulting taxonomy.
What would settle it
An independent evaluation where unfiltered data produces taxonomies that match or exceed the coverage of filtered ones when compared to a gold-standard set of AI skills derived from expert review or additional sources.
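One way to make this test concrete is sketched below. It assumes coverage is measured as the fraction of gold-standard skills that appear as taxonomy labels after simple normalization; the function `gold_coverage`, the matching rule, and every skill and label in the example are hypothetical, since the paper's own coverage metric is not described in this summary.

```python
def normalize(term: str) -> str:
    """Lowercase a term and collapse runs of whitespace."""
    return " ".join(term.lower().split())


def gold_coverage(taxonomy_labels, gold_skills) -> float:
    """Fraction of gold-standard skills that match a taxonomy label exactly
    after normalization (an assumed, deliberately strict matching rule)."""
    gold = {normalize(s) for s in gold_skills}
    if not gold:
        raise ValueError("gold_skills must not be empty")
    labels = {normalize(t) for t in taxonomy_labels}
    return len(gold & labels) / len(gold)


# Hypothetical gold standard and taxonomy outputs, for illustration only.
gold = ["Prompt Engineering", "Model Evaluation", "Data Labeling", "MLOps"]
filtered_taxonomy = ["prompt engineering", "model evaluation", "mlops"]
unfiltered_taxonomy = ["prompt engineering", "customer service", "forklift operation"]

filtered_cov = gold_coverage(filtered_taxonomy, gold)
unfiltered_cov = gold_coverage(unfiltered_taxonomy, gold)

# The paper's claim would be challenged if unfiltered coverage matched or exceeded filtered.
claim_challenged = unfiltered_cov >= filtered_cov
print(filtered_cov, unfiltered_cov, claim_challenged)
```

Exact matching undercounts paraphrased skills, so an independent evaluation would probably need fuzzy or embedding-based matching, or expert judgment, in place of the set intersection.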
Original abstract
Utilizing LLMs for automated taxonomy construction presents a clear opportunity for the comprehensive, yet efficient mapping of potentially complex domains. When contending with high volumes of rapidly growing corpora, however, it becomes unclear how to best leverage such data for optimal taxonomy construction. Taking the case of systematizing AI skills in the workplace, we use two large-scale job postings corpora to investigate key design decisions for the inclusion (or exclusion) of data points for taxonomy construction. We propose TaxonomyBuilder as a blueprint for our systematic study, with which we evaluate various configurations of custom, data-informed, and hierarchical taxonomies. We demonstrate that less data can provide more clarity: filtering inputs to TaxonomyBuilder provides better domain-specific coverage than offering unfiltered inputs to clustering and LLM-enhanced hierarchical taxonomy labeling tools.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes TaxonomyBuilder as a blueprint for constructing custom, data-informed, hierarchical taxonomies of AI skills and tasks from job postings. Using two large-scale job postings corpora, the authors investigate design decisions around data inclusion/exclusion and claim that filtering inputs to TaxonomyBuilder provides better domain-specific coverage than offering unfiltered inputs to clustering and LLM-enhanced hierarchical taxonomy labeling tools.
Significance. If the central empirical comparison holds under independent validation, the result would be significant for practical taxonomy construction in high-volume, rapidly evolving domains such as AI workplace skills, by demonstrating that targeted data filtering can outperform unfiltered LLM-assisted clustering pipelines.
major comments (2)
- [Abstract] Abstract: the main finding is stated without any quantitative metrics, validation procedures, or details on how 'domain-specific coverage' was measured or compared, preventing assessment of the claim.
- [Evaluation] Evaluation section (or equivalent): the demonstration that filtered inputs outperform unfiltered clustering + LLM labeling requires a reproducible, pre-registered metric for coverage (e.g., held-out test set of postings, inter-annotator agreement on expert ratings, or disjoint validation corpus). Without this, the result risks being driven by alignment between the filtering heuristic and the chosen assessment rather than genuine improvement.
minor comments (1)
- [Methods] Clarify the precise operational definition of 'domain-specific coverage' and the two corpora used (size, source, preprocessing) in the methods section for reproducibility.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback on our manuscript. We have carefully reviewed the major comments and provide point-by-point responses below, indicating where revisions will be made to improve clarity and rigor.
- Referee: [Abstract] Abstract: the main finding is stated without any quantitative metrics, validation procedures, or details on how 'domain-specific coverage' was measured or compared, preventing assessment of the claim.
  Authors: We agree that the abstract would benefit from greater specificity. In the revised version, we will update the abstract to include key quantitative metrics (such as coverage percentages on validation postings for filtered versus unfiltered pipelines) and a concise description of how domain-specific coverage was operationalized and compared. This will enable readers to assess the central claim more directly.
  Revision: yes
- Referee: [Evaluation] Evaluation section (or equivalent): the demonstration that filtered inputs outperform unfiltered clustering + LLM labeling requires a reproducible, pre-registered metric for coverage (e.g., held-out test set of postings, inter-annotator agreement on expert ratings, or disjoint validation corpus). Without this, the result risks being driven by alignment between the filtering heuristic and the chosen assessment rather than genuine improvement.
  Authors: We acknowledge the value of an explicitly reproducible metric. Our evaluation already relies on a disjoint validation corpus of job postings excluded from taxonomy construction, measuring the taxonomy's coverage of AI skills in these held-out postings. We will revise the Evaluation section to describe this procedure in greater detail, including any quantitative thresholds or agreement measures employed, to support independent reproduction and mitigate concerns about heuristic alignment. While the study was not pre-registered, the expanded description will address the core reproducibility issue.
  Revision: partial
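The held-out procedure the simulated rebuttal describes could look roughly like the sketch below. It assumes a seeded random split and defines coverage as the share of held-out postings that mention at least one taxonomy term; `split_disjoint`, `posting_coverage`, the 25% hold-out fraction, and the sample data are hypothetical choices, not details confirmed by the paper.

```python
import random


def split_disjoint(postings, holdout_fraction=0.25, seed=0):
    """Shuffle deterministically, then split into construction and held-out sets."""
    shuffled = list(postings)
    random.Random(seed).shuffle(shuffled)
    n_holdout = max(1, int(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]


def posting_coverage(taxonomy_terms, heldout_postings) -> float:
    """Share of held-out postings that mention at least one taxonomy term."""
    if not heldout_postings:
        raise ValueError("heldout_postings must not be empty")
    terms = [t.lower() for t in taxonomy_terms]
    covered = sum(any(t in p.lower() for t in terms) for p in heldout_postings)
    return covered / len(heldout_postings)


# Illustrative corpus of unique postings.
corpus = [f"posting {i}" for i in range(8)]
construction, heldout = split_disjoint(corpus)

# Illustrative taxonomy terms and held-out postings.
terms = ["language model", "model evaluation"]
heldout_example = [
    "Fine-tune language models for search.",
    "Drive a delivery truck.",
    "Build model evaluation pipelines.",
    "Label images for a training set.",
]
print(posting_coverage(terms, heldout_example))
```

Fixing the seed and reporting the split alongside the metric is what would let a third party reproduce the number, which is the referee's underlying concern. It does not by itself rule out alignment between the filtering heuristic and the metric.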
Circularity Check
No circularity: empirical comparison of taxonomy construction methods
The paper conducts an empirical study comparing TaxonomyBuilder configurations on two job postings corpora, evaluating filtered versus unfiltered inputs for domain-specific coverage in AI skills taxonomies. No equations, derivations, or self-definitional reductions are present. The central claim rests on direct comparison of outputs from data-driven processes rather than any fitted parameter or self-citation chain that collapses back to the inputs by construction. The work is self-contained as a standard empirical evaluation of design choices in automated taxonomy building.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Job postings corpora accurately reflect current AI skills and tasks in the workplace.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, reality_from_one_distinction (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "We propose TAXONOMYBUILDER as a blueprint... filtering inputs to TAXONOMYBUILDER provides better domain-specific coverage than offering unfiltered inputs to clustering and LLM-enhanced hierarchical taxonomy labeling tools."
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "less data can provide more clarity: filtering inputs... percentile filtering"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.