Jobs' AI Exposure Should Be Measured from Evidence, Not Model Priors
Pith reviewed 2026-05-19 14:25 UTC · model grok-4.3
The pith
AI job exposure should be measured with retrieved evidence of real capabilities rather than zero-shot LLM assertions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A retrieval-augmented framework assigns AI exposure labels to occupation-task pairs by consulting retrieved news articles and academic abstracts. Both human and automatic judges prefer its assessments over zero-shot baselines, and its scores align more closely with actual AI usage in practice.
What carries the argument
The retrieval-augmented framework supplies retrieved documents as external evidence to open-weight reasoning and instruct models when they label each O*NET occupation-task pair for AI exposure.
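To make the contrast between the grounded and zero-shot conditions concrete, here is a minimal Python sketch of how a labeling prompt might be assembled with and without retrieved evidence. The `Evidence` type, the `build_prompt` helper, and the prompt wording are all hypothetical; the paper's actual prompts and retrieval pipeline are not reproduced on this page.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Evidence:
    """One retrieved document snippet (hypothetical structure)."""
    source: str  # e.g. "news" or "abstract"
    text: str


def build_prompt(occupation: str, task: str, evidence: List[Evidence]) -> str:
    """Assemble a labeling prompt; with no evidence it reduces to zero-shot."""
    lines = [f"Occupation: {occupation}", f"Task: {task}"]
    if evidence:
        # Grounded condition: number each snippet so the label can cite it.
        lines.append("Retrieved documents on current AI capabilities:")
        for i, ev in enumerate(evidence, start=1):
            lines.append(f"[{i}] ({ev.source}) {ev.text}")
        lines.append("Assign an AI exposure label and cite the items you relied on.")
    else:
        # Zero-shot condition: the model has only its priors to go on.
        lines.append("Assign an AI exposure label.")
    return "\n".join(lines)


# Illustrative inputs only; the snippet text is invented.
docs = [Evidence("news", "Vendors report automated form extraction tools.")]
print(build_prompt("Data Entry Keyers", "Enter records into a database", docs))
```

Numbering the evidence items is one simple way to leave an inspectable trail from a label back to the documents behind it.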
If this is right
- AI exposure measurements must meet standards of reproducibility, external grounding, and inspectability.
- Theoretical exposure scores should be reassessed periodically as new evidence of capabilities emerges.
- Policy and workforce planning should draw on validated, evidence-linked labels rather than model priors alone.
- Grounded methods yield scores that better reflect plausible current uses of AI systems.
Where Pith is reading between the lines
- The same retrieval approach could be applied to emerging task taxonomies or international job classifications to test consistency across contexts.
- Ongoing monitoring of new publications could automate updates to exposure scores without full re-labeling each cycle.
- Workers and unions might use inspectable evidence trails to negotiate training programs tied to documented capability gaps.
- Integration with labor-market data on actual AI tool adoption could create closed-loop validation for the framework.
Load-bearing premise
The retrieved news articles and academic abstracts supply sufficient, representative, and unbiased evidence of what current AI systems can actually do across tasks.
What would settle it
A new dataset or study demonstrating that zero-shot LLM exposure labels predict observed real-world AI adoption rates more accurately than the retrieval-augmented labels would falsify the superiority claim.
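This test amounts to checking which set of labels tracks observed adoption more closely. The Python sketch below assumes Spearman rank correlation as the alignment metric, which is an assumption rather than the paper's stated method, and every number in it is made up for illustration.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Plain Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation computed as Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented per-occupation scores, NOT taken from the paper.
observed_adoption = [0.70, 0.40, 0.55, 0.10]
zero_shot_scores = [0.90, 0.80, 0.30, 0.20]
grounded_scores = [0.80, 0.35, 0.60, 0.15]

print("zero-shot:", spearman(zero_shot_scores, observed_adoption))
print("grounded: ", spearman(grounded_scores, observed_adoption))
```

If the zero-shot correlation came out reliably higher on real adoption data, that would count against the superiority claim.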
Original abstract
This position paper argues that job exposure to AI should be measured with grounded, evidence-based methods, not inferred from LLM priors alone. Current theoretical exposure measures use zero-shot prompting to classify task-level AI exposure, generating labels with no explicit evidence, no transparent chain of reasoning, and no external validation. The stakes of these measurements are too high to rely on such methods, as they influence policy making, where public and private funds are directed, and how workers understand their future prospects. We therefore argue that AI capability claims should meet three standards: reproducibility, external grounding, and inspectability. We propose a retrieval-augmented framework that assigns AI exposure labels to all 18,796 occupation-task pairs in O*NET 30.2, using open-weight reasoning and instruct models with retrieved news articles and academic paper abstracts as evidence of current AI capabilities. Relative to a zero-shot baseline, the grounded condition is preferred in over 72% of disagreement cases under both automatic and human evaluation, and yields scores that align more closely with observed real-world AI usage. Taken together, these findings suggest that evidence-grounded measurement better captures what current AI systems can plausibly do in practice, rather than what a model asserts without external evidence. Because AI capabilities continue to change, the measurements used to inform policy must evolve with them: theoretical AI exposure scores should be periodically reassessed, not inherited as immutable ground truth.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This position paper argues that AI exposure for jobs should be measured using a retrieval-augmented framework drawing on news articles and academic paper abstracts as evidence of current capabilities, rather than zero-shot LLM prompting on O*NET tasks. It applies the method to all 18,796 occupation-task pairs in O*NET 30.2, reports that the evidence-grounded labels are preferred over a zero-shot baseline in over 72% of disagreement cases under both automatic and human evaluation, and produce scores that align more closely with observed real-world AI usage.
Significance. If the empirical support holds, the work offers a more reproducible, inspectable, and externally grounded alternative to purely model-prior-based exposure scores. This is relevant for policy applications where such measurements influence funding and worker expectations. The use of open-weight models and explicit retrieval is a methodological strength that supports the reproducibility and inspectability standards the authors advocate.
major comments (3)
- [Abstract / Evaluation] Abstract and evaluation description: the reported 72% preference rate and improved real-world alignment are central to the claim, yet no information is provided on the number of disagreement cases, human evaluation sample size, inter-rater reliability, or controls for retrieval quality. These omissions leave the strength of the empirical comparison difficult to assess.
- [Retrieval-augmented framework] Description of the retrieval-augmented framework: the central claim requires that retrieved documents supply sufficient, task-specific evidence for labeling all 18,796 O*NET pairs. The manuscript does not discuss or quantify coverage gaps for low-visibility or highly specialized tasks, which could force fallback to model priors and introduce selection effects not present in the zero-shot baseline.
- [Results] Results section on real-world alignment: the claim that the method 'aligns more closely with observed real-world AI usage' is load-bearing for the superiority argument, but the specific metrics, data sources, and statistical comparison used for this alignment are not detailed.
minor comments (1)
- [Abstract] The abstract could explicitly state the total number of occupation-task pairs early for reader orientation.
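The first major comment asks for the number of disagreement cases and for inter-rater reliability. As a hedged illustration of why those figures matter, the Python sketch below computes a Wilson score interval for a preference rate and Cohen's kappa for two raters. The counts and labels are invented, and the choice of these two statistics is an assumption, not the authors' stated procedure.

```python
import math
from collections import Counter
from typing import Sequence, Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def cohens_kappa(rater_a: Sequence[str], rater_b: Sequence[str]) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    labels = set(count_a) | set(count_b)
    expected = sum(count_a[l] * count_b[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# The same 72% preference rate under two hypothetical sample sizes.
print(wilson_interval(72, 100))
print(wilson_interval(720, 1000))

# Hypothetical judgments: which condition each rater preferred per case.
a = ["grounded", "grounded", "zero-shot", "grounded", "zero-shot", "grounded"]
b = ["grounded", "zero-shot", "zero-shot", "grounded", "zero-shot", "grounded"]
print(cohens_kappa(a, b))
```

The same headline rate is far less certain at a small sample size, which is why the referee's request for the underlying counts is substantive rather than cosmetic.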
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and for recognizing the potential value of an evidence-grounded approach to measuring AI exposure. We address each major comment below and have prepared revisions to improve the clarity and completeness of the empirical sections.
Point-by-point responses
-
Referee: [Abstract / Evaluation] Abstract and evaluation description: the reported 72% preference rate and improved real-world alignment are central to the claim, yet no information is provided on the number of disagreement cases, human evaluation sample size, inter-rater reliability, or controls for retrieval quality. These omissions leave the strength of the empirical comparison difficult to assess.
Authors: We agree that these details are necessary for readers to evaluate the strength of the reported 72% preference and alignment results. The current manuscript states the overall preference rate but does not break out the supporting statistics in the abstract or evaluation summary. In the revised version we will add the exact number of disagreement cases, the human evaluation sample size, inter-rater reliability (including any kappa or agreement metric), and a concise description of retrieval-quality controls such as relevance thresholds and spot-checking procedures. These additions will be placed in both the abstract and a new short evaluation subsection. revision: yes
-
Referee: [Retrieval-augmented framework] Description of the retrieval-augmented framework: the central claim requires that retrieved documents supply sufficient, task-specific evidence for labeling all 18,796 O*NET pairs. The manuscript does not discuss or quantify coverage gaps for low-visibility or highly specialized tasks, which could force fallback to model priors and introduce selection effects not present in the zero-shot baseline.
Authors: The referee correctly notes that coverage gaps are not quantified. Our framework retrieves from news and academic abstracts and falls back to the model only when no relevant evidence is found; however, we do not currently report the fraction of tasks with insufficient retrieval or analyze whether this introduces differential selection relative to zero-shot. In the revision we will add a coverage analysis that reports retrieval success rates stratified by task visibility or specialization level and discuss any implications for comparability with the baseline. revision: yes
-
Referee: [Results] Results section on real-world alignment: the claim that the method 'aligns more closely with observed real-world AI usage' is load-bearing for the superiority argument, but the specific metrics, data sources, and statistical comparison used for this alignment are not detailed.
Authors: We acknowledge that the real-world alignment claim requires explicit methodological detail. The manuscript asserts closer alignment but does not specify the metrics, external usage data, or statistical tests. In the revised results section we will report the precise alignment metrics (e.g., correlation or rank agreement), the source of the observed usage indicators, and the statistical procedure used to compare the two labeling methods, including any significance assessment. revision: yes
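The coverage analysis promised in the second response could take roughly the following shape: for each stratum of tasks, report the share of occupation-task pairs for which retrieval returned enough documents. This Python sketch is only an illustration; the stratum names, the `min_docs` threshold, and the record layout are hypothetical and do not come from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def coverage_by_stratum(
    records: Iterable[Tuple[str, int]], min_docs: int = 1
) -> Dict[str, float]:
    """Share of pairs per stratum with at least `min_docs` retrieved documents.

    Each record is (stratum, number_of_retrieved_documents).
    """
    totals: Dict[str, int] = defaultdict(int)
    covered: Dict[str, int] = defaultdict(int)
    for stratum, n_docs in records:
        totals[stratum] += 1
        if n_docs >= min_docs:
            covered[stratum] += 1
    return {s: covered[s] / totals[s] for s in totals}


# Invented retrieval counts for a handful of occupation-task pairs.
records = [
    ("high-visibility", 12),
    ("high-visibility", 3),
    ("specialized", 0),
    ("specialized", 1),
    ("specialized", 0),
]
for stratum, rate in coverage_by_stratum(records).items():
    print(f"{stratum}: {rate:.2f}")
```

A large gap between strata would indicate how often the grounded condition silently falls back to model priors, which is the selection effect the referee raises.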
Circularity Check
No circularity: results from independent empirical comparisons
Full rationale
The paper's central derivation proposes a retrieval-augmented labeling framework for O*NET tasks and supports its superiority via direct head-to-head preference evaluations (human and automatic) against a zero-shot baseline plus alignment checks against observed real-world AI usage statistics. These validation steps are external to the framework itself and do not reduce to fitted parameters, self-definitions, or self-citation chains. No load-bearing premise relies on prior work by the same authors; the argument remains self-contained against the stated benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: retrieved news articles and academic abstracts provide sufficient and unbiased evidence of current AI capabilities for task-level judgments.
Reference graph
Works this paper leans on
-
[1]
Daron Acemoglu and Pascual Restrepo. Automation and new tasks: How technology displaces and reinstates labor. Journal of Economic Perspectives, 33(2):3–30, May 2019. doi: 10.1257/jep.33.2.3. URL https://www.aeaweb.org/articles?id=10.1257/jep.33.2.3
-
[2]
David H. Autor, Frank Levy, and Richard J. Murnane. The skill content of recent technological change: An empirical exploration. The Quarterly Journal of Economics, 118(4):1279–1333, November 2003. ISSN 0033-5533. doi: 10.1162/003355303322552801. URL https://doi.org/10.1162/003355303322552801
-
[3]
Su Lin Blodgett, Gilsinia Lopez, Alexandra Olteanu, Robert Sim, and Hanna Wallach. Stereotyping Norwegian salmon: An inventory of pitfalls in fairness benchmark datasets. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long P...
-
[4]
Rishi Bommasani. NeurIPS should lead scientific consensus on AI policy. arXiv preprint arXiv:2510.00075, 2025
-
[5]
Pierre Bouquet and Yossi Sheffi. Measuring the intensive margin of work: Task shares and concentration. Research Paper 2026/004, MIT Center for Transportation & Logistics, February 2026.
-
[7]
Pierre Bouquet, Yossi Sheffi, and Amin Kaboli. News sentiment as a dynamic predictor of job automation risk. Research Paper 2026/002, MIT Center for Transportation & Logistics, January 2026. Available at SSRN: https://ssrn.com/abstract=6168446.
-
[9]
Erik Brynjolfsson, Danielle Li, and Lindsey R. Raymond. Generative AI at work. Quarterly Journal of Economics, 140(2):889–942, 2025. doi: 10.1093/qje/qjae044
-
[10]
Mauro Cazzaniga. Gen-AI. Staff Discussion Notes, 2024(001):1, January 2024. ISSN 2617-6750. doi: 10.5089/9798400262548.006. URL http://dx.doi.org/10.5089/9798400262548.006
-
[11]
Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. BGE M3-Embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation, 2024
-
[12]
Jianlyu Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. M3-Embedding: Multi-linguality, multi-functionality, multi-granularity text embeddings through self-knowledge distillation. In Findings of the Association for Computational Linguistics: ACL 2024, 2024. URL https://aclanthology.org/2024.findings-acl.137/
-
[13]
Fabrizio Dell’Acqua, Edward McFowland, Ethan R. Mollick, Hila Lifshitz-Assaf, Katherine Kellogg, Saran Rajendran, Lisa Krayer, François Candelon, and Karim R. Lakhani. Navigating the jagged technological frontier: Field experimental evidence of the effects of AI on knowledge worker productivity and quality. Working Paper 24-013, Harvard Business School, 2...
-
[14]
Tyna Eloundou, Sam Manning, Pamela Mishkin, and Daniel Rock. GPTs are GPTs: Labor market impact potential of LLMs. Science, 384(6702):1306–1308, 2024
-
[15]
Edward Felten, Manav Raj, and Robert Seamans. Occupational, industry, and geographic exposure to artificial intelligence: A novel dataset and its potential uses. Strategic Management Journal, 42(12):2195–2217, 2021. doi: 10.1002/smj.3286. URL https://sms.onlinelibrary.wiley.com/doi/abs/10.1002/smj.3286
-
[16]
Edward W. Felten, Manav Raj, and Robert Seamans. Occupational heterogeneity in exposure to generative AI. Technical report, SSRN, April 2023. URL https://ssrn.com/abstract=4414065. Available at SSRN
-
[17]
Gallup. Frequent use of AI in the workplace continued to rise in Q4. https://www.gallup.com/workplace/701195/frequent-workplace-continued-rise.aspx, 2026. Accessed 2026-04-17
-
[18]
Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Qianyu Guo, Meng Wang, and Haofen Wang. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997, 2024. doi: 10.48550/arXiv.2312.10997. URL https://arxiv.org/abs/2312.10997
-
[19]
Gemma Team, Google DeepMind. Gemma 4 model card, 2026. URL https://ai.google.dev/gemma/docs/core/model_card_4. Accessed: April 21, 2026
-
[20]
Pawel Gmyrek, Janine Berg, Karol Kaminski, Filip Konopczyński, Agnieszka Ładna, Balint Nafradi, Konrad Rosłaniec, and Marek Troszyński. Generative AI and jobs: A refined global index of occupational exposure. ILO Research Brief, International Labour Organization, Geneva, 2025. URL https://www.ilo.org/publications/generative-ai-and-jobs-refined-global...
-
[21]
Abigail Z. Jacobs and Hanna Wallach. Measurement and fairness. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pages 3–10, Virtual Event, Canada, 2021. Association for Computing Machinery. doi: 10.1145/3442188.3445901. URL https://doi.org/10.1145/3442188.3445901
-
[22]
Seungone Kim, Juyoung Suk, Shayne Longpre, Bill Yuchen Lin, Jamin Shin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, and Minjoon Seo. Prometheus 2: An open source language model specialized in evaluating other language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 4334–4353, 2024.
-
[23]
Rodney Michael Kinney, Chloe Anastasiades, Russell Authur, Iz Beltagy, Jonathan Bragg, Alexandra Buraczynski, Isabel Cachola, Stefan Candra, Yoganand Chandrasekhar, Arman Cohan, Miles Crawford, Doug Downey, Jason Dunkelberger, Oren Etzioni, Rob Evans, Sergey Feldman, Joseph Gorney, David W. Graham, F.Q. Hu, Regan Huff, Daniel King, Sebastian Kohlmeier, Ba...
-
[24]
Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems, 35:22199–22213, 2022
-
[25]
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023
-
[26]
Kalev Leetaru and Philip Schrodt. GDELT: Global data on events, language, and tone, 1979-. In International Studies Association Annual Conference, San Francisco, CA, 2013. URL https://www.gdeltproject.org/
-
[28]
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks, 2021. URL https://arxiv.org/abs/2005.11401
-
[29]
Chaofan Li, Zheng Liu, Shitao Xiao, and Yingxia Shao. Making large language models a better foundation for dense retrieval, 2023
-
[30]
Minghan Li, Xinxuan Lv, Junjie Zou, Tongna Chen, Chao Zhang, Suchao An, Ercong Nie, and Guodong Zhou. Query expansion in the age of pre-trained and large language models: A comprehensive survey. arXiv preprint arXiv:2509.07794, 2025
-
[31]
Rensis Likert. A technique for the measurement of attitudes. Archives of Psychology, 22(140):55, 1932
-
[32]
Alexander H Liu, Kartik Khandelwal, Sandeep Subramanian, Victor Jouault, Abhinav Rastogi, Adrien Sadé, Alan Jeffares, Albert Jiang, Alexandre Cahill, Alexandre Gavaudan, et al. Ministral 3. arXiv preprint arXiv:2601.08584, 2026
-
[33]
Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-Eval: NLG evaluation using GPT-4 with better human alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2511–2522, Singapore, 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.153. URL https://aclanthology.org/2023.emnlp-main.153/
-
[35]
Maxim Massenkoff and Peter McCrory. Labor market impacts of AI: A new measure and early evidence. Anthropic Research, 5, 2026
-
[36]
National Center for O*NET Development. O*NET 30.2 database. O*NET Resource Center, 2026. URL https://www.onetcenter.org/database.html. U.S. Department of Labor, Employment and Training Administration (USDOL/ETA). Accessed 30 April 2026. Licensed under CC BY 4.0
-
[38]
Shakked Noy and Whitney Zhang. Experimental evidence on the productivity effects of generative artificial intelligence. Science, 381(6654):187–192, 2023. doi: 10.1126/science.adh2586.
-
[39]
OECD. OECD Employment Outlook 2023: Artificial Intelligence and the Labour Market. OECD Publishing, Paris, 2023. doi: 10.1787/08785bba-en. URL https://doi.org/10.1787/08785bba-en
-
[40]
OpenAI. Introducing GPT-5.4 mini and nano. https://openai.com/index/introducing-gpt-5-4-mini-and-nano/, March 17 2026. Accessed: May 5, 2026
-
[41]
Kurt Shuster, Spencer Poff, Moya Chen, Douwe Kiela, and Jason Weston. Retrieval augmentation reduces hallucination in conversation, 2021. URL https://arxiv.org/abs/2104.07567
-
[42]
The Budget Lab at Yale. Labor market AI exposure: What do we know? Technical report, Yale Budget Lab, February 2026. URL https://budgetlab.yale.edu/research/labor-market-ai-exposure-what-do-we-know
-
[45]
Michael Webb. The impact of artificial intelligence on the labor market. SSRN Working Paper, November 2019. URL https://ssrn.com/abstract=3482150
-
[46]
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022
-
[47]
World Economic Forum. The future of jobs report 2025. Technical report, World Economic Forum, Geneva, Switzerland, January 2025. URL https://reports.weforum.org/docs/WEF_Future_of_Jobs_Report_2025.pdf
-
[48]
Weikai Xu, Chengrui Huang, Shen Gao, and Shuo Shang. LLM-based agents for tool learning: A survey. Data Science and Engineering, pages 1–31, 2025
-
[49]
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025
-
[50]
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and chatbot arena. In Advances in Neural Information Processing Systems, 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/hash/91f18a1287b398d378ef22505bf41832-Abstract-Datasets_and_Benchmarks.html.
A Rubric and Model Inputs
This appendix provides supplementary material for the labeling framework, retrieval corpus construction, and evaluation procedures used in the main paper. We first document the exposure rubric ...
... for Gemma, Qwen and Ministral models.
Occupation                          Observed   Theoretical AI exposure (%)
                                               Gemma    Qwen     Ministral   Eloundou et al.
Computer Programmers                74.5%      88.3%    78.1%    88.7%       95.0%
Customer Service Representatives    70.1%      50.0%    53.9%    87.9%       56.8%
Data Entry Keyers                   67.1%      51.5%    39.4%    48.5%       89.3%
Medical Records Specialists         66.7%      47.1%    58.8%    58.8%       61.8%
Market Researc...