Jobs' AI Exposure Should Be Measured from Evidence, Not Model Priors

Luca Mouchel; Pierre Bouquet; Yossi Sheffi

arxiv: 2605.15474 · v1 · pith:J23ECVWPnew · submitted 2026-05-14 · 💻 cs.IR

Pith reviewed 2026-05-19 14:25 UTC · model grok-4.3

keywords: AI job exposure · retrieval-augmented evaluation · evidence-based measurement · O*NET tasks · LLM priors · AI capabilities assessment · policy implications

The pith

AI job exposure should be measured with retrieved evidence of real capabilities rather than zero-shot LLM assertions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that assessments of how AI affects jobs must rely on explicit evidence from documents such as news articles and research abstracts instead of depending only on what language models claim without support. Current zero-shot methods label tasks without transparent reasoning or external checks, yet these scores shape policy funding and workers' career expectations. The authors present a retrieval-augmented framework that uses open-weight models to evaluate all 18,796 O*NET occupation-task pairs against pulled-in evidence of current AI performance. Human and automatic evaluations favor the evidence-based labels in more than 72 percent of cases where the two methods disagree, and the resulting scores track observed real-world AI adoption more closely. Because capabilities change, the paper concludes that exposure measures require periodic re-evaluation rather than permanent status.

Core claim

A retrieval-augmented framework that assigns AI exposure labels to occupation-task pairs by consulting retrieved news articles and academic abstracts produces assessments that both human and automatic judges prefer over zero-shot baselines and that align more closely with actual AI usage in practice.

What carries the argument

The retrieval-augmented framework that supplies retrieved documents as external evidence to open-weight reasoning and instruct models when labeling each O*NET occupation-task pair for AI exposure.
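
To make the mechanism concrete, here is a minimal Python sketch of how retrieved documents might be packed into a labeling prompt for one occupation-task pair. It is an illustration, not the authors' implementation: the `Evidence` class, the `build_labeling_prompt` function, and the prompt wording are all hypothetical, and the paper's actual rubric and retrieval stack are not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    """One retrieved document (hypothetical structure)."""
    source: str  # e.g. "news" or "abstract"
    text: str


def build_labeling_prompt(
    occupation: str, task: str, evidence: list[Evidence]
) -> tuple[str, bool]:
    """Return (prompt, grounded).

    grounded is False when nothing was retrieved, in which case the
    prompt is equivalent to a zero-shot request.
    """
    header = f"Occupation: {occupation}\nTask: {task}\n"
    if not evidence:
        # No retrieved documents: the model can only rely on its prior.
        return header + "Assign an AI exposure label for this task.", False

    # Number each document so the model's rationale can cite it.
    lines = [
        f"[{i}] ({doc.source}) {doc.text}"
        for i, doc in enumerate(evidence, start=1)
    ]
    prompt = (
        header
        + "Evidence of current AI capabilities:\n"
        + "\n".join(lines)
        + "\nAssign an AI exposure label for this task, "
        + "citing evidence items by number."
    )
    return prompt, True
```

Returning the `grounded` flag alongside the prompt records which pairs were labeled with evidence and which fell back to the model's prior, so the two groups can be inspected separately.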

If this is right

AI exposure measurements must meet standards of reproducibility, external grounding, and inspectability.
Theoretical exposure scores should be reassessed periodically as new evidence of capabilities emerges.
Policy and workforce planning should draw on validated, evidence-linked labels rather than model priors alone.
Grounded methods yield scores that better reflect plausible current uses of AI systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same retrieval approach could be applied to emerging task taxonomies or international job classifications to test consistency across contexts.
Ongoing monitoring of new publications could automate updates to exposure scores without full re-labeling each cycle.
Workers and unions might use inspectable evidence trails to negotiate training programs tied to documented capability gaps.
Integration with labor-market data on actual AI tool adoption could create closed-loop validation for the framework.

Load-bearing premise

The retrieved news articles and academic abstracts supply sufficient, representative, and unbiased evidence of what current AI systems can actually do across tasks.

What would settle it

A new dataset or study demonstrating that zero-shot LLM exposure labels predict observed real-world AI adoption rates more accurately than the retrieval-augmented labels would falsify the superiority claim.
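
One way to run such a test is to rank-correlate each label set with observed adoption rates and compare the two coefficients. The sketch below implements Spearman's rank correlation in plain Python. The function names are illustrative, and the paper does not state that this is the statistic it uses.

```python
def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the window over a run of tied values.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

With per-occupation lists `grounded`, `zero_shot`, and `observed` (hypothetical names), the superiority claim would be in trouble if `spearman(zero_shot, observed)` consistently exceeded `spearman(grounded, observed)` on data the framework had not already been evaluated on.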

Figures

Figures reproduced from arXiv: 2605.15474 by Luca Mouchel, Pierre Bouquet, Yossi Sheffi.

**Figure 2.** Industry-level alignment between theoretical AI exposure (x-axis) and observed Claude ... [PITH_FULL_IMAGE:figures/full_fig_p006_2.png]

**Figure 3.** Pairwise preference judgments on occupation–task pairs for which the context and no ... [PITH_FULL_IMAGE:figures/full_fig_p007_3.png]

**Figure 4.** Task-level exposure label distributions: with-context models vs. Eloundou et al. [12]. [PITH_FULL_IMAGE:figures/full_fig_p008_4.png]

**Figure 5.** Transition-specific pairwise preferences on disagreement cases. Each bubble corresponds ... [PITH_FULL_IMAGE:figures/full_fig_p015_5.png]

**Figure 6.** Human annotation interface used for pairwise evaluation on disagreement cases. Annotators ... [PITH_FULL_IMAGE:figures/full_fig_p017_6.png]

**Figure 7.** Comparison of the five most covered and five least covered job families by retrieved context ... [PITH_FULL_IMAGE:figures/full_fig_p018_7.png]

Original abstract

This position paper argues that job exposure to AI should be measured with grounded, evidence-based methods, not inferred from LLM priors alone. Current theoretical exposure measures use zero-shot prompting to classify task-level AI exposure, generating labels with no explicit evidence, no transparent chain of reasoning, and no external validation. The stakes of these measurements are too high to rely on such methods, as they influence policy making, where public and private funds are directed, and how workers understand their future prospects. We therefore argue that AI capability claims should meet three standards: reproducibility, external grounding, and inspectability. We propose a retrieval-augmented framework that assigns AI exposure labels to all 18,796 occupation–task pairs in O*NET 30.2, using open-weight reasoning and instruct models with retrieved news articles and academic paper abstracts as evidence of current AI capabilities. Relative to a zero-shot baseline, the grounded condition is preferred in over 72% of disagreement cases under both automatic and human evaluation, and yields scores that align more closely with observed real-world AI usage. Taken together, these findings suggest that evidence-grounded measurement better captures what current AI systems can plausibly do in practice, rather than what a model asserts without external evidence. Because AI capabilities continue to change, the measurements used to inform policy must evolve with them: theoretical AI exposure scores should be periodically reassessed, not inherited as immutable ground truth.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Retrieval-augmented labeling improves on zero-shot prompting for AI job exposure, but the evaluation is too lightly documented to carry the main claims.

The punchline on this paper is that retrieval-augmented generation can produce AI exposure labels for O*NET tasks that people and models prefer over zero-shot prompting, and that line up better with actual AI use in the wild. They ran it on the full set of 18,796 pairs using news articles and paper abstracts as evidence. What stands out as new is the end-to-end application of this grounded approach at scale with open models.

The paper does well at explaining why policy decisions need transparent, evidence-based scores instead of black-box model assertions. The three standards they set (reproducibility, external grounding, and inspectability) make sense, and showing the preference rate and real-world alignment gives a concrete reason to take the idea seriously.

The soft spots are mostly in how lightly the evaluation is documented. We get the 72% preference but no breakdown on the number of comparisons, the raters' expertise, or what happens when retrieval turns up little relevant material. That last point matters because the stress-test worry about coverage gaps for obscure tasks could mean the method only shines where evidence is easy to find. Without error analysis or controls for retrieval quality, it's hard to know how much the improvement holds up across the whole dataset.

This work is aimed at labor economists and policymakers who rely on these exposure metrics for reskilling plans. Readers looking for a methodological alternative to pure LLM priors will get value from the proposal and the initial results. The thinking is clear and the engagement with the problem is honest, so it deserves a serious referee to sort out the evaluation gaps. I recommend putting it through peer review with specific asks for full methods, sample details, and checks on retrieval coverage.

Referee Report

3 major / 1 minor

Summary. This position paper argues that AI exposure for jobs should be measured using a retrieval-augmented framework drawing on news articles and academic paper abstracts as evidence of current capabilities, rather than zero-shot LLM prompting on O*NET tasks. It applies the method to all 18,796 occupation-task pairs in O*NET 30.2, reports that the evidence-grounded labels are preferred over a zero-shot baseline in over 72% of disagreement cases under both automatic and human evaluation, and produce scores that align more closely with observed real-world AI usage.

Significance. If the empirical support holds, the work offers a more reproducible, inspectable, and externally grounded alternative to purely model-prior-based exposure scores. This is relevant for policy applications where such measurements influence funding and worker expectations. The use of open-weight models and explicit retrieval is a methodological strength that supports the reproducibility and inspectability standards the authors advocate.

major comments (3)

[Abstract / Evaluation] Abstract and evaluation description: the reported 72% preference rate and improved real-world alignment are central to the claim, yet no information is provided on the number of disagreement cases, human evaluation sample size, inter-rater reliability, or controls for retrieval quality. These omissions leave the strength of the empirical comparison difficult to assess.
[Retrieval-augmented framework] Description of the retrieval-augmented framework: the central claim requires that retrieved documents supply sufficient, task-specific evidence for labeling all 18,796 O*NET pairs. The manuscript does not discuss or quantify coverage gaps for low-visibility or highly specialized tasks, which could force fallback to model priors and introduce selection effects not present in the zero-shot baseline.
[Results] Results section on real-world alignment: the claim that the method 'aligns more closely with observed real-world AI usage' is load-bearing for the superiority argument, but the specific metrics, data sources, and statistical comparison used for this alignment are not detailed.
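
The quantities requested in the first comment are simple to compute once the raw judgments are available. The following sketch, with hypothetical function names and label strings, shows a preference rate reported together with its denominator, and Cohen's kappa as one possible agreement statistic for two annotators. The manuscript does not say which agreement metric, if any, was used.

```python
from collections import Counter


def preference_rate(judgments, preferred="grounded"):
    """Return (share of judgments favoring `preferred`, number of judgments)."""
    n = len(judgments)
    if n == 0:
        raise ValueError("no disagreement cases to evaluate")
    return sum(j == preferred for j in judgments) / n, n


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters over the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty item set")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if each rater labeled independently at their own rates.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)
```

Reporting the count next to the rate matters here because a 72% preference over a few dozen disagreement cases and over several thousand would support very different levels of confidence.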

minor comments (1)

[Abstract] The abstract could explicitly state the total number of occupation-task pairs early for reader orientation.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the potential value of an evidence-grounded approach to measuring AI exposure. We address each major comment below and have prepared revisions to improve the clarity and completeness of the empirical sections.

Referee: [Abstract / Evaluation] Abstract and evaluation description: the reported 72% preference rate and improved real-world alignment are central to the claim, yet no information is provided on the number of disagreement cases, human evaluation sample size, inter-rater reliability, or controls for retrieval quality. These omissions leave the strength of the empirical comparison difficult to assess.

Authors: We agree that these details are necessary for readers to evaluate the strength of the reported 72% preference and alignment results. The current manuscript states the overall preference rate but does not break out the supporting statistics in the abstract or evaluation summary. In the revised version we will add the exact number of disagreement cases, the human evaluation sample size, inter-rater reliability (including any kappa or agreement metric), and a concise description of retrieval-quality controls such as relevance thresholds and spot-checking procedures. These additions will be placed in both the abstract and a new short evaluation subsection.

Revision: yes

Referee: [Retrieval-augmented framework] Description of the retrieval-augmented framework: the central claim requires that retrieved documents supply sufficient, task-specific evidence for labeling all 18,796 O*NET pairs. The manuscript does not discuss or quantify coverage gaps for low-visibility or highly specialized tasks, which could force fallback to model priors and introduce selection effects not present in the zero-shot baseline.

Authors: The referee correctly notes that coverage gaps are not quantified. Our framework retrieves from news and academic abstracts and falls back to the model only when no relevant evidence is found; however, we do not currently report the fraction of tasks with insufficient retrieval or analyze whether this introduces differential selection relative to zero-shot. In the revision we will add a coverage analysis that reports retrieval success rates stratified by task visibility or specialization level and discuss any implications for comparability with the baseline.

Revision: yes

Referee: [Results] Results section on real-world alignment: the claim that the method 'aligns more closely with observed real-world AI usage' is load-bearing for the superiority argument, but the specific metrics, data sources, and statistical comparison used for this alignment are not detailed.

Authors: We acknowledge that the real-world alignment claim requires explicit methodological detail. The manuscript asserts closer alignment but does not specify the metrics, external usage data, or statistical tests. In the revised results section we will report the precise alignment metrics (e.g., correlation or rank agreement), the source of the observed usage indicators, and the statistical procedure used to compare the two labeling methods, including any significance assessment.

Revision: yes
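
The coverage analysis promised in the second response could take roughly the following shape. This is a sketch under assumptions: the record layout, the `min_docs` threshold, and the use of job family as the stratification variable are illustrative choices, not details taken from the manuscript.

```python
from collections import defaultdict


def coverage_by_family(records, min_docs=1):
    """Fraction of occupation-task pairs per job family with enough evidence.

    records: iterable of (job_family, n_retrieved_docs), one per pair.
    A pair counts as covered when n_retrieved_docs >= min_docs.
    """
    totals = defaultdict(int)
    covered = defaultdict(int)
    for family, n_docs in records:
        totals[family] += 1
        if n_docs >= min_docs:
            covered[family] += 1
    # Families with no covered pairs report 0.0 rather than being dropped.
    return {family: covered[family] / totals[family] for family in totals}
```

Families with low coverage are the ones where the grounded condition is most likely to behave like the zero-shot baseline, so they are the natural place to check whether the reported preference gap still holds.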

Circularity Check

0 steps flagged

No circularity: results from independent empirical comparisons

The paper's central derivation proposes a retrieval-augmented labeling framework for O*NET tasks and supports its superiority via direct head-to-head preference evaluations (human and automatic) against a zero-shot baseline plus alignment checks against observed real-world AI usage statistics. These validation steps are external to the framework itself and do not reduce to fitted parameters, self-definitions, or self-citation chains. No load-bearing premise relies on prior work by the same authors; the argument remains self-contained against the stated benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper introduces no new free parameters or invented entities. It rests on one main domain assumption about the quality of retrieved evidence.

axioms (1)

domain assumption: Retrieved news articles and academic abstracts provide sufficient and unbiased evidence of current AI capabilities for task-level judgments.
This premise underpins the entire retrieval-augmented framework and is required for the claim that the method is more grounded than zero-shot prompting.

pith-pipeline@v0.9.0 · 5785 in / 1287 out tokens · 57287 ms · 2026-05-19T14:25:23.621493+00:00 · methodology

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · 5 internal anchors

[1] Daron Acemoglu and Pascual Restrepo. Automation and new tasks: How technology displaces and reinstates labor. Journal of Economic Perspectives, 33(2):3–30, May 2019. doi: 10.1257/jep.33.2.3. URL https://www.aeaweb.org/articles?id=10.1257/jep.33.2.3

[2] David H. Autor, Frank Levy, and Richard J. Murnane. The skill content of recent technological change: An empirical exploration. The Quarterly Journal of Economics, 118(4):1279–1333, 11 2003. ISSN 0033-5533. doi: 10.1162/003355303322552801. URL https://doi.org/10.1162/003355303322552801

[3] Su Lin Blodgett, Gilsinia Lopez, Alexandra Olteanu, Robert Sim, and Hanna Wallach. Stereotyping Norwegian salmon: An inventory of pitfalls in fairness benchmark datasets. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long P... doi: 10.18653/v1/2021.acl-long.81

[4] Rishi Bommasani. Neurips should lead scientific consensus on ai policy. arXiv preprint arXiv:2510.00075, 2025

[5] Pierre Bouquet and Yossi Sheffi. Measuring the intensive margin of work: Task shares and concentration. Research Paper 2026/004, MIT Center for Transportation & Logistics, February 2026. URL https://ssrn.com/abstract=6174538. Available at SSRN

[6] Pierre Bouquet, Yossi Sheffi, and Amin Kaboli. News sentiment as a dynamic predictor of job automation risk. Research Paper 2026/002, MIT Center for Transportation & Logistics, January 2026. Available at SSRN: https://ssrn.com/abstract=6168446

[7] Erik Brynjolfsson, Danielle Li, and Lindsey R. Raymond. Generative ai at work. Quarterly Journal of Economics, 140(2):889–942, 2025. doi: 10.1093/qje/qjae044

[8] Mauro Cazzaniga. Gen-AI. Staff Discussion Notes, 2024(001):1, 1 2024. ISSN 2617-6750. doi: 10.5089/9798400262548.006. URL http://dx.doi.org/10.5089/9798400262548.006

[9] Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. Bge m3-embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation, 2024

[10] Jianlyu Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. M3-embedding: Multi-linguality, multi-functionality, multi-granularity text embeddings through self-knowledge distillation. In Findings of the Association for Computational Linguistics: ACL 2024, 2024. URL https://aclanthology.org/2024.findings-acl.137/

[11] Fabrizio Dell'Acqua, Edward McFowland, Ethan R. Mollick, Hila Lifshitz-Assaf, Katherine Kellogg, Saran Rajendran, Lisa Krayer, François Candelon, and Karim R. Lakhani. Navigating the jagged technological frontier: Field experimental evidence of the effects of AI on knowledge worker productivity and quality. Working Paper 24-013, Harvard Business School, 2...

[12] Tyna Eloundou, Sam Manning, Pamela Mishkin, and Daniel Rock. Gpts are gpts: Labor market impact potential of llms. Science, 384(6702):1306–1308, 2024

[13] Edward Felten, Manav Raj, and Robert Seamans. Occupational, industry, and geographic exposure to artificial intelligence: A novel dataset and its potential uses. Strategic Management Journal, 42(12):2195–2217, 2021. doi: 10.1002/smj.3286. URL https://sms.onlinelibrary.wiley.com/doi/abs/10.1002/smj.3286

[14] Edward W. Felten, Manav Raj, and Robert Seamans. Occupational heterogeneity in exposure to generative AI. Technical report, SSRN, April 2023. URL https://ssrn.com/abstract=4414065. Available at SSRN

[15] Gallup. Frequent use of ai in the workplace continued to rise in q4. https://www.gallup.com/workplace/701195/frequent-workplace-continued-rise.aspx, 2026. Accessed 2026-04-17

[16] Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Qianyu Guo, Meng Wang, and Haofen Wang. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997, 2024. doi: 10.48550/arXiv.2312.10997. URL https://arxiv.org/abs/2312.10997

[17] Gemma Team, Google DeepMind. Gemma 4 model card, 2026. URL https://ai.google.dev/gemma/docs/core/model_card_4. Accessed: April 21, 2026

[18] Pawel Gmyrek, Janine Berg, Karol Kaminski, Filip Konopczyński, Agnieszka Ładna, Balint Nafradi, Konrad Rosłaniec, and Marek Troszyński. Generative AI and jobs: A refined global index of occupational exposure. ILO research brief, International Labour Organization, Geneva, 2025. URL https://www.ilo.org/publications/generative-ai-and-jobs-refined-global...

[19] Abigail Z. Jacobs and Hanna Wallach. Measurement and fairness. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pages 3–10, Virtual Event, Canada, 2021. Association for Computing Machinery. doi: 10.1145/3442188.3445901. URL https://doi.org/10.1145/3442188.3445901

[20] Seungone Kim, Juyoung Suk, Shayne Longpre, Bill Yuchen Lin, Jamin Shin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, and Minjoon Seo. Prometheus 2: An open source language model specialized in evaluating other language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 4334–4353, 2024

[21] Rodney Michael Kinney, Chloe Anastasiades, Russell Authur, Iz Beltagy, Jonathan Bragg, Alexandra Buraczynski, Isabel Cachola, Stefan Candra, Yoganand Chandrasekhar, Arman Cohan, Miles Crawford, Doug Downey, Jason Dunkelberger, Oren Etzioni, Rob Evans, Sergey Feldman, Joseph Gorney, David W. Graham, F.Q. Hu, Regan Huff, Daniel King, Sebastian Kohlmeier, Ba...

[22] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems, 35:22199–22213, 2022

[23] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

[24] Kalev Leetaru and Philip Schrodt. Gdelt: Global data on events, language, and tone, 1979-. In International Studies Association Annual Conference, San Francisco, CA, 2013. URL https://www.gdeltproject.org/

[25] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive nlp tasks, 2021. URL https://arxiv.org/abs/2005.11401

[26] Chaofan Li, Zheng Liu, Shitao Xiao, and Yingxia Shao. Making large language models a better foundation for dense retrieval, 2023

[27] Minghan Li, Xinxuan Lv, Junjie Zou, Tongna Chen, Chao Zhang, Suchao An, Ercong Nie, and Guodong Zhou. Query expansion in the age of pre-trained and large language models: A comprehensive survey. arXiv preprint arXiv:2509.07794, 2025

[28] Rensis Likert. A technique for the measurement of attitudes. Archives of Psychology, 22(140):55, 1932

[29] Alexander H Liu, Kartik Khandelwal, Sandeep Subramanian, Victor Jouault, Abhinav Rastogi, Adrien Sadé, Alan Jeffares, Albert Jiang, Alexandre Cahill, Alexandre Gavaudan, et al. Ministral 3. arXiv preprint arXiv:2601.08584, 2026

[30] Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-eval: NLG evaluation using GPT-4 with better human alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2511–2522, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.153. URL https://aclanthology.org/2023.emnlp-main.153/

[31] Maxim Massenkoff and Peter McCrory. Labor market impacts of ai: A new measure and early evidence. Anthropic Research, 5, 2026

[32] National Center for O*NET Development. O*net 30.2 database. O*NET Resource Center. URL https://www.onetcenter.org/database.html. U.S. Department of Labor, Employment and Training Administration (USDOL/ETA). Accessed 30 April 2026. Licensed under CC BY 4.0

[33] Shakked Noy and Whitney Zhang. Experimental evidence on the productivity effects of generative artificial intelligence. Science, 381(6654):187–192, 2023. doi: 10.1126/science.adh2586

[34] OECD. OECD Employment Outlook 2023: Artificial Intelligence and the Labour Market. OECD Publishing, Paris, 2023. doi: 10.1787/08785bba-en. URL https://doi.org/10.1787/08785bba-en

[35] OpenAI. Introducing GPT-5.4 mini and nano. https://openai.com/index/introducing-gpt-5-4-mini-and-nano/, March 17 2026. Accessed: May 5, 2026

[36] Kurt Shuster, Spencer Poff, Moya Chen, Douwe Kiela, and Jason Weston. Retrieval augmentation reduces hallucination in conversation, 2021. URL https://arxiv.org/abs/2104.07567

[37] The Budget Lab at Yale. Labor market AI exposure: What do we know? Technical report, Yale Budget Lab, February 2026. URL https://budgetlab.yale.edu/research/labor-market-ai-exposure-what-do-we-know

[38] ... URL https://arxiv.org/abs/2507.07935

[39] Michael Webb. The impact of artificial intelligence on the labor market. SSRN Working Paper, November 2019. URL https://ssrn.com/abstract=3482150

[40] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022

[41] World Economic Forum. The future of jobs report 2025. Technical report, World Economic Forum, Geneva, Switzerland, January 2025. URL https://reports.weforum.org/docs/WEF_Future_of_Jobs_Report_2025.pdf

[42] Weikai Xu, Chengrui Huang, Shen Gao, and Shuo Shang. Llm-based agents for tool learning: A survey. Data Science and Engineering, pages 1–31, 2025

[43] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

[44] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and chatbot arena. In Advances in Neural Information Processing Systems, 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/hash/91f18a1287b398d378ef22505bf41832-Abstract-Datasets_and_Benchmarks.html

A Rubric and Model Inputs

This appendix provides supplementary material for the labeling framework, retrieval corpus construction, and evaluation procedures used in the main paper. We first document the exposure rubric ...
... for Gemma, Qwen and Ministral models. The last four columns report theoretical AI exposure (%).

| Occupation | Observed | Gemma | Qwen | Ministral | Eloundou et al. |
|---|---|---|---|---|---|
| Computer Programmers | 74.5% | 88.3% | 78.1% | 88.7% | 95.0% |
| Customer Service Representatives | 70.1% | 50.0% | 53.9% | 87.9% | 56.8% |
| Data Entry Keyers | 67.1% | 51.5% | 39.4% | 48.5% | 89.3% |
| Medical Records Specialists | 66.7% | 47.1% | 58.8% | 58.8% | 61.8% |
| Market Researc... | ... | ... | ... | ... | ... |

[1] [1]

Automation and new tasks: How technology displaces and reinstates labor.Journal of Economic Perspectives, 33(2):3–30, May 2019

Daron Acemoglu and Pascual Restrepo. Automation and new tasks: How technology displaces and reinstates labor.Journal of Economic Perspectives, 33(2):3–30, May 2019. doi: 10.1257/ jep.33.2.3. URLhttps://www.aeaweb.org/articles?id=10.1257/jep.33.2.3

work page doi:10.1257/jep.33.2.3 2019

[2] David H. Autor, Frank Levy, and Richard J. Murnane. The skill content of recent technological change: An empirical exploration. The Quarterly Journal of Economics, 118(4):1279–1333, November 2003. ISSN 0033-5533. doi: 10.1162/003355303322552801. URL https://doi.org/10.1162/003355303322552801

[3] Su Lin Blodgett, Gilsinia Lopez, Alexandra Olteanu, Robert Sim, and Hanna Wallach. Stereotyping Norwegian salmon: An inventory of pitfalls in fairness benchmark datasets. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long P...

[4] Rishi Bommasani. NeurIPS should lead scientific consensus on AI policy. arXiv preprint arXiv:2510.00075, 2025

[5] Pierre Bouquet and Yossi Sheffi. Measuring the intensive margin of work: Task shares and concentration. Research Paper 2026/004, MIT Center for Transportation & Logistics, February 2026. URL https://ssrn.com/abstract=6174538. Available at SSRN

[7] Pierre Bouquet, Yossi Sheffi, and Amin Kaboli. News sentiment as a dynamic predictor of job automation risk. Research Paper 2026/002, MIT Center for Transportation & Logistics, January 2026. Available at SSRN: https://ssrn.com/abstract=6168446

[9] Erik Brynjolfsson, Danielle Li, and Lindsey R. Raymond. Generative AI at work. Quarterly Journal of Economics, 140(2):889–942, 2025. doi: 10.1093/qje/qjae044

[10] Mauro Cazzaniga. Gen-AI. Staff Discussion Notes, 2024(001):1, January 2024. ISSN 2617-6750. doi: 10.5089/9798400262548.006. URL http://dx.doi.org/10.5089/9798400262548.006

[11] Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. BGE M3-embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation, 2024

[12] Jianlyu Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. M3-embedding: Multi-linguality, multi-functionality, multi-granularity text embeddings through self-knowledge distillation. In Findings of the Association for Computational Linguistics: ACL 2024, 2024. URL https://aclanthology.org/2024.findings-acl.137/

[13] Fabrizio Dell’Acqua, Edward McFowland, Ethan R. Mollick, Hila Lifshitz-Assaf, Katherine Kellogg, Saran Rajendran, Lisa Krayer, François Candelon, and Karim R. Lakhani. Navigating the jagged technological frontier: Field experimental evidence of the effects of AI on knowledge worker productivity and quality. Working Paper 24-013, Harvard Business School, 2...

[14] Tyna Eloundou, Sam Manning, Pamela Mishkin, and Daniel Rock. GPTs are GPTs: Labor market impact potential of LLMs. Science, 384(6702):1306–1308, 2024

[15] Edward Felten, Manav Raj, and Robert Seamans. Occupational, industry, and geographic exposure to artificial intelligence: A novel dataset and its potential uses. Strategic Management Journal, 42(12):2195–2217, 2021. doi: 10.1002/smj.3286. URL https://sms.onlinelibrary.wiley.com/doi/abs/10.1002/smj.3286

[16] Edward W. Felten, Manav Raj, and Robert Seamans. Occupational heterogeneity in exposure to generative AI. Technical report, SSRN, April 2023. URL https://ssrn.com/abstract=4414065. Available at SSRN

[17] Gallup. Frequent use of AI in the workplace continued to rise in Q4. https://www.gallup.com/workplace/701195/frequent-workplace-continued-rise.aspx, 2026. Accessed 2026-04-17

[18] Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Qianyu Guo, Meng Wang, and Haofen Wang. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997, 2024. doi: 10.48550/arXiv.2312.10997. URL https://arxiv.org/abs/2312.10997

[19] Gemma Team, Google DeepMind. Gemma 4 model card, 2026. URL https://ai.google.dev/gemma/docs/core/model_card_4. Accessed: April 21, 2026

[20] Pawel Gmyrek, Janine Berg, Karol Kaminski, Filip Konopczyński, Agnieszka Ładna, Balint Nafradi, Konrad Rosłaniec, and Marek Troszyński. Generative AI and jobs: A refined global index of occupational exposure. ILO research brief, International Labour Organization, Geneva, 2025. URL https://www.ilo.org/publications/generative-ai-and-jobs-refined-global...

[21] Abigail Z. Jacobs and Hanna Wallach. Measurement and fairness. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pages 3–10, Virtual Event, Canada, 2021. Association for Computing Machinery. doi: 10.1145/3442188.3445901. URL https://doi.org/10.1145/3442188.3445901

[22] Seungone Kim, Juyoung Suk, Shayne Longpre, Bill Yuchen Lin, Jamin Shin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, and Minjoon Seo. Prometheus 2: An open source language model specialized in evaluating other language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 4334–4353, 2024.

[23] Rodney Michael Kinney, Chloe Anastasiades, Russell Authur, Iz Beltagy, Jonathan Bragg, Alexandra Buraczynski, Isabel Cachola, Stefan Candra, Yoganand Chandrasekhar, Arman Cohan, Miles Crawford, Doug Downey, Jason Dunkelberger, Oren Etzioni, Rob Evans, Sergey Feldman, Joseph Gorney, David W. Graham, F.Q. Hu, Regan Huff, Daniel King, Sebastian Kohlmeier, Ba...

[24] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems, 35:22199–22213, 2022

[25] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

[26] Kalev Leetaru and Philip Schrodt. GDELT: Global data on events, language, and tone, 1979-2012. In International Studies Association Annual Conference, San Francisco, CA, 2013. URL https://www.gdeltproject.org/

[28] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks, 2021. URL https://arxiv.org/abs/2005.11401

[29] Chaofan Li, Zheng Liu, Shitao Xiao, and Yingxia Shao. Making large language models a better foundation for dense retrieval, 2023

[30] Minghan Li, Xinxuan Lv, Junjie Zou, Tongna Chen, Chao Zhang, Suchao An, Ercong Nie, and Guodong Zhou. Query expansion in the age of pre-trained and large language models: A comprehensive survey. arXiv preprint arXiv:2509.07794, 2025

[31] Rensis Likert. A technique for the measurement of attitudes. Archives of Psychology, 22(140):55, 1932

[32] Alexander H Liu, Kartik Khandelwal, Sandeep Subramanian, Victor Jouault, Abhinav Rastogi, Adrien Sadé, Alan Jeffares, Albert Jiang, Alexandre Cahill, Alexandre Gavaudan, et al. Ministral 3. arXiv preprint arXiv:2601.08584, 2026

[33] Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-eval: NLG evaluation using GPT-4 with better human alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2511–2522, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.153. URL https://aclanthology.org/2023.emnlp-main.153/

[35] Maxim Massenkoff and Peter McCrory. Labor market impacts of AI: A new measure and early evidence. Anthropic Research, 5, 2026

[36] National Center for O*NET Development. O*NET 30.2 database. O*NET Resource Center. URL https://www.onetcenter.org/database.html. U.S. Department of Labor, Employment and Training Administration (USDOL/ETA). Accessed 30 April 2026. Licensed under CC BY 4.0

[38] Shakked Noy and Whitney Zhang. Experimental evidence on the productivity effects of generative artificial intelligence. Science, 381(6654):187–192, 2023. doi: 10.1126/science.adh2586.

[39] OECD. OECD Employment Outlook 2023: Artificial Intelligence and the Labour Market. OECD Publishing, Paris, 2023. doi: 10.1787/08785bba-en. URL https://doi.org/10.1787/08785bba-en

[40] OpenAI. Introducing GPT-5.4 mini and nano. https://openai.com/index/introducing-gpt-5-4-mini-and-nano/, March 17, 2026. Accessed: May 5, 2026

[41] Kurt Shuster, Spencer Poff, Moya Chen, Douwe Kiela, and Jason Weston. Retrieval augmentation reduces hallucination in conversation, 2021. URL https://arxiv.org/abs/2104.07567

[42] The Budget Lab at Yale. Labor market AI exposure: What do we know? Technical report, Yale Budget Lab, February 2026. URL https://budgetlab.yale.edu/research/labor-market-ai-exposure-what-do-we-know

[44] URL https://arxiv.org/abs/2507.07935

[45] Michael Webb. The impact of artificial intelligence on the labor market. SSRN Working Paper, November 2019. URL https://ssrn.com/abstract=3482150

[46] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022

[47] World Economic Forum. The future of jobs report 2025. Technical report, World Economic Forum, Geneva, Switzerland, January 2025. URL https://reports.weforum.org/docs/WEF_Future_of_Jobs_Report_2025.pdf

[48] Weikai Xu, Chengrui Huang, Shen Gao, and Shuo Shang. LLM-based agents for tool learning: A survey. Data Science and Engineering, pages 1–31, 2025

[49] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

[50] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and chatbot arena. In Advances in Neural Information Processing Systems, 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/hash/91f18a1287b398d378ef22505bf41832-Abstract-Datasets_and_Benchmarks.html

A Rubric and Model Inputs

This appendix provides supplementary material for the labeling framework, retrieval corpus construction, and evaluation procedures used in the main paper. We first document the exposure rubric ...

for Gemma, Qwen and Ministral models.

Occupation                         Observed  Theoretical AI exposure (%)
                                             Gemma   Qwen    Ministral  Eloundou et al.
Computer Programmers               74.5%     88.3%   78.1%   88.7%      95.0%
Customer Service Representatives   70.1%     50.0%   53.9%   87.9%      56.8%
Data Entry Keyers                  67.1%     51.5%   39.4%   48.5%      89.3%
Medical Records Specialists        66.7%     47.1%   58.8%   58.8%      61.8%
Market Researc...