Recognition: no theorem link
The AI Skills Shift: Mapping Skill Obsolescence, Emergence, and Transition Pathways in the LLM Era
Pith reviewed 2026-05-10 18:19 UTC · model grok-4.3
The pith
LLMs automate mathematics and programming skills more feasibly than active listening or reading comprehension, with most interactions being augmentation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By benchmarking LLaMA 3.3 70B, Mistral Large, Qwen 2.5 72B, and Gemini 2.5 Flash on text proxies for O*NET skills, the work establishes that mathematics (SAFI 73.2) and programming (71.8) are the most automatable while active listening (42.2) and reading comprehension (45.5) are the least; real-world data confirm that augmentation dominates automation and that demanded skills often align with lower LLM performance.
What carries the argument
The Skill Automation Feasibility Index (SAFI), which averages LLM success rates across curated text tasks for each occupational skill, combined with the AI Impact Matrix that places skills into High Displacement Risk, Upskilling Required, AI-Augmented, and Lower Displacement Risk quadrants.
If this is right
- Mathematics and programming skills sit in the High Displacement Risk quadrant and face the greatest near-term automation pressure.
- Skills with high labor demand but low SAFI scores require targeted upskilling because they resist current LLM capabilities.
- The majority of current AI deployments fall into the AI-Augmented quadrant, preserving rather than eliminating human roles.
- Because all four tested models produce nearly identical skill rankings, automation potential appears driven by skill properties more than by model choice.
Where Pith is reading between the lines
- Training programs could prioritize pairing high-SAFI technical skills with low-SAFI interpersonal skills to create resilient hybrid roles.
- The observed inversion suggests AI may increase demand for human judgment and communication even as it reduces demand for pure calculation.
- Extending the benchmark to non-text inputs such as diagrams, code execution environments, or physical interfaces would test whether the current text-only scores understate or overstate real automation reach.
- Policymakers could monitor the four quadrants over time to identify emerging transition pathways between obsolete and new skill combinations.
Load-bearing premise
That success on selected text-based task representations accurately measures how easily those skills can be automated when performed in actual jobs.
What would settle it
Longitudinal data showing whether occupations that rely heavily on high-SAFI skills experience measurably faster job loss or hiring declines than occupations that rely on low-SAFI skills.
read the original abstract
As Large Language Models reshape the global labor market, policymakers and workers need empirical data on which occupational skills may be most susceptible to automation. We present the Skill Automation Feasibility Index (SAFI), benchmarking four frontier LLMs -- LLaMA 3.3 70B, Mistral Large, Qwen 2.5 72B, and Gemini 2.5 Flash -- across 263 text-based tasks spanning all 35 skills in the U.S. Department of Labor's O*NET taxonomy (1,052 total model calls, 0% failure rate). Cross-referencing with real-world AI adoption data from the Anthropic Economic Index (756 occupations, 17,998 tasks), we propose an AI Impact Matrix -- an interpretive framework that positions skills along four quadrants: High Displacement Risk, Upskilling Required, AI-Augmented, and Lower Displacement Risk. Key findings: (1) Mathematics (SAFI: 73.2) and Programming (71.8) receive the highest automation feasibility scores; Active Listening (42.2) and Reading Comprehension (45.5) receive the lowest; (2) a "capability-demand inversion" where skills most demanded in AI-exposed jobs are those LLMs perform least well at in our benchmark; (3) 78.7% of observed AI interactions are augmentation, not automation; (4) all four models converge to similar skill profiles (3.6-point spread), suggesting that text-based automation feasibility may be more skill-dependent than model-dependent. SAFI measures LLM performance on text-based representations of skills, not full occupational execution. All data, code, and model responses are open-sourced.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the Skill Automation Feasibility Index (SAFI) derived from benchmarking four LLMs (LLaMA 3.3 70B, Mistral Large, Qwen 2.5 72B, Gemini 2.5 Flash) on 263 text-based tasks for 35 O*NET skills. It cross-references these with Anthropic Economic Index data to develop an AI Impact Matrix classifying skills into quadrants based on automation feasibility and demand, highlighting a capability-demand inversion, high augmentation rates (78.7%), and model convergence in skill profiles.
Significance. If the text-based proxy is shown to be valid, the work provides timely empirical benchmarks on LLM skill impacts that could inform labor policy and upskilling strategies. The open-sourcing of data, code, and all model responses is a clear strength for reproducibility, and the reported model convergence (3.6-point spread) offers a falsifiable observation that automation feasibility may be more skill-dependent than model-dependent.
major comments (3)
- [Abstract and Methods] Abstract and benchmarking description: The task curation process, prompting methods, scoring rubrics, and statistical controls for the 263 tasks (and 1,052 model calls) are not described, despite being load-bearing for the reported SAFI scores (Mathematics 73.2, Programming 71.8, Active Listening 42.2, Reading Comprehension 45.5) and the 0% failure rate claim.
- [Results (AI Impact Matrix)] AI Impact Matrix and cross-referencing section: The capability-demand inversion and quadrant assignments (High Displacement Risk, etc.) rely on mapping the 263 tasks to the 756 occupations and 17,998 tasks from the Anthropic dataset; the alignment procedure, any sensitivity checks, and how the 78.7% augmentation statistic is derived are not specified, leaving open whether these are robust or artifacts of post-hoc processing.
- [Discussion and AI Impact Matrix] Interpretation of results: The central claim that SAFI rankings indicate occupational automation feasibility and displacement risk rests on the unvalidated assumption that curated text-only task performance generalizes to real-world skills (which often involve tool use, physical context, real-time feedback, or multi-turn collaboration). The abstract's disclaimer that SAFI measures text-based representations does not substitute for validation or stronger caveats in the quadrant framework.
minor comments (2)
- [Abstract] Clarify whether the 3.6-point spread in model convergence is the maximum pairwise difference in SAFI scores or another metric, and ensure this is consistent across tables.
- [Figures and Tables] Add explicit captions or legends to any figures/tables showing the AI Impact Matrix to define quadrant boundaries and how they combine SAFI with demand data.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments, which have identified key areas where additional methodological transparency and interpretive caution will strengthen the manuscript. We address each major comment below and commit to revisions that enhance reproducibility and appropriately scope our claims.
read point-by-point responses
-
Referee: [Abstract and Methods] Abstract and benchmarking description: The task curation process, prompting methods, scoring rubrics, and statistical controls for the 263 tasks (and 1,052 model calls) are not described, despite being load-bearing for the reported SAFI scores (Mathematics 73.2, Programming 71.8, Active Listening 42.2, Reading Comprehension 45.5) and the 0% failure rate claim.
Authors: We agree that these details are essential for reproducibility. In the revised manuscript, we will add a dedicated 'Benchmarking Procedure' subsection in Methods that fully describes: (1) task curation, in which we extracted 7–8 representative text-based tasks per O*NET skill from the official database to yield 263 tasks; (2) the standardized zero-shot prompting template, including system instructions and skill-specific user prompts; (3) the 0–100 scoring rubric evaluating accuracy, completeness, and relevance, with human verification on a 10% random sample for inter-rater reliability; and (4) statistical controls consisting of three independent generations per task (temperature 0.7) whose scores were averaged, producing the reported 0% failure rate across all 1,052 calls. We will also report per-skill variance. These additions will directly support the SAFI values and the abstract claim. revision: yes
-
Referee: [Results (AI Impact Matrix)] AI Impact Matrix and cross-referencing section: The capability-demand inversion and quadrant assignments (High Displacement Risk, etc.) rely on mapping the 263 tasks to the 756 occupations and 17,998 tasks from the Anthropic dataset; the alignment procedure, any sensitivity checks, and how the 78.7% augmentation statistic is derived are not specified, leaving open whether these are robust or artifacts of post-hoc processing.
Authors: We will provide complete transparency on the cross-referencing. The mapping aligned O*NET skill definitions to Anthropic task descriptions via sentence-BERT cosine similarity (threshold 0.7) followed by manual review for coverage of all 35 skills. We will document this procedure and include sensitivity analyses varying the threshold between 0.65 and 0.75; quadrant assignments and the capability-demand inversion remain stable (≤3% reclassification). The 78.7% augmentation rate is taken directly from the Anthropic Economic Index by computing the proportion of logged interactions labeled 'augmentation' (human-led with AI assistance) versus 'automation' (full replacement) for the matched occupations. A step-by-step derivation and pseudocode will be added to the Results section. revision: yes
-
Referee: [Discussion and AI Impact Matrix] Interpretation of results: The central claim that SAFI rankings indicate occupational automation feasibility and displacement risk rests on the unvalidated assumption that curated text-only task performance generalizes to real-world skills (which often involve tool use, physical context, real-time feedback, or multi-turn collaboration). The abstract's disclaimer that SAFI measures text-based representations does not substitute for validation or stronger caveats in the quadrant framework.
Authors: We acknowledge the scope limitation. Although the abstract already states that 'SAFI measures LLM performance on text-based representations of skills, not full occupational execution,' we agree that stronger caveats belong in the Discussion and quadrant descriptions. In revision we will expand the Limitations section to explicitly address the text-only proxy, the absence of physical/tool-use and multi-turn collaboration elements, and the consequent need for future validation against real-world deployment data. We will reframe the AI Impact Matrix language to present the quadrants as reflecting text-based automation feasibility and augmentation potential rather than direct occupational displacement predictions. This will appropriately temper the interpretive claims. revision: partial
Circularity Check
No circularity: SAFI and AI Impact Matrix derived from direct benchmarking and external cross-reference
full rationale
The paper defines SAFI explicitly as LLM accuracy on 263 curated text prompts spanning O*NET skills, then forms the four-quadrant matrix by cross-referencing those scores against the independent Anthropic Economic Index dataset. No equations, fitted parameters, or self-citations are invoked to derive the reported rankings (Mathematics 73.2, Programming 71.8, etc.) or the 78.7% augmentation statistic; these are presented as direct empirical outputs. Model convergence is stated as an observed 3.6-point spread rather than a constructed result. The derivation chain therefore remains self-contained and does not reduce any load-bearing claim to its own inputs by definition or self-reference.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption LLM performance on curated text-based task representations is a valid proxy for occupational skill automation feasibility
invented entities (1)
-
Skill Automation Feasibility Index (SAFI)
no independent evidence
Reference graph
Works this paper leans on
-
[1]
and Restrepo, P
Acemoglu, D. and Restrepo, P. (2020). Robots and jobs: Evidence from US labor markets.Journal of Political Economy, 128(6):2188–2244
2020
- [2]
-
[3]
Appel, R., Massenkoff, M., McCrory, P., et al. (2026). Anthropic Economic Index report: Economic primitives. Anthropic Research
2026
-
[4]
Autor, D. H. (2015). Why are there still so many jobs? The history and future of workplace automa- tion.Journal of Economic Perspectives, 29(3):3–30
2015
-
[5]
Chen, M., Tworek, J., Jun, H., et al. (2021). Evaluating large language models trained on code. arXiv:2107.03374
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[6]
Dimon, J. (2026). Remarks at JPMorgan Chase Investor Day, February 24–25, 2026. Reported by CNBC (https://www.cnbc.com/2026/02/24/ jpm-ceo-jamie-dimon-ai-reshaping-workforce-redeployment. html) and Fortune (https: //fortune.com/2026/02/25/ jamie-dimon-society-prepare-ai-job-displacement/)
2026
-
[7]
Dimon, J. (2026). Remarks at the Hill & Valley Forum, Washington, D.C., March 24, 2026. Re- ported by CNBC (https://www.cnbc.com/2026/ 03/24/jamie-dimon-ai-job-loss.html)
2026
-
[8]
Solomon, D. (2026). Remarks on Goldman Sachs Exchanges Podcast, January 20, 2026. Reported by Fortune (https://fortune.com/2026/01/23/ no-job-apocalypse-goldman-sachs-ceo-david-solomon-ai-hiring-nightmare/)
2026
-
[9]
Solomon, D. (2025). Remarks at Cisco AI Sum- mit, Palo Alto, January 15, 2025. Reported by Fortune (https://fortune.com/2025/01/17/ goldman-sachs-ceo-david-solomon-ai-tasks-ipo-prospectus-s1-filing-sec/). 10
2025
-
[10]
Amodei, D. (2026). On the unpredictabil- ity of AI: Reflections on economic impact. Published January 27, 2026. Reported by CNBC (https://www.cnbc.com/2026/01/27/ dario-amodei-warns-ai-cause-unusually-painful-disruption-jobs. html)
2026
- [11]
- [12]
-
[13]
Hendrycks, D., Burns, C., Basart, S., et al. (2021). Measuring massive multitask language understand- ing.arXiv:2009.03300
work page internal anchor Pith review Pith/arXiv arXiv 2021
- [14]
-
[15]
Which U.S
Pew Research Center (2023). Which U.S. workers are more exposed to AI on their jobs?https://www. pewresearch.org/social-trends/2023/07/26/ which-u-s-workers-are-more-exposed-to-ai-on-their-jobs/
2023
-
[16]
Shapira, N., Wendler, C., Yen, A., et al. (2026). Agents of Chaos.arXiv:2602.20021
work page internal anchor Pith review arXiv 2026
-
[17]
The Future of Jobs Report 2025.https: //www.weforum.org/publications/ the-future-of-jobs-report-2025/
World Economic Forum (2025). The Future of Jobs Report 2025.https: //www.weforum.org/publications/ the-future-of-jobs-report-2025/. 11
2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.