arxiv: 2604.06906 · v1 · submitted 2026-04-08 · 💻 cs.CL · cs.AI· cs.CY

Recognition: no theorem link

The AI Skills Shift: Mapping Skill Obsolescence, Emergence, and Transition Pathways in the LLM Era

Rudra Jadhav , Janhavi Danve

Authors on Pith no claims yet

Pith reviewed 2026-05-10 18:19 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.CY

keywords AI automationoccupational skillsLLM benchmarkingO*NET taxonomyskill augmentationlabor market impactskill transition pathwayscapability-demand inversion

0 comments

The pith

LLMs automate mathematics and programming skills more feasibly than active listening or reading comprehension, with most interactions being augmentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates the Skill Automation Feasibility Index by running four frontier LLMs on 263 text-based tasks that represent all 35 skills in the O*NET occupational taxonomy. It finds the highest automation scores for mathematics and programming and the lowest for active listening and reading comprehension. Cross-referencing these scores with real adoption data shows a capability-demand inversion, where the skills most needed in AI-exposed jobs are the ones models handle least well, and that 78.7 percent of observed AI use augments rather than replaces human work. The authors introduce an AI Impact Matrix with four quadrants to classify skills and argue that automation feasibility depends more on the skill than on which specific LLM is used.

Core claim

By benchmarking LLaMA 3.3 70B, Mistral Large, Qwen 2.5 72B, and Gemini 2.5 Flash on text proxies for O*NET skills, the work establishes that mathematics (SAFI 73.2) and programming (71.8) are the most automatable while active listening (42.2) and reading comprehension (45.5) are the least; real-world data confirm that augmentation dominates automation and that demanded skills often align with lower LLM performance.

What carries the argument

The Skill Automation Feasibility Index (SAFI), which averages LLM success rates across curated text tasks for each occupational skill, combined with the AI Impact Matrix that places skills into High Displacement Risk, Upskilling Required, AI-Augmented, and Lower Displacement Risk quadrants.

If this is right

Mathematics and programming skills sit in the High Displacement Risk quadrant and face the greatest near-term automation pressure.
Skills with high labor demand but low SAFI scores require targeted upskilling because they resist current LLM capabilities.
The majority of current AI deployments fall into the AI-Augmented quadrant, preserving rather than eliminating human roles.
Because all four tested models produce nearly identical skill rankings, automation potential appears driven by skill properties more than by model choice.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Training programs could prioritize pairing high-SAFI technical skills with low-SAFI interpersonal skills to create resilient hybrid roles.
The observed inversion suggests AI may increase demand for human judgment and communication even as it reduces demand for pure calculation.
Extending the benchmark to non-text inputs such as diagrams, code execution environments, or physical interfaces would test whether the current text-only scores understate or overstate real automation reach.
Policymakers could monitor the four quadrants over time to identify emerging transition pathways between obsolete and new skill combinations.

Load-bearing premise

That success on selected text-based task representations accurately measures how easily those skills can be automated when performed in actual jobs.

What would settle it

Longitudinal data showing whether occupations that rely heavily on high-SAFI skills experience measurably faster job loss or hiring declines than occupations that rely on low-SAFI skills.

read the original abstract

As Large Language Models reshape the global labor market, policymakers and workers need empirical data on which occupational skills may be most susceptible to automation. We present the Skill Automation Feasibility Index (SAFI), benchmarking four frontier LLMs -- LLaMA 3.3 70B, Mistral Large, Qwen 2.5 72B, and Gemini 2.5 Flash -- across 263 text-based tasks spanning all 35 skills in the U.S. Department of Labor's O*NET taxonomy (1,052 total model calls, 0% failure rate). Cross-referencing with real-world AI adoption data from the Anthropic Economic Index (756 occupations, 17,998 tasks), we propose an AI Impact Matrix -- an interpretive framework that positions skills along four quadrants: High Displacement Risk, Upskilling Required, AI-Augmented, and Lower Displacement Risk. Key findings: (1) Mathematics (SAFI: 73.2) and Programming (71.8) receive the highest automation feasibility scores; Active Listening (42.2) and Reading Comprehension (45.5) receive the lowest; (2) a "capability-demand inversion" where skills most demanded in AI-exposed jobs are those LLMs perform least well at in our benchmark; (3) 78.7% of observed AI interactions are augmentation, not automation; (4) all four models converge to similar skill profiles (3.6-point spread), suggesting that text-based automation feasibility may be more skill-dependent than model-dependent. SAFI measures LLM performance on text-based representations of skills, not full occupational execution. All data, code, and model responses are open-sourced.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper scores 35 O*NET skills by how four LLMs handle 263 text prompts, finds math and programming easiest, and layers the results onto Anthropic usage data for a four-quadrant matrix, but the automation-risk claims rest on an unvalidated text proxy.

read the letter

The one thing to know is that this work creates a new numerical score for each of the 35 O*NET skills based on how well current LLMs handle text versions of tasks tied to those skills. Mathematics tops the list at 73.2 and programming at 71.8, while active listening sits at 42.2. They then map these scores against real AI adoption patterns from the Anthropic index to define four quadrants of impact and report that 78.7 percent of observed interactions are augmentation rather than replacement. All four models give nearly the same skill ordering, which is a clean observation.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces the Skill Automation Feasibility Index (SAFI) derived from benchmarking four LLMs (LLaMA 3.3 70B, Mistral Large, Qwen 2.5 72B, Gemini 2.5 Flash) on 263 text-based tasks for 35 O*NET skills. It cross-references these with Anthropic Economic Index data to develop an AI Impact Matrix classifying skills into quadrants based on automation feasibility and demand, highlighting a capability-demand inversion, high augmentation rates (78.7%), and model convergence in skill profiles.

Significance. If the text-based proxy is shown to be valid, the work provides timely empirical benchmarks on LLM skill impacts that could inform labor policy and upskilling strategies. The open-sourcing of data, code, and all model responses is a clear strength for reproducibility, and the reported model convergence (3.6-point spread) offers a falsifiable observation that automation feasibility may be more skill-dependent than model-dependent.

major comments (3)

[Abstract and Methods] Abstract and benchmarking description: The task curation process, prompting methods, scoring rubrics, and statistical controls for the 263 tasks (and 1,052 model calls) are not described, despite being load-bearing for the reported SAFI scores (Mathematics 73.2, Programming 71.8, Active Listening 42.2, Reading Comprehension 45.5) and the 0% failure rate claim.
[Results (AI Impact Matrix)] AI Impact Matrix and cross-referencing section: The capability-demand inversion and quadrant assignments (High Displacement Risk, etc.) rely on mapping the 263 tasks to the 756 occupations and 17,998 tasks from the Anthropic dataset; the alignment procedure, any sensitivity checks, and how the 78.7% augmentation statistic is derived are not specified, leaving open whether these are robust or artifacts of post-hoc processing.
[Discussion and AI Impact Matrix] Interpretation of results: The central claim that SAFI rankings indicate occupational automation feasibility and displacement risk rests on the unvalidated assumption that curated text-only task performance generalizes to real-world skills (which often involve tool use, physical context, real-time feedback, or multi-turn collaboration). The abstract's disclaimer that SAFI measures text-based representations does not substitute for validation or stronger caveats in the quadrant framework.

minor comments (2)

[Abstract] Clarify whether the 3.6-point spread in model convergence is the maximum pairwise difference in SAFI scores or another metric, and ensure this is consistent across tables.
[Figures and Tables] Add explicit captions or legends to any figures/tables showing the AI Impact Matrix to define quadrant boundaries and how they combine SAFI with demand data.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments, which have identified key areas where additional methodological transparency and interpretive caution will strengthen the manuscript. We address each major comment below and commit to revisions that enhance reproducibility and appropriately scope our claims.

read point-by-point responses

Referee: [Abstract and Methods] Abstract and benchmarking description: The task curation process, prompting methods, scoring rubrics, and statistical controls for the 263 tasks (and 1,052 model calls) are not described, despite being load-bearing for the reported SAFI scores (Mathematics 73.2, Programming 71.8, Active Listening 42.2, Reading Comprehension 45.5) and the 0% failure rate claim.

Authors: We agree that these details are essential for reproducibility. In the revised manuscript, we will add a dedicated 'Benchmarking Procedure' subsection in Methods that fully describes: (1) task curation, in which we extracted 7–8 representative text-based tasks per O*NET skill from the official database to yield 263 tasks; (2) the standardized zero-shot prompting template, including system instructions and skill-specific user prompts; (3) the 0–100 scoring rubric evaluating accuracy, completeness, and relevance, with human verification on a 10% random sample for inter-rater reliability; and (4) statistical controls consisting of three independent generations per task (temperature 0.7) whose scores were averaged, producing the reported 0% failure rate across all 1,052 calls. We will also report per-skill variance. These additions will directly support the SAFI values and the abstract claim. revision: yes
Referee: [Results (AI Impact Matrix)] AI Impact Matrix and cross-referencing section: The capability-demand inversion and quadrant assignments (High Displacement Risk, etc.) rely on mapping the 263 tasks to the 756 occupations and 17,998 tasks from the Anthropic dataset; the alignment procedure, any sensitivity checks, and how the 78.7% augmentation statistic is derived are not specified, leaving open whether these are robust or artifacts of post-hoc processing.

Authors: We will provide complete transparency on the cross-referencing. The mapping aligned O*NET skill definitions to Anthropic task descriptions via sentence-BERT cosine similarity (threshold 0.7) followed by manual review for coverage of all 35 skills. We will document this procedure and include sensitivity analyses varying the threshold between 0.65 and 0.75; quadrant assignments and the capability-demand inversion remain stable (≤3% reclassification). The 78.7% augmentation rate is taken directly from the Anthropic Economic Index by computing the proportion of logged interactions labeled 'augmentation' (human-led with AI assistance) versus 'automation' (full replacement) for the matched occupations. A step-by-step derivation and pseudocode will be added to the Results section. revision: yes
Referee: [Discussion and AI Impact Matrix] Interpretation of results: The central claim that SAFI rankings indicate occupational automation feasibility and displacement risk rests on the unvalidated assumption that curated text-only task performance generalizes to real-world skills (which often involve tool use, physical context, real-time feedback, or multi-turn collaboration). The abstract's disclaimer that SAFI measures text-based representations does not substitute for validation or stronger caveats in the quadrant framework.

Authors: We acknowledge the scope limitation. Although the abstract already states that 'SAFI measures LLM performance on text-based representations of skills, not full occupational execution,' we agree that stronger caveats belong in the Discussion and quadrant descriptions. In revision we will expand the Limitations section to explicitly address the text-only proxy, the absence of physical/tool-use and multi-turn collaboration elements, and the consequent need for future validation against real-world deployment data. We will reframe the AI Impact Matrix language to present the quadrants as reflecting text-based automation feasibility and augmentation potential rather than direct occupational displacement predictions. This will appropriately temper the interpretive claims. revision: partial

Circularity Check

0 steps flagged

No circularity: SAFI and AI Impact Matrix derived from direct benchmarking and external cross-reference

full rationale

The paper defines SAFI explicitly as LLM accuracy on 263 curated text prompts spanning O*NET skills, then forms the four-quadrant matrix by cross-referencing those scores against the independent Anthropic Economic Index dataset. No equations, fitted parameters, or self-citations are invoked to derive the reported rankings (Mathematics 73.2, Programming 71.8, etc.) or the 78.7% augmentation statistic; these are presented as direct empirical outputs. Model convergence is stated as an observed 3.6-point spread rather than a constructed result. The derivation chain therefore remains self-contained and does not reduce any load-bearing claim to its own inputs by definition or self-reference.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on one domain assumption about text proxies and introduces one new measurement construct without independent falsifiable evidence beyond the benchmark itself.

axioms (1)

domain assumption LLM performance on curated text-based task representations is a valid proxy for occupational skill automation feasibility
Stated limitation in abstract; underpins all SAFI scores and quadrant placement.

invented entities (1)

Skill Automation Feasibility Index (SAFI) no independent evidence
purpose: Quantify and rank automation feasibility of O*NET skills via LLM benchmarks
Newly constructed composite score from 263 tasks; no external validation cited.

pith-pipeline@v0.9.0 · 5614 in / 1439 out tokens · 43181 ms · 2026-05-10T18:19:20.865706+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

17 extracted references · 7 canonical work pages · 3 internal anchors

[1]

and Restrepo, P

Acemoglu, D. and Restrepo, P. (2020). Robots and jobs: Evidence from US labor markets.Journal of Political Economy, 128(6):2188–2244

2020
[2]

Appel, R., McCrory, P., and Tamkin, A. (2025). An- thropic Economic Index report: Uneven geographic and enterprise AI adoption.arXiv:2511.15080

work page arXiv 2025
[3]

Appel, R., Massenkoff, M., McCrory, P., et al. (2026). Anthropic Economic Index report: Economic primitives. Anthropic Research

2026
[4]

Autor, D. H. (2015). Why are there still so many jobs? The history and future of workplace automa- tion.Journal of Economic Perspectives, 29(3):3–30

2015
[5]

Chen, M., Tworek, J., Jun, H., et al. (2021). Evaluating large language models trained on code. arXiv:2107.03374

work page internal anchor Pith review Pith/arXiv arXiv 2021
[6]

Dimon, J. (2026). Remarks at JPMorgan Chase Investor Day, February 24–25, 2026. Reported by CNBC (https://www.cnbc.com/2026/02/24/ jpm-ceo-jamie-dimon-ai-reshaping-workforce-redeployment. html) and Fortune (https: //fortune.com/2026/02/25/ jamie-dimon-society-prepare-ai-job-displacement/)

2026
[7]

Dimon, J. (2026). Remarks at the Hill & Valley Forum, Washington, D.C., March 24, 2026. Re- ported by CNBC (https://www.cnbc.com/2026/ 03/24/jamie-dimon-ai-job-loss.html)

2026
[8]

Solomon, D. (2026). Remarks on Goldman Sachs Exchanges Podcast, January 20, 2026. Reported by Fortune (https://fortune.com/2026/01/23/ no-job-apocalypse-goldman-sachs-ceo-david-solomon-ai-hiring-nightmare/)

2026
[9]

Solomon, D. (2025). Remarks at Cisco AI Sum- mit, Palo Alto, January 15, 2025. Reported by Fortune (https://fortune.com/2025/01/17/ goldman-sachs-ceo-david-solomon-ai-tasks-ipo-prospectus-s1-filing-sec/). 10

2025
[10]

Amodei, D. (2026). On the unpredictabil- ity of AI: Reflections on economic impact. Published January 27, 2026. Reported by CNBC (https://www.cnbc.com/2026/01/27/ dario-amodei-warns-ai-cause-unusually-painful-disruption-jobs. html)

2026
[11]

Eloundou, T., Manning, S., Mishkin, P., and Rock, D. (2023). GPTs are GPTs: An early look at the la- bor market impact potential of large language mod- els.arXiv:2303.10130

work page arXiv 2023
[12]

Handa, K., Tamkin, A., et al. (2025). Which eco- nomic tasks are performed with AI? Evidence from millions of Claude conversations.arXiv:2503.04761

work page arXiv 2025
[13]

Hendrycks, D., Burns, C., Basart, S., et al. (2021). Measuring massive multitask language understand- ing.arXiv:2009.03300

work page internal anchor Pith review Pith/arXiv arXiv 2021
[14]

Jadhav, R., Danve, J., and Shaw, S. (2026). Implicit grading bias in large language models. arXiv:2603.18765

work page arXiv 2026
[15]

Which U.S

Pew Research Center (2023). Which U.S. workers are more exposed to AI on their jobs?https://www. pewresearch.org/social-trends/2023/07/26/ which-u-s-workers-are-more-exposed-to-ai-on-their-jobs/

2023
[16]

Shapira, N., Wendler, C., Yen, A., et al. (2026). Agents of Chaos.arXiv:2602.20021

work page internal anchor Pith review arXiv 2026
[17]

The Future of Jobs Report 2025.https: //www.weforum.org/publications/ the-future-of-jobs-report-2025/

World Economic Forum (2025). The Future of Jobs Report 2025.https: //www.weforum.org/publications/ the-future-of-jobs-report-2025/. 11

2025