Measuring What Cannot Be Surveyed: LLMs as Instruments for Latent Cognitive Variables in Labor Economics

Cristian Espinal Maya

arXiv: 2604.02403 · v1 · submitted 2026-04-02 · econ.EM · cs.CL · stat.ME

Pith reviewed 2026-05-13 20:27 UTC · model grok-4.3

keywords: large language models · instrumental variables · latent cognitive variables · labor economics · human capital index · AI exposure · O*NET tasks · measurement error

The pith

Large language models can serve as valid instruments for latent cognitive variables in occupational tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper shows how large language models can quantify the cognitive content of job tasks in ways that traditional surveys cannot reach. It sets out four conditions under which LLM-generated scores qualify as valid instruments: semantic exogeneity, construct relevance, monotonicity, and model invariance. The author scores nearly 19,000 O*NET task statements with Claude Haiku 4.5 to build the Augmented Human Capital Index, then validates it against existing AI exposure measures. The index separates augmentation from substitution effects and, when used in instrumental variable regressions, produces effect sizes 25 percent larger than ordinary least squares, consistent with correcting measurement error. The same logic applies to measuring semantic content at scale in any domain that relies on text descriptions of tasks or activities.

Core claim

Under the conditions of semantic exogeneity, construct relevance, monotonicity, and model invariance, LLM-generated scores from occupational task statements constitute valid instruments for latent cognitive variables, as shown by the Augmented Human Capital Index constructed from 18,796 O*NET tasks that exhibits convergent validity with existing indices, discriminant validity across augmentation and substitution dimensions, and yields larger coefficients in Obviously Related Instrumental Variables estimation than OLS.

What carries the argument

The Augmented Human Capital Index (AHC_o) constructed by scoring O*NET task statements with Claude Haiku 4.5, which serves as an instrument for latent cognitive task content.

If this is right

Labor models can recover larger effects of cognitive skills on wages and employment once measurement error is addressed with the index.
AI exposure in occupations splits into two distinct dimensions of augmentation and substitution rather than a single scale.
Inter-model reliability and prompt robustness checks support practical use of the scoring method across models.
The framework scales to quantify any semantic features in large text corpora of job or activity descriptions.
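
A short worked version of the attenuation arithmetic behind the first implication, offered as a sketch under the textbook classical-measurement-error assumption (noise independent of the true score and of the outcome error). The symbols $x^{*}$, $e$, and $\lambda$ are illustrative notation introduced here, not taken from the paper.

```latex
\[
x = x^{*} + e, \qquad
\operatorname{plim}\hat{\beta}_{\mathrm{OLS}} = \lambda\,\beta, \qquad
\lambda = \frac{\sigma^{2}_{x^{*}}}{\sigma^{2}_{x^{*}} + \sigma^{2}_{e}} .
\]
\[
\frac{\hat{\beta}_{\mathrm{IV}}}{\hat{\beta}_{\mathrm{OLS}}} \approx 1.25
\;\Longrightarrow\;
\lambda \approx \frac{1}{1.25} = 0.80 .
\]
```

Read this way, a 25 percent gap would correspond to roughly one fifth of the variance in the observed score being noise, provided the IV estimate is itself consistent.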

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Occupational skill measures could be refreshed regularly as new task data appears or as LLMs improve, allowing dynamic tracking of cognitive demands.
The separation of augmentation and substitution may help target training or adjustment policies toward specific types of AI impact.
The same scoring approach could be tested on latent traits such as creativity or social skills in other text-based datasets.
Global application would require checking whether the validity conditions hold across languages and cultural contexts in labor data.

Load-bearing premise

The four conditions hold for the chosen LLM when applied to O*NET task statements, so that the generated scores capture the intended latent cognitive variables without unmeasured biases.

What would settle it

The claim would be undermined if the Augmented Human Capital Index showed low correlation with established AI exposure indices, or if scores from different LLMs produced inconsistent task rankings, violating model invariance or monotonicity.
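
One way such a ranking check could be run is a Spearman rank correlation between two models' scores for the same tasks. The sketch below is a minimal pure-Python illustration; the function names and the five toy scores are hypothetical and are not the paper's data or code.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(u, v):
    """Pearson correlation of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / (suu * svv) ** 0.5


def spearman(u, v):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(u), average_ranks(v))


# Hypothetical augmentation scores for five tasks from two models.
model_a_scores = [0.9, 0.4, 0.7, 0.1, 0.6]
model_b_scores = [0.8, 0.5, 0.9, 0.2, 0.3]

if __name__ == "__main__":
    print(round(spearman(model_a_scores, model_b_scores), 3))
```

A value near 1 would indicate that the two models order tasks similarly; how low a value would count as a violation is a judgment the paper would need to specify.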

Figures

Figures reproduced from arXiv: 2604.02403 by Cristian Espinal Maya.

**Figure 2.** Pairwise Pearson correlations across 11 AI exposure indices at the 6-digit SOC level

**Figure 3.** Bland–Altman agreement plot for Haiku vs. Sonnet augmentation scores (…

**Figure 4.** Three-way inter-model agreement on augmentation scores. Left: Haiku vs. Sonnet…

**Figure 5.** Prompt sensitivity analysis. Left: Spearman rank correlations across four prompt…

**Figure 6.** Measurement error correction: OLS underestimates the AHC coefficient due to at…

**Figure 7.** Incremental R²: AHC and Frey–Osborne capture complementary information. Together, they explain 0.86 percentage points more wage variance than controls alone.

Original abstract

This paper establishes the theoretical and practical foundations for using Large Language Models (LLMs) as measurement instruments for latent economic variables -- specifically variables that describe the cognitive content of occupational tasks at a level of granularity not achievable with existing survey instruments. I formalize four conditions under which LLM-generated scores constitute valid instruments: semantic exogeneity, construct relevance, monotonicity, and model invariance. I then apply this framework to the Augmented Human Capital Index (AHC_o), constructed from 18,796 O*NET task statements scored by Claude Haiku 4.5, and validated against six existing AI exposure indices. The index shows strong convergent validity (r = 0.85 with Eloundou GPT-gamma, r = 0.79 with Felten AIOE) and discriminant validity. Principal component analysis confirms that AI-related occupational measures span two distinct dimensions -- augmentation and substitution. Inter-rater reliability across two LLM models (n = 3,666 paired scores) yields Pearson r = 0.76 and Krippendorff's alpha = 0.71. Prompt sensitivity analysis across four alternative framings shows that task-level rankings are robust. Obviously Related Instrumental Variables (ORIV) estimation recovers coefficients 25% larger than OLS, consistent with classical measurement error attenuation. The methodology generalizes beyond labor economics to any domain where semantic content must be quantified at scale.
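
To make the attenuation argument in the abstract concrete, the following simulation is a minimal sketch under assumed conditions: a latent score with unit variance, two noisy measures with independent errors (standing in for two LLM scorers), and a true coefficient of 1. All names and parameter values are hypothetical. The averaged two-way IV estimate is a simplified stand-in for the stacked ORIV regression of Gillen et al. (2019), not a reproduction of the paper's estimator.

```python
import random


def cov(u, v):
    """Sample covariance of two equal-length lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (n - 1)


def simulate(n=20000, beta=1.0, noise_sd=0.5, seed=0):
    """Return (OLS slope, averaged IV slope) on simulated data.

    With noise_sd=0.5 the reliability ratio is 1 / (1 + 0.25) = 0.8,
    so OLS should be attenuated toward roughly 0.8 * beta.
    """
    rng = random.Random(seed)
    x_true = [rng.gauss(0, 1) for _ in range(n)]
    # Two measures of the same latent score with independent errors.
    x_a = [x + rng.gauss(0, noise_sd) for x in x_true]
    x_b = [x + rng.gauss(0, noise_sd) for x in x_true]
    y = [beta * x + rng.gauss(0, 1) for x in x_true]

    ols = cov(y, x_a) / cov(x_a, x_a)
    iv_a_by_b = cov(y, x_b) / cov(x_a, x_b)  # x_a instrumented by x_b
    iv_b_by_a = cov(y, x_a) / cov(x_a, x_b)  # x_b instrumented by x_a
    return ols, (iv_a_by_b + iv_b_by_a) / 2


if __name__ == "__main__":
    ols_slope, iv_slope = simulate()
    print(f"OLS: {ols_slope:.3f}  IV (averaged): {iv_slope:.3f}")
```

The design choice here is that each measure instruments for the other, which only removes the bias if the two error terms are uncorrelated; shared errors across LLMs would break that assumption.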

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper formalizes four conditions for LLMs as instruments on occupational tasks and builds a new index with convergent checks, but the key exogeneity and monotonicity claims rest on untested assumptions versus human ratings.

The main takeaway is that Espinal Maya has written down four explicit conditions (semantic exogeneity, construct relevance, monotonicity, and model invariance) for treating LLM scores as instruments for latent cognitive content in job tasks, then used Claude Haiku to score nearly 19,000 O*NET statements into the Augmented Human Capital Index. The index lines up with prior measures at r=0.85 and r=0.79, shows decent cross-model agreement, and produces larger ORIV coefficients than OLS, which fits the measurement-error story. The PCA split into augmentation and substitution dimensions is also a clean addition. Those pieces are useful and straightforward to follow.

The softer spot is that monotonicity and semantic exogeneity are not checked against independent human ratings of the same task statements. The reported validations stay within other indices and LLM outputs, so any shared training-data patterns or prompt artifacts could still be driving the results rather than isolating the intended latent variables. Prompt sensitivity and inter-model reliability help, but they do not close that gap.

This is the kind of paper labor economists working on AI exposure, skills measurement, or automation would want to see, because it offers a scalable alternative to surveys. The framework itself is new enough and the application concrete enough that it deserves referee time rather than a desk reject; the reviewers can push on the human-ground-truth tests and any data or prompt details that are still thin in the current version.

Referee Report

3 major / 3 minor

Summary. The paper claims that LLMs can serve as valid instruments for latent cognitive variables in labor economics by formalizing four conditions (semantic exogeneity, construct relevance, monotonicity, model invariance) and applying them to construct the Augmented Human Capital Index (AHC_o) from 18,796 O*NET task statements scored by Claude Haiku 4.5. It reports strong convergent validity (r=0.85 with Eloundou GPT-gamma, r=0.79 with Felten AIOE), discriminant validity, PCA identifying augmentation and substitution dimensions, inter-model reliability (Pearson r=0.76, Krippendorff alpha=0.71), robust prompt rankings, and ORIV coefficients 25% larger than OLS, consistent with measurement error correction. The approach is positioned as generalizable beyond labor economics.

Significance. If the four conditions hold, the work offers a scalable method to quantify fine-grained semantic content of occupational tasks beyond the limits of traditional surveys, with direct implications for research on AI exposure, skill measurement, and human capital. The empirical components—convergent correlations, reliability metrics, two-dimensional PCA structure, and the ORIV attenuation correction—provide concrete support for practical utility and reproducibility when code and prompts are shared.

major comments (3)

[Theoretical Framework and Validation] The central claim that LLM-generated scores constitute valid instruments rests on the four conditions, but monotonicity and semantic exogeneity receive no direct test against independent human ratings of the latent cognitive constructs on O*NET tasks. Only construct relevance and model invariance are supported via the reported correlations (r=0.85, r=0.79) and inter-model r=0.76; without human-ground-truth monotonicity checks, the ORIV correction may not be valid.
[Empirical Application] The Augmented Human Capital Index (AHC_o) construction from 18,796 statements is load-bearing for all validity claims, yet the manuscript provides no details on the exact scoring prompt, any preprocessing or exclusions, or how scores are aggregated into the index; this prevents verification that the reported convergent validity is not driven by prompt-induced artifacts.
[ORIV Estimation Results] The ORIV result recovering coefficients 25% larger than OLS is presented as evidence of classical measurement error correction, but without the specific instrument construction, error structure assumptions, or robustness checks to alternative specifications, it is unclear whether the 25% figure is robust or sensitive to modeling choices.

minor comments (3)

[Methods] The prompt sensitivity analysis across four alternative framings is mentioned but does not list the exact framings or report the full ranking correlations, which would strengthen the robustness claim.
[Reliability Analysis] The sample size for inter-rater reliability (n=3,666 paired scores) is given, but the selection rule for which tasks were double-scored is not stated.
[Validation] The abstract references validation against six existing AI exposure indices, but only two specific correlations are reported; a table listing all six would improve clarity.
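
For readers checking the reliability figures, Krippendorff's alpha with the interval metric reduces to a short formula when exactly two raters score every unit and nothing is missing. The sketch below is a hand-rolled illustration of that special case only. Whether the paper used the interval metric is an assumption, and the function name and toy scores are hypothetical.

```python
def krippendorff_alpha_interval(rater_a, rater_b):
    """Krippendorff's alpha (interval metric) for two raters, no missing data.

    alpha = 1 - D_o / D_e, where D_o is the mean squared within-unit
    difference and D_e is the mean squared difference over all pairs of
    pooled values.
    """
    if len(rater_a) != len(rater_b) or len(rater_a) < 2:
        raise ValueError("need two equal-length score lists with >= 2 units")
    n_units = len(rater_a)
    # Observed disagreement: squared differences within each unit.
    d_o = sum((a - b) ** 2 for a, b in zip(rater_a, rater_b)) / n_units
    # Expected disagreement: squared differences across all pooled values.
    pooled = list(rater_a) + list(rater_b)
    mean = sum(pooled) / len(pooled)
    d_e = 2 * sum((v - mean) ** 2 for v in pooled) / (len(pooled) - 1)
    if d_e == 0:
        raise ValueError("all scores identical; alpha is undefined")
    return 1 - d_o / d_e


if __name__ == "__main__":
    # Toy scores for four tasks from two hypothetical scorers.
    print(krippendorff_alpha_interval([1, 2, 3, 4], [2, 1, 4, 3]))
```

Unlike Pearson's r, this statistic penalizes systematic offsets between raters, which is one reason the two reported numbers can differ.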

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for their constructive and detailed comments, which have helped us identify areas for clarification and improvement. We address each major comment point by point below, providing honest explanations grounded in the manuscript's framework and data. Revisions will be incorporated where feasible to enhance transparency and robustness without overstating the current evidence.

Referee: [Theoretical Framework and Validation] The central claim that LLM-generated scores constitute valid instruments rests on the four conditions, but monotonicity and semantic exogeneity receive no direct test against independent human ratings of the latent cognitive constructs on O*NET tasks. Only construct relevance and model invariance are supported via the reported correlations (r=0.85, r=0.79) and inter-model r=0.76; without human-ground-truth monotonicity checks, the ORIV correction may not be valid.

Authors: We agree that direct empirical tests of monotonicity and semantic exogeneity against independent human ratings are not present in the current manuscript. The conditions are justified theoretically: semantic exogeneity is ensured by prompt instructions that direct scoring exclusively from task semantic content without referencing labor outcomes, while monotonicity follows from LLMs' training on human text corpora that encode cognitive demand semantics. Construct relevance and model invariance receive empirical support from the reported correlations and reliability metrics. We will add an expanded discussion subsection in the revised manuscript elaborating these theoretical justifications and proposing a protocol for future human validation studies. This addresses the concern by clarifying assumptions without claiming direct human tests.

revision: partial

Referee: [Empirical Application] The Augmented Human Capital Index (AHC_o) construction from 18,796 statements is load-bearing for all validity claims, yet the manuscript provides no details on the exact scoring prompt, any preprocessing or exclusions, or how scores are aggregated into the index; this prevents verification that the reported convergent validity is not driven by prompt-induced artifacts.

Authors: We concur that detailed construction information is necessary for reproducibility and to rule out artifacts. The revised manuscript will include the full exact scoring prompt, a description of preprocessing (limited to standard O*NET task statements with no additional exclusions), and the aggregation method (task-level scores averaged per occupation, incorporating O*NET importance weights where available). We have verified that convergent validity persists under the alternative prompts already analyzed in the sensitivity checks, and these details will be added to the main text or appendix.

revision: yes

Referee: [ORIV Estimation Results] The ORIV result recovering coefficients 25% larger than OLS is presented as evidence of classical measurement error correction, but without the specific instrument construction, error structure assumptions, or robustness checks to alternative specifications, it is unclear whether the 25% figure is robust or sensitive to modeling choices.

Authors: The ORIV approach follows the standard implementation using two LLM-generated scores as obviously related instruments, per the Gillen et al. (2019) framework, under the assumption of classical measurement error. The 25% larger coefficients are from the primary wage/employment specifications. In the revision, we will explicitly detail the instrument construction (pairing Claude Haiku scores with a second model), state the error assumptions, and add robustness tables in the appendix covering alternative instrument pairings and specifications. These additional checks confirm the magnitude is stable.

revision: yes
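
The aggregation step described in the second response (task-level scores averaged per occupation, with O*NET importance weights where available) might look roughly like the sketch below. The row layout, the function name, the toy scores, and especially the fallback weight of 1.0 for tasks without an importance rating are assumptions made for illustration; the rebuttal does not specify them.

```python
from collections import defaultdict


def aggregate_scores(task_rows, default_weight=1.0):
    """Importance-weighted mean task score per occupation.

    task_rows: iterable of (occupation_code, score, importance_or_None).
    Tasks with no importance rating get default_weight (an assumption).
    """
    weighted_sum = defaultdict(float)
    weight_total = defaultdict(float)
    for occupation, score, importance in task_rows:
        weight = default_weight if importance is None else importance
        weighted_sum[occupation] += weight * score
        weight_total[occupation] += weight
    return {occ: weighted_sum[occ] / weight_total[occ] for occ in weighted_sum}


# Hypothetical task-level scores; the codes are used only as labels.
example_rows = [
    ("15-1252.00", 0.8, 4.0),
    ("15-1252.00", 0.2, 1.0),
    ("43-4051.00", 0.5, None),
]

if __name__ == "__main__":
    print(aggregate_scores(example_rows))
```

Mixing rated and unrated tasks on one weight scale is a real design decision: a different fallback (for example, dropping unrated tasks) could shift occupation-level values, which is part of why the referee asks for the exact rule.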

standing simulated objections not resolved

Direct human-ground-truth ratings for monotonicity and semantic exogeneity tests on the full set of 18,796 O*NET tasks remain unavailable, because collecting them requires new primary data outside the scope of the current study.

Circularity Check

0 steps flagged

No circularity: LLM instrument conditions validated via external indices and standard methods

The paper defines four conditions for valid LLM instruments and constructs the AHC_o index by scoring O*NET tasks with Claude Haiku 4.5. All load-bearing empirical steps rely on external benchmarks (convergent correlations r=0.85/0.79 with Eloundou/Felten indices, PCA for two dimensions, inter-model r=0.76, ORIV vs OLS comparison) and standard techniques rather than any self-definitional reduction, fitted parameter renamed as prediction, or self-citation chain. The central claims do not collapse into the paper's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that the four conditions ensure instrument validity for LLM scores and on the introduction of the AHC_o as a new constructed index without independent falsifiable evidence beyond reported correlations.

axioms (1)

domain assumption: LLM-generated scores satisfy semantic exogeneity, construct relevance, monotonicity, and model invariance for occupational task statements
These four conditions are formalized in the paper as necessary and sufficient for the scores to constitute valid instruments.

invented entities (1)

Augmented Human Capital Index (AHC_o): no independent evidence
purpose: Measure latent cognitive content of occupational tasks at fine granularity using LLM scores from O*NET statements
New index constructed from 18,796 task statements scored by Claude Haiku 4.5; no independent evidence provided beyond internal validations.

Reference graph

Works this paper leans on

6 extracted references · 6 canonical work pages

[1] Tyna Eloundou, Sam Manning, Pamela Mishkin, and Daniel Rock. GPTs are GPTs: An early look at the labor market impact potential of large language models. Science, 384(6702):1306–1308, 2024.

[2] Cristian Espinal Maya. Augmented human capital: A unified theory and LLM-based measurement framework for cognitive factor decomposition in AI-augmented economies. Technical report, Universidad EAFIT, 2026. arXiv preprint.

[3] Edward Felten, Manav Raj, and Robert Seamans. Occupational, industry, and geographic exposure to artificial intelligence: A novel dataset and its potential uses. Strategic Management Journal, 42(12):2195–2217, 2021.

[4] Ben Gillen, Erik Snowberg, and Leeat Yariv. Experimenting with measurement error: Techniques with applications to the Caltech Cohort Study. Journal of Political Economy, 127(4):1826–1863, 2019.

[5] Klaus Krippendorff. Computing Krippendorff’s alpha-reliability. Communication Methods and Measures, 5(1):77–89, 2011.

[6] Michael Webb. The impact of artificial intelligence on the labor market. Technical report, Stanford University, 2020.
