Measuring What Cannot Be Surveyed: LLMs as Instruments for Latent Cognitive Variables in Labor Economics
Pith reviewed 2026-05-13 20:27 UTC · model grok-4.3
The pith
Large language models can serve as valid instruments for latent cognitive variables in occupational tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under four conditions (semantic exogeneity, construct relevance, monotonicity, and model invariance), LLM-generated scores of occupational task statements constitute valid instruments for latent cognitive variables. The supporting evidence is the Augmented Human Capital Index, constructed from 18,796 O*NET task statements: it shows convergent validity with existing indices and discriminant validity across the augmentation and substitution dimensions, and Obviously Related Instrumental Variables (ORIV) estimation with it yields larger coefficients than OLS.
What carries the argument
The Augmented Human Capital Index (AHC_o) constructed by scoring O*NET task statements with Claude Haiku 4.5, which serves as an instrument for latent cognitive task content.
If this is right
- Labor models can recover larger effects of cognitive skills on wages and employment once measurement error is addressed with the index.
- AI exposure in occupations splits into two distinct dimensions of augmentation and substitution rather than a single scale.
- Inter-model reliability and prompt robustness checks support practical use of the scoring method across models.
- The framework scales to quantify any semantic features in large text corpora of job or activity descriptions.
Where Pith is reading between the lines
- Occupational skill measures could be refreshed regularly as new task data appears or as LLMs improve, allowing dynamic tracking of cognitive demands.
- The separation of augmentation and substitution may help target training or adjustment policies toward specific types of AI impact.
- The same scoring approach could be tested on latent traits such as creativity or social skills in other text-based datasets.
- Global application would require checking whether the validity conditions hold across languages and cultural contexts in labor data.
Load-bearing premise
The four conditions hold for the chosen LLM when applied to O*NET task statements, so that the generated scores capture the intended latent cognitive variables without unmeasured biases.
What would settle it
The claim would fail if the Augmented Human Capital Index correlated only weakly with established AI exposure indices, or if scores from different LLMs produced inconsistent task rankings, violating model invariance or monotonicity.
Original abstract
This paper establishes the theoretical and practical foundations for using Large Language Models (LLMs) as measurement instruments for latent economic variables -- specifically variables that describe the cognitive content of occupational tasks at a level of granularity not achievable with existing survey instruments. I formalize four conditions under which LLM-generated scores constitute valid instruments: semantic exogeneity, construct relevance, monotonicity, and model invariance. I then apply this framework to the Augmented Human Capital Index (AHC_o), constructed from 18,796 O*NET task statements scored by Claude Haiku 4.5, and validated against six existing AI exposure indices. The index shows strong convergent validity (r = 0.85 with Eloundou GPT-gamma, r = 0.79 with Felten AIOE) and discriminant validity. Principal component analysis confirms that AI-related occupational measures span two distinct dimensions -- augmentation and substitution. Inter-rater reliability across two LLM models (n = 3,666 paired scores) yields Pearson r = 0.76 and Krippendorff's alpha = 0.71. Prompt sensitivity analysis across four alternative framings shows that task-level rankings are robust. Obviously Related Instrumental Variables (ORIV) estimation recovers coefficients 25% larger than OLS, consistent with classical measurement error attenuation. The methodology generalizes beyond labor economics to any domain where semantic content must be quantified at scale.
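The attenuation arithmetic behind the 25% figure can be worked out in a few lines. This is a textbook sketch under assumptions the abstract does not spell out: a single regressor, two noisy measures W_1 and W_2 of the same latent variable X^*, and classical errors u_j that are independent of X^*, of each other, and of the outcome error, with equal variances. The symbols are illustrative and are not taken from the paper.

```latex
\begin{aligned}
W_j &= X^{*} + u_j,\quad j \in \{1, 2\},
\qquad y = \beta X^{*} + \varepsilon, \\
\operatorname{plim}\,\hat\beta_{\mathrm{OLS}} &= \lambda\,\beta,
\qquad \lambda = \frac{\sigma_{X^{*}}^{2}}{\sigma_{X^{*}}^{2} + \sigma_{u}^{2}}, \\
\frac{\hat\beta_{\mathrm{ORIV}}}{\hat\beta_{\mathrm{OLS}}} = 1.25
\;&\Longrightarrow\;
\hat\lambda = \frac{1}{1.25} = 0.80, \\
\operatorname{corr}(W_1, W_2) &= \lambda .
\end{aligned}
```

Under these assumptions the reported gap implies a reliability ratio of about 0.80, which is in the same range as the reported inter-model Pearson r of 0.76. The two numbers are not strictly comparable, since the inter-model correlation is computed on paired task-level scores while the regression may use occupation-level aggregates and additional controls.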
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that LLMs can serve as valid instruments for latent cognitive variables in labor economics by formalizing four conditions (semantic exogeneity, construct relevance, monotonicity, model invariance) and applying them to construct the Augmented Human Capital Index (AHC_o) from 18,796 O*NET task statements scored by Claude Haiku 4.5. It reports strong convergent validity (r=0.85 with Eloundou GPT-gamma, r=0.79 with Felten AIOE), discriminant validity, PCA identifying augmentation and substitution dimensions, inter-model reliability (Pearson r=0.76, Krippendorff alpha=0.71), robust prompt rankings, and ORIV coefficients 25% larger than OLS, consistent with measurement error correction. The approach is positioned as generalizable beyond labor economics.
Significance. If the four conditions hold, the work offers a scalable method to quantify fine-grained semantic content of occupational tasks beyond the limits of traditional surveys, with direct implications for research on AI exposure, skill measurement, and human capital. The empirical components—convergent correlations, reliability metrics, two-dimensional PCA structure, and the ORIV attenuation correction—provide concrete support for practical utility and reproducibility when code and prompts are shared.
major comments (3)
- [Theoretical Framework and Validation] The central claim that LLM-generated scores constitute valid instruments rests on the four conditions, but monotonicity and semantic exogeneity receive no direct test against independent human ratings of the latent cognitive constructs on O*NET tasks. Only construct relevance and model invariance are supported via the reported correlations (r=0.85, r=0.79) and inter-model r=0.76; without human-ground-truth monotonicity checks, the ORIV correction may not be valid.
- [Empirical Application] The Augmented Human Capital Index (AHC_o) construction from 18,796 statements is load-bearing for all validity claims, yet the manuscript provides no details on the exact scoring prompt, any preprocessing or exclusions, or how scores are aggregated into the index; this prevents verification that the reported convergent validity is not driven by prompt-induced artifacts.
- [ORIV Estimation Results] The ORIV result recovering coefficients 25% larger than OLS is presented as evidence of classical measurement error correction, but without the specific instrument construction, error structure assumptions, or robustness checks to alternative specifications, it is unclear whether the 25% figure is robust or sensitive to modeling choices.
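The third major comment turns on how the ORIV estimate is constructed. As a minimal sketch of the general technique, and not the paper's specification, the simulation below generates two noisy measures of one latent variable, runs OLS on one of them, and runs a stacked instrumental-variables regression in which each measure instruments the other. All data are synthetic, the function names `ols_slope` and `oriv_slope` are hypothetical, and the noise scale is chosen only so that the reliability ratio is 0.8 by construction.

```python
import numpy as np


def ols_slope(x, y):
    """Slope from a simple regression of y on x (with intercept)."""
    xc, yc = x - x.mean(), y - y.mean()
    return float(xc @ yc / (xc @ xc))


def oriv_slope(w1, w2, y):
    """Stacked IV slope: each noisy measure is instrumented by the other.

    Demeaning each variable separately plays the role of the
    replicate-specific intercepts in the stacked regression.
    """
    w1c, w2c, yc = w1 - w1.mean(), w2 - w2.mean(), y - y.mean()
    x_stack = np.concatenate([w1c, w2c])  # regressor in each replicate
    z_stack = np.concatenate([w2c, w1c])  # instrument in each replicate
    y_stack = np.concatenate([yc, yc])    # outcome repeated twice
    # Just-identified IV estimator: (Z'y) / (Z'x)
    return float(z_stack @ y_stack / (z_stack @ x_stack))


# Synthetic data (assumption: classical, independent measurement errors).
rng = np.random.default_rng(0)
n = 200_000
beta = 1.0
x_true = rng.normal(size=n)                   # latent variable, variance 1
w1 = x_true + rng.normal(scale=0.5, size=n)   # noisy measure 1, error variance 0.25
w2 = x_true + rng.normal(scale=0.5, size=n)   # noisy measure 2, error variance 0.25
y = beta * x_true + rng.normal(size=n)

b_ols = ols_slope(w1, y)
b_oriv = oriv_slope(w1, w2, y)
print(f"OLS slope:  {b_ols:.3f}")
print(f"ORIV slope: {b_oriv:.3f}")
```

With error variance 0.25 against a latent variance of 1, the OLS slope converges to 0.8 times the true coefficient while the stacked IV slope converges to the true coefficient. The sketch omits inference: the stacked observations are not independent, so standard errors would need to be clustered by unit, and correlated errors across the two measures would break the correction.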
minor comments (3)
- [Methods] The prompt sensitivity analysis across four alternative framings is mentioned but does not list the exact framings or report the full ranking correlations, which would strengthen the robustness claim.
- [Reliability Analysis] The sample size for inter-rater reliability (n=3,666 paired scores) is given, but the selection rule for which tasks were double-scored is not stated.
- [Validation] The abstract references validation against six existing AI exposure indices, but only two specific correlations are reported; a table listing all six would improve clarity.
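The first minor comment asks for the full ranking correlations across prompt framings. One way such a table could be produced is a pairwise Spearman correlation matrix over task-level scores, sketched below with NumPy only. The scores are synthetic, the 1 to 5 scale is an assumption, and `average_ranks` and `spearman_matrix` are hypothetical helper names; the paper's actual robustness statistic is not stated in this summary.

```python
import numpy as np


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    values = np.asarray(values, dtype=float)
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    last = np.cumsum(counts)       # last sorted position of each tie group
    first = last - counts + 1      # first sorted position of each tie group
    return ((first + last) / 2.0)[inverse]


def spearman_matrix(scores):
    """Pairwise Spearman correlations between columns of an (n_tasks, n_framings) array."""
    scores = np.asarray(scores, dtype=float)
    ranked = np.column_stack(
        [average_ranks(scores[:, j]) for j in range(scores.shape[1])]
    )
    # Spearman correlation is the Pearson correlation of the ranks.
    return np.corrcoef(ranked, rowvar=False)


# Synthetic example: four framings scoring the same 500 tasks on a 1-5 scale.
rng = np.random.default_rng(1)
latent = rng.normal(size=500)
framings = np.column_stack(
    [
        np.clip(np.round(3 + latent + rng.normal(scale=0.5, size=500)), 1, 5)
        for _ in range(4)
    ]
)
rho = spearman_matrix(framings)
print(np.round(rho, 2))
```

Average ranks are used because scores on a short integer scale produce many ties, which a plain sort-based ranking would break arbitrarily.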
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments, which have helped us identify areas for clarification and improvement. We address each major comment point by point below, providing honest explanations grounded in the manuscript's framework and data. Revisions will be incorporated where feasible to enhance transparency and robustness without overstating the current evidence.
- Referee: [Theoretical Framework and Validation] The central claim that LLM-generated scores constitute valid instruments rests on the four conditions, but monotonicity and semantic exogeneity receive no direct test against independent human ratings of the latent cognitive constructs on O*NET tasks. Only construct relevance and model invariance are supported via the reported correlations (r=0.85, r=0.79) and inter-model r=0.76; without human-ground-truth monotonicity checks, the ORIV correction may not be valid.
  Authors: We agree that direct empirical tests of monotonicity and semantic exogeneity against independent human ratings are not present in the current manuscript. The conditions are justified theoretically: semantic exogeneity is ensured by prompt instructions that direct scoring exclusively from task semantic content without referencing labor outcomes, while monotonicity follows from LLMs' training on human text corpora that encode cognitive demand semantics. Construct relevance and model invariance receive empirical support from the reported correlations and reliability metrics. We will add an expanded discussion subsection in the revised manuscript elaborating these theoretical justifications and proposing a protocol for future human validation studies. This addresses the concern by clarifying assumptions without claiming direct human tests.
  Revision: partial
- Referee: [Empirical Application] The Augmented Human Capital Index (AHC_o) construction from 18,796 statements is load-bearing for all validity claims, yet the manuscript provides no details on the exact scoring prompt, any preprocessing or exclusions, or how scores are aggregated into the index; this prevents verification that the reported convergent validity is not driven by prompt-induced artifacts.
  Authors: We concur that detailed construction information is necessary for reproducibility and to rule out artifacts. The revised manuscript will include the full exact scoring prompt, a description of preprocessing (limited to standard O*NET task statements with no additional exclusions), and the aggregation method (task-level scores averaged per occupation, incorporating O*NET importance weights where available). We have verified that convergent validity persists under the alternative prompts already analyzed in the sensitivity checks, and these details will be added to the main text or appendix.
  Revision: yes
- Referee: [ORIV Estimation Results] The ORIV result recovering coefficients 25% larger than OLS is presented as evidence of classical measurement error correction, but without the specific instrument construction, error structure assumptions, or robustness checks to alternative specifications, it is unclear whether the 25% figure is robust or sensitive to modeling choices.
  Authors: The ORIV approach follows the standard implementation using two LLM-generated scores as obviously related instruments, per the Gillen et al. (2019) framework, under the assumption of classical measurement error. The 25% larger coefficients are from the primary wage/employment specifications. In the revision, we will explicitly detail the instrument construction (pairing Claude Haiku scores with a second model), state the error assumptions, and add robustness tables in the appendix covering alternative instrument pairings and specifications. These additional checks confirm the magnitude is stable.
  Revision: yes
- Direct human-ground-truth ratings for monotonicity and semantic exogeneity tests on the full set of 18,796 O*NET tasks, as this requires new primary data collection outside the scope of the current study.
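The aggregation step described in the authors' second response (task-level scores averaged per occupation, with O*NET importance weights where available) can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: the function name, the occupation codes, and the scores are hypothetical, and the rule of falling back to an unweighted mean whenever any task in an occupation lacks a weight is one possible reading of "where available".

```python
from collections import defaultdict


def aggregate_by_occupation(tasks):
    """Aggregate task-level scores into one index value per occupation.

    tasks: iterable of (occupation_code, score, importance_or_None).
    Assumption: an occupation uses the importance-weighted mean only if
    every one of its tasks has a positive weight; otherwise the plain mean.
    """
    grouped = defaultdict(list)
    for occupation, score, importance in tasks:
        grouped[occupation].append((float(score), importance))

    index = {}
    for occupation, rows in grouped.items():
        scores = [s for s, _ in rows]
        weights = [w for _, w in rows]
        if all(w is not None and w > 0 for w in weights):
            total = sum(weights)
            index[occupation] = sum(s * w for s, w in rows) / total
        else:
            # Fallback (assumption): unweighted mean when weights are incomplete.
            index[occupation] = sum(scores) / len(scores)
    return index


# Hypothetical records: (occupation code, LLM score, importance weight).
example_tasks = [
    ("OCC-A", 4.0, 5.0),
    ("OCC-A", 2.0, 1.0),
    ("OCC-B", 3.0, None),  # weight missing, so OCC-B falls back to a plain mean
    ("OCC-B", 5.0, 4.0),
]
index = aggregate_by_occupation(example_tasks)
print(index)
```

Mixing weighted and unweighted tasks inside one occupation would put weights on different scales in the same average, which is why the sketch switches method per occupation rather than per task.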
Circularity Check
No circularity: LLM instrument conditions validated via external indices and standard methods
The paper defines four conditions for valid LLM instruments and constructs the AHC_o index by scoring O*NET tasks with Claude Haiku 4.5. All load-bearing empirical steps rely on external benchmarks (convergent correlations r=0.85/0.79 with Eloundou/Felten indices, PCA for two dimensions, inter-model r=0.76, ORIV vs OLS comparison) and standard techniques rather than any self-definitional reduction, fitted parameter renamed as prediction, or self-citation chain. The central claims do not collapse into the paper's own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LLM-generated scores satisfy semantic exogeneity, construct relevance, monotonicity, and model invariance for occupational task statements
invented entities (1)
- Augmented Human Capital Index (AHC_o): no independent evidence
Reference graph
Works this paper leans on
- [1] Tyna Eloundou, Sam Manning, Pamela Mishkin, and Daniel Rock. GPTs are GPTs: An early look at the labor market impact potential of large language models. Science, 384(6702):1306–1308, 2024.
- [2] Cristian Espinal Maya. Augmented human capital: A unified theory and LLM-based measurement framework for cognitive factor decomposition in AI-augmented economies. Technical report, Universidad EAFIT, 2026. arXiv preprint.
- [3] Edward Felten, Manav Raj, and Robert Seamans. Occupational, industry, and geographic exposure to artificial intelligence: A novel dataset and its potential uses. Strategic Management Journal, 42(12):2195–2217, 2021.
- [4] Ben Gillen, Erik Snowberg, and Leeat Yariv. Experimenting with measurement error: Techniques with applications to the Caltech cohort study. Journal of Political Economy, 127(4):1826–1863, 2019.
- [5] Klaus Krippendorff. Computing Krippendorff’s alpha-reliability. Communication Methods and Measures, 5(1):77–89, 2011.
- [6] Michael Webb. The impact of artificial intelligence on the labor market. Technical report, Stanford University, 2020.