Correct codes for the wrong reasons? Validating LLMs as measurement instruments for theoretical constructs

Manuel Pita

arXiv: 2606.28574 · v1 · pith:O4UEV6OCnew · submitted 2026-06-26 · cs.CL · cs.AI · cs.CY


Pith reviewed 2026-06-30 00:37 UTC · model grok-4.3


keywords: construct validity · LLM coding · grain calibration · measurement instruments · theoretical constructs · reliability versus validity · natural language processing · social science measurement


The pith

An LLM may agree with human coders on a construct yet still fail to measure it according to its defining theory.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

When an LLM agrees with a human annotator on how to code a construct in text, that shows reliability but says nothing about whether the LLM is actually using the construct as the theory defines it. The model might arrive at the right code through some unrelated pattern. Grain calibration fixes this by splitting the construct into smaller, testable parts at the clause level, checking each part against the text with direct evidence, and then applying a clear rule to combine those checks. This makes the reasoning visible and tied to the theory rather than hidden in the model's output. A sympathetic reader would care because it turns validation into a check on whether the instrument truly measures what the theory intends, not just whether it matches human labels.
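A minimal sketch of why agreement alone cannot separate the two cases. Everything here is an illustrative assumption rather than material from the paper: the toy "harm-based moral appeal" construct, the four texts, the keyword lists, and both coder functions. Two coders reach the same Cohen's kappa against the human codes, yet only one of them checks what the toy theory requires.

```python
from collections import Counter

def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length label sequences."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    freq_a, freq_b = Counter(a), Counter(b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        # Kappa is undefined when both raters use one identical label;
        # this sketch treats that case as perfect agreement.
        return 1.0
    return (observed - expected) / (1 - expected)

# Hypothetical corpus and human codes (1 = construct present).
texts = [
    "The policy hurts families, and that is wrong.",
    "The cuts harm patients, which is unjust.",
    "The meeting is on Tuesday.",
    "Rain is expected tomorrow.",
]
human = [1, 1, 0, 0]

def shortcut_coder(text):
    # Correlate only: fires on evaluative words, ignores harm and affected party.
    return int(any(w in text.lower() for w in ("wrong", "unjust")))

def theory_coder(text):
    # Stand-in for a theory-aligned coder: needs harm, affected party, and judgment.
    t = text.lower()
    harm = any(w in t for w in ("hurts", "harm"))
    party = any(w in t for w in ("families", "patients"))
    judgment = any(w in t for w in ("wrong", "unjust"))
    return int(harm and party and judgment)

kappa_shortcut = cohen_kappa(human, [shortcut_coder(t) for t in texts])
kappa_theory = cohen_kappa(human, [theory_coder(t) for t in texts])

probe = "This soup tastes wrong."
print(kappa_shortcut, kappa_theory)                # same agreement on the sample
print(shortcut_coder(probe), theory_coder(probe))  # the coders diverge on the probe
```

On this sample the agreement statistic is the same for both coders, so it carries no information about which one runs on the construct. The difference only appears on a text where the correlate and the construct come apart.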

Core claim

Current methods validate LLMs as measurement instruments only by their agreement with human annotators, which establishes reliability but leaves construct validity unexamined. An LLM may produce the correct code through a correlate that satisfies none of the theory's requirements. Grain calibration closes this gap by decomposing the construct into clause-level components, testing each with extractive evidence from the text, and combining the results via an explicit rule derived from the theory. Because the rule is stated explicitly, the process itself becomes evidence of whether the instrument runs on the specified construct.

What carries the argument

Grain calibration: a method that decomposes a construct into clause-level components, tests each against the text with extractive evidence, and combines results through an explicit, theory-derived rule.
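The three steps named above can be sketched as follows. This is a sketch under stated assumptions, not the author's implementation: the abstract gives no procedure, so the component names, the cue lists, the `cue_extractor` helper, and the conjunctive rule are all hypothetical. In practice each extractor would presumably be an LLM call asked to quote a supporting span; a keyword matcher stands in here so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

@dataclass
class ComponentResult:
    name: str
    satisfied: bool
    evidence: Optional[str]  # verbatim span from the text, or None

# An extractor returns a span it claims supports the component, or None.
Extractor = Callable[[str], Optional[str]]

def cue_extractor(cues: List[str]) -> Extractor:
    """Toy stand-in for an LLM extractor: return the first cue found verbatim."""
    def extract(text: str) -> Optional[str]:
        lowered = text.lower()
        for cue in cues:
            start = lowered.find(cue)
            if start != -1:
                return text[start:start + len(cue)]
        return None
    return extract

def grain_calibrate(
    text: str,
    components: Dict[str, Extractor],
    rule: Callable[[Dict[str, bool]], bool],
) -> Tuple[bool, List[ComponentResult]]:
    """Test each component against the text, then combine with an explicit rule."""
    results = []
    for name, extract in components.items():
        span = extract(text)
        # Evidence counts only if it is a literal substring of the text.
        grounded = span is not None and span in text
        results.append(ComponentResult(name, grounded, span if grounded else None))
    code = rule({r.name: r.satisfied for r in results})
    return code, results

# Hypothetical decomposition of a toy "harm-based moral appeal" construct.
COMPONENTS = {
    "affected_party": cue_extractor(["families", "patients"]),
    "harm": cue_extractor(["hurts", "harm"]),
    "judgment": cue_extractor(["wrong", "unjust"]),
}

def conjunctive_rule(c: Dict[str, bool]) -> bool:
    # Assumed rule: the toy theory requires all three components.
    return c["affected_party"] and c["harm"] and c["judgment"]

code, trace = grain_calibrate(
    "The policy hurts families, and that is wrong.", COMPONENTS, conjunctive_rule
)
code2, trace2 = grain_calibrate("The policy is wrong.", COMPONENTS, conjunctive_rule)
missing = [r.name for r in trace2 if not r.satisfied]
print(code, [(r.name, r.evidence) for r in trace])
print(code2, missing)
```

The returned trace is what makes the process inspectable: each code comes with the components that settled it and the spans offered as evidence, and a negative code names the components that were not found.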

If this is right

Validation can distinguish theory-aligned measurement from correct codes reached via wrong reasons.
When a code is wrong, it identifies whether a component was missed or an adjacent construct was mistaken for it.
The explicit rule provides evidence about the measurement process rather than only the final output.
This applies to constructs that admit clause-level decomposition and evidence extraction.
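One way the error diagnosis in the second point could look, as a hedged sketch. It assumes that component-level reference judgments from a human expert are available, which the abstract does not state, and the `diagnose` function and component names are hypothetical. A component the instrument failed to find is reported as missed. A component the instrument found but the reference rejects is reported as spurious, which is only a candidate sign of confusion with an adjacent construct, not proof of it.

```python
from typing import Dict, List

def diagnose(instrument: Dict[str, bool], reference: Dict[str, bool]) -> Dict[str, List[str]]:
    """Compare component-level judgments (component name -> satisfied?)."""
    missed = sorted(
        name for name, ok in reference.items()
        if ok and not instrument.get(name, False)
    )
    spurious = sorted(
        name for name, ok in reference.items()
        if not ok and instrument.get(name, False)
    )
    return {"missed": missed, "spurious": spurious}

# Hypothetical case 1: the instrument overlooks the harm component.
report_missed = diagnose(
    instrument={"affected_party": True, "harm": False, "judgment": True},
    reference={"affected_party": True, "harm": True, "judgment": True},
)

# Hypothetical case 2: the instrument accepts harm evidence the expert rejects.
report_spurious = diagnose(
    instrument={"affected_party": True, "harm": True, "judgment": True},
    reference={"affected_party": True, "harm": False, "judgment": True},
)

print(report_missed)
print(report_spurious)
```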

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Large-scale theoretical analysis could become feasible if grain calibration provides a transparent layer for automated coding at lower cost than full human annotation.
Social science theories may need more precise clause-level specifications to support this form of validation.
The method could be used to compare how different LLMs align with the same theoretical construct across datasets.
Similar decomposition and rule-based checks might apply to other AI systems used for measurement tasks.

Load-bearing premise

That a construct can be decomposed into clause-level components whose independent tests, when combined by an explicit rule, fully capture the demands the construct's theory makes without introducing new mismatches or biases.

What would settle it

A case where an LLM passes grain calibration on all components and the rule but produces codes that violate a core prediction of the theory on new texts, or where experts cannot agree on a clause-level decomposition that matches the theory.

Original abstract

When a large language model (LLM) codes a construct in text as a human annotator would, that agreement makes the LLM a reliable coder. Yet reliability leaves construct validity untouched. The instrument may be theory-naive, reaching the code through a correlate that meets none of the demands the construct's theory makes, and no current method tells that apart from genuine measurement. We propose grain calibration as a method that closes the gap. It decomposes a construct into clause-level components, tests each against the text with extractive evidence, and combines the results through an explicit, theory-derived rule. Because the rule is stated rather than lodged in one opaque pass, its structure is evidence about the process rather than the output. It shows which components settled a code, and, when the code is wrong, whether a component was missed or an adjacent construct mistaken for it. Validation shifts from scoring an instrument's outputs against an annotator to showing that the instrument runs on the construct its theory specifies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Grain calibration names the reliability-vs-validity gap for LLM construct coding but leaves the decomposition step underspecified and untested.


The main point is that this paper correctly flags how LLM-human agreement only shows reliability, not that the model is actually running on the theoretical construct rather than a correlate. Grain calibration tries to fix that by breaking the construct into clause-level pieces, pulling extractive evidence for each, and combining them with an explicit theory-derived rule so you can see what drove the code.

It does a clean job of shifting the validation question from output matching to process matching, and the visibility of the rule is a practical improvement over black-box prompting. That framing is useful for anyone doing computational social science measurement.

The soft spot is the decomposition itself. Nothing in the description shows how to derive the clauses from the theory in a way that avoids arbitrary choices or new mismatches, and the stress-test concern holds up: if two plausible decompositions produce different results, there is no built-in way to decide which one actually measures the intended construct. The abstract and available material give no examples, no implementation, and no test cases, so it is still a proposal rather than a demonstrated method.

This is for people building or evaluating LLM annotation pipelines who already care about measurement quality. A reader working on construct validity in text coding will find the conceptual move worth considering. It deserves a serious referee because the problem is real and the direction is worth developing, even if the current version needs concrete demonstrations and a clearer procedure for the decomposition step before it can be evaluated properly.

Recommendation: send it to review rather than desk reject, and ask the authors for worked examples and checks on whether different decompositions converge.

Referee Report

1 major / 0 minor

Summary. The paper argues that agreement between LLMs and human annotators on coding theoretical constructs establishes only reliability, not construct validity, because the LLM may reach the code via a correlate rather than the theory-specified process. It proposes 'grain calibration' as a solution: decompose the construct into clause-level components, test each component against the text using extractive evidence, and combine the results via an explicit, theory-derived rule. This makes the decision process inspectable, revealing which components drove the code and whether errors stem from missed components or confusion with adjacent constructs. Validation thereby shifts from output agreement to evidence that the instrument operates on the intended construct.

Significance. If operationalized, the approach would address a genuine gap in using LLMs for measurement in fields that rely on text-coded constructs, by providing process-level rather than purely correlational validation. It explicitly credits the shift from scoring outputs against annotators to demonstrating theory-aligned operation. No machine-checked proofs or reproducible code are present, but the proposal is falsifiable in principle through tests of alternative decompositions.

major comments (1)

[Abstract] Abstract, paragraph on grain calibration: the central claim requires that clause-level decomposition plus an explicit combination rule fully reproduces the theory's demands without new mismatches or biases, yet the manuscript provides no procedure for deriving or validating the decomposition from the theory itself. If two theory-informed decompositions of the same construct produce different component sets or rules, the method cannot adjudicate which (if either) measures the intended construct; this is load-bearing for the claim that grain calibration closes the validity gap.
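The concern about competing decompositions can at least be made measurable. The sketch below is an assumption-laden illustration, not a procedure from the manuscript: both decompositions, the keyword predicates, and the `compare_decompositions` helper are hypothetical. It reports how often two decompositions of the same toy construct assign different codes and which texts drive the disagreement. It does not decide which decomposition is faithful to the theory.

```python
from typing import Callable, Dict, List, Tuple

Predicate = Callable[[str], bool]
Rule = Callable[[Dict[str, bool]], bool]
Decomposition = Tuple[Dict[str, Predicate], Rule]

def has(cues: List[str]) -> Predicate:
    """Toy component test: true if any cue occurs in the text."""
    return lambda text: any(cue in text.lower() for cue in cues)

def code_with(decomposition: Decomposition, text: str) -> bool:
    components, rule = decomposition
    return rule({name: test(text) for name, test in components.items()})

def compare_decompositions(
    dec_a: Decomposition, dec_b: Decomposition, texts: List[str]
) -> Tuple[float, List[str]]:
    """Return the disagreement rate and the texts on which the codes differ."""
    disagreements = [t for t in texts if code_with(dec_a, t) != code_with(dec_b, t)]
    rate = len(disagreements) / len(texts) if texts else 0.0
    return rate, disagreements

# Decomposition A (hypothetical): harm, affected party, and judgment are all required.
dec_a: Decomposition = (
    {
        "affected_party": has(["families", "patients"]),
        "harm": has(["hurts", "harm"]),
        "judgment": has(["wrong", "unjust"]),
    },
    lambda c: c["affected_party"] and c["harm"] and c["judgment"],
)

# Decomposition B (hypothetical): drops the affected-party requirement.
dec_b: Decomposition = (
    {
        "harm": has(["hurts", "harm"]),
        "judgment": has(["wrong", "unjust"]),
    },
    lambda c: c["harm"] and c["judgment"],
)

texts = [
    "The policy hurts families, and that is wrong.",
    "Smoking causes harm, and that is wrong.",
    "The meeting is on Tuesday.",
    "Rain is expected tomorrow.",
]

rate, disagreements = compare_decompositions(dec_a, dec_b, texts)
print(rate, disagreements)
```

A nonzero rate shows that the choice of decomposition changes the measurement. Settling which decomposition is correct still requires an argument from the theory itself.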

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for this precise observation on the abstract. We address the point directly below.


Referee: [Abstract] Abstract, paragraph on grain calibration: the central claim requires that clause-level decomposition plus an explicit combination rule fully reproduces the theory's demands without new mismatches or biases, yet the manuscript provides no procedure for deriving or validating the decomposition from the theory itself. If two theory-informed decompositions of the same construct produce different component sets or rules, the method cannot adjudicate which (if either) measures the intended construct; this is load-bearing for the claim that grain calibration closes the validity gap.

Authors: Grain calibration treats decomposition as a prior theoretical step, not an output of the method. Researchers derive clause-level components and the combination rule directly from the construct's formal definition in the source theory, exactly as they do when operationalizing any construct for manual coding. The method then supplies extractive tests and an explicit rule so that an LLM's application of those components can be inspected. Different decompositions are adjudicated by their fidelity to the original theory, not by grain calibration itself; the calibration procedure simply makes visible whether the LLM follows the chosen components or substitutes a correlate. This division of labor is standard in measurement theory and does not weaken the claim that grain calibration moves validation from output agreement to process alignment once the components are specified.

Revision: no

Circularity Check

0 steps flagged

No circularity in derivation chain


The paper proposes grain calibration as a methodological framework for validating LLM coding of theoretical constructs. It describes decomposition into clause-level components, extractive testing, and explicit combination rules without any equations, fitted parameters, predictions derived from inputs, or load-bearing self-citations. The central claim concerns a shift in validation approach and does not reduce any result to its own inputs by construction; the method is presented as external to model internals and independent of the constructs being measured.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

Abstract-only; the proposal rests on the domain assumption that constructs admit clause-level decomposition and explicit rule combination without loss of theoretical content.

axioms (2)

Domain assumption: Theoretical constructs can be decomposed into clause-level components that can be tested independently against text.
Basis: core premise of grain calibration described in the abstract.
Domain assumption: An explicit theory-derived rule can combine component results without introducing new construct mismatches.
Basis: required for the method to provide evidence about the measurement process.

invented entities (1)

Grain calibration (no independent evidence)
Purpose: method to validate LLM coding of constructs via explicit component testing.
Basis: newly proposed procedure in the abstract.

pith-pipeline@v0.9.1-grok · 5699 in / 1237 out tokens · 32752 ms · 2026-06-30T00:37:11.924014+00:00 · methodology



[1] Ellie Pavlick. Symbols and grounding in large language models. Philosophical Transactions of the Royal Society A, 381(2251):20220041, 2023. ISSN 1364-503X. doi: 10.1098/rsta.2022.0041. URL https://royalsocietypublishing.org/doi/10.1098/rsta.2022.0041

[2] Melanie Mitchell. Six principles for evaluating cognitive capabilities in AI models. AI Magazine, 47(2):e70061. doi: 10.1002/aaai.70061

[4] Fabrizio Gilardi, Meysam Alizadeh, and Maël Kubli. ChatGPT outperforms crowd workers for text-annotation tasks. Proceedings of the National Academy of Sciences, 120(30):e2305016120, 2023. doi: 10.1073/pnas.2305016120

[5] Petter Törnberg. Large language models outperform expert coders and supervised classifiers at annotating political social media messages. Social Science Computer Review, 43(6):1181–1195, 2025. doi: 10.1177/08944393241286471

[6] Caleb Ziems, William Held, Omar Shaikh, Jiaao Chen, Zhehao Zhang, and Diyi Yang. Can large language models transform computational social science? Computational Linguistics, 50(1):237–291, 2024. doi: 10.1162/coli_a_00502

[7] Zackary Okun Dunivin. Scalable qualitative coding with LLMs: Chain-of-thought reasoning matches human performance in some hermeneutic tasks. arXiv preprint arXiv:2401.15170, 2024

[8] Steve Rathje, Dan-Mircea Mirea, Ilia Sucholutsky, Raja Marjieh, Claire E. Robertson, and Jay J. Van Bavel. GPT is an effective tool for multilingual psychological text analysis. Proceedings of the National Academy of Sciences, 121(34):e2308950121, 2024. doi: 10.1073/pnas.2308950121

[9] Michelene T. H. Chi, Paul J. Feltovich, and Robert Glaser. Categorization and Representation of Physics Problems by Experts and Novices. Cognitive Science, 5(2):121–152, 1981. doi: 10.1207/s15516709cog0502_2

[10] Emily M. Bender and Alexander Koller. Climbing towards NLU: On meaning, form, and understanding in the age of data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), pages 5185–5198, 2020. doi: 10.18653/v1/2020.acl-main.463

[11] Dedre Gentner. Structure-Mapping: A Theoretical Framework for Analogy. Cognitive Science, 7(2):155–170, 1983. doi: 10.1207/s15516709cog0702_3

[13] Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. Rethinking the Role of Demonstrations: What makes in-context learning work? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11048–11064, Abu Dhabi, United Arab Emirates, 2022. Association for Computation... doi: 10.18653/v1/2022.emnlp-main.759

[14] Samuel B. Bacharach. Organizational theories: Some Criteria for Evaluation. Academy of Management Review, 14(4):496–515, 1989. doi: 10.5465/amr.1989.4308374

[15] Kenneth MacCorquodale and Paul E. Meehl. On a distinction between hypothetical constructs and intervening variables. Psychological Review, 55(2):95–107, 1948. doi: 10.1037/h0056029

[16] Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A. Wichmann. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2(11):665–673, 2020. doi: 10.1038/s42256-020-00257-z

[17] R. Thomas McCoy, Ellie Pavlick, and Tal Linzen. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3428–3448, 2019. doi: 10.18653/v1/P19-1334

[18] Julian Ashwin, Aditya Chhabra, and Vijayendra Rao. Using large language models for qualitative analysis can introduce serious bias. Sociological Methods & Research, 2025. ISSN 0049-1241. doi: 10.1177/00491241251338246

[19] Sandra C. Matz, Heinrich Peters, Moran Cerf, Eric Grunenberg, Paul W. Eastwick, Mitja D. Back, and Eli J. Finkel. Large language models can detect verbal indicators of romantic attraction. Scientific Reports, 2026. doi: 10.1038/s41598-026-52308-x

[20] Jesse Graham, Jonathan Haidt, Sena Koleva, Matt Motyl, Ravi Iyer, Sean P. Wojcik, and Peter H. Ditto. Moral foundations theory: The pragmatic validity of moral pluralism. In Patricia Devine and Ashby Plant, editors, Advances in Experimental Social Psychology, volume 47, pages 55–130. Academic Press, 2013. doi: 10.1016/B978-0-12-407236-7.00002-4

[21] Mohammad Atari, Jonathan Haidt, Jesse Graham, Sena Koleva, Sean T. Stevens, and Morteza Dehghani. Morality beyond the WEIRD: How the nomological network of morality varies across cultures. Journal of Personality and Social Psychology, 125(5):1157–1188, 2023. doi: 10.1037/pspp0000470

[22] Luana Bulla, Stefano De Giorgis, Misael Mongiovì, and Aldo Gangemi. Large language models meet moral values: A comprehensive assessment of moral abilities. Computers in Human Behavior Reports, 17:100609, 2025. doi: 10.1016/j.chbr.2025.100609

[24] Suhaib Abdurahman, Mohammad Atari, Farzan Karimi-Malekabadi, Mona J. Xue, Jackson Trager, Peter S. Park, Preni Golazizian, Ali Omrani, and Morteza Dehghani. Perils and opportunities in using large language models in psychological research. PNAS Nexus, 3(7):pgae245, 2024. doi: 10.1093/pnasnexus/pgae245

[25] Samuel E. Bestvater and Burt L. Monroe. Sentiment is not stance: Target-aware opinion classification for political text analysis. Political Analysis, 31(2):235–256, 2023. doi: 10.1017/pan.2022.10

[26] John Rupert Firth. A synopsis of linguistic theory, 1930–1955. In Studies in Linguistic Analysis, pages 1–32. Blackwell, Oxford, 1957

[27] Kyle Mahowald, Anna A. Ivanova, Idan A. Blank, Nancy Kanwisher, Joshua B. Tenenbaum, and Evelina Fedorenko. Dissociating language and thought in large language models. Trends in Cognitive Sciences, 28(6):517–540, 2024. ISSN 1364-6613. doi: 10.1016/j.tics.2024.01.011. URL https://www.sciencedirect.com/science/article/pii/S1364661324000275

[28] Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (FAccT), pages 610–623, 2021. doi: 10.1145/3442188.3445922

[29] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022

[30] Zhicheng Lin. A validity-guided workflow for robust large language model research in psychology. arXiv preprint arXiv:2507.04491, 2025. doi: 10.31234/osf.io/xw98v

[31] Andrew Halterman and Katherine A. Keith. Codebook LLMs: Evaluating LLMs as measurement tools for political science concepts. Political Analysis, 2025. doi: 10.1017/pan.2025.10017

[32] Alina Herderich, Jana Lasser, Mirta Galesic, Segun Taofeek Aroyehun, David Garcia, and Joshua Garland. Measuring complex constructs in large-scale text with computational social mixed methods. PsyArXiv, 2025. doi: 10.31234/osf.io/tzc9p. URL https://doi.org/10.31234/osf.io/tzc9p

[33] Lee J. Cronbach and Paul E. Meehl. Construct validity in psychological tests. Psychological Bulletin, 52(4):281–302, 1955. doi: 10.1037/h0040957

[34] Randall Davis, Howard Shrobe, and Peter Szolovits. What is a knowledge representation? AI Magazine, 14(1):17–33, 1993. doi: 10.1609/aimag.v14i1.1029

[35] Samuel Messick. Validity of psychological assessment: Validation of inferences from persons' responses and performances as scientific inquiry into score meaning. American Psychologist, 50(9):741–749, 1995. doi: 10.1037/0003-066X.50.9.741

[36] Timo Freiesleben. Establishing construct validity in LLM capability benchmarks requires nomological networks. arXiv preprint arXiv:2603.15121, 2026. doi: 10.48550/arXiv.2603.15121. URL https://arxiv.org/abs/2603.15121

[37] Michael T. Kane. Validating the interpretations and uses of test scores. Journal of Educational Measurement, 50(1):1–73, 2013. doi: 10.1111/jedm.12000

[38] Zhicheng Lin. From prompts to constructs: A dual-validity framework for LLM research in psychology. arXiv preprint arXiv:2506.16697, 2025. doi: 10.48550/arXiv.2506.16697

[39] Andrew M. Bean, Ryan Othniel Kearns, Angelika Romanou, et al. Measuring what matters: Construct validity in large language model benchmarks. In Advances in Neural Information Processing Systems 38 (NeurIPS 2025), Datasets and Benchmarks Track, 2025

[40] Christopher Barrie, Lisa P. Argyle, James Bisbee, Michael Heseltine, Christopher Lucas, Jon Mellon, Alexis Palmer, Margaret Roberts, and Arthur Spirling. AI and research methods. APSA Preprints (Cambridge Open Engage), 2026. doi: 10.33774/apsa-2026-h59kk. URL https://doi.org/10.33774/apsa-2026-h59kk

[41] Christopher Barrie, Panagiota Palaiologou, and Petter Törnberg. Prompt stability scoring for text annotation with large language models. arXiv preprint arXiv:2407.02039, 2024

[42] Kon Woo Kim, Rezarta Islamaj, Jin-Dong Kim, Florian Boudin, and Akiko Aizawa. Repurposing annotation guidelines to instruct LLM annotators: A case study. In International Conference on Applications of Natural Language to Information Systems (NLDB 2025), Lecture Notes in Computer Science, pages 140–151. Springer, 2025

[43] Seyedali Mohammadi, Bhaskara Hanuma Vedula, Hemank Lamba, Edward Raff, Ponnurangam Kumaraguru, Francis Ferraro, and Manas Gaur. Do LLMs adhere to label definitions? Examining their receptivity to external label definitions. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 32380–32393, 2025. doi: 10.18653/v1/2025.emnlp-main.1648

[44] Oscar Sainz, Iker García-Ferrero, Rodrigo Agerri, Oier Lopez de Lacalle, German Rigau, and Eneko Agirre. GoLLIE: Annotation guidelines improve zero-shot information-extraction. In Proceedings of the Twelfth International Conference on Learning Representations (ICLR 2024), 2024

[45] Fan Yin, Jesse Vig, Philippe Laban, Shafiq Joty, Caiming Xiong, and Chien-Sheng Jason Wu. Did you read the instructions? Rethinking the effectiveness of task definitions in instruction learning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023), 2023

[46] Ziang Xiao, Xingdi Yuan, Q. Vera Liao, Rania Abdelghani, and Pierre-Yves Oudeyer. Supporting qualitative analysis with large language models: Combining codebook with GPT-3 for deductive coding. In Companion Proceedings of the 28th International Conference on Intelligent User Interfaces (IUI '23), pages 75–78, 2023. doi: 10.1145/3581754.3584136

[47] Andrew Zamai, Andrea Zugarini, Leonardo Rigutini, Marco Ernandes, and Marco Maggini. Show less, instruct more: Enriching prompts with definitions and guidelines for zero-shot NER. arXiv preprint arXiv:2407.01272, 2024

[48] Edgar Dubourg, Valentin Thouzeau, and Nicolas Baumard. A step-by-step method for cultural annotation by LLMs. Frontiers in Artificial Intelligence, 7:1365508, 2024

[49] Nicholas Pangakis, Samuel Wolken, and Neil Fasching. Automated annotation with generative AI requires validation. arXiv preprint arXiv:2306.00176, 2023

[50] Dorottya Demszky, Diyi Yang, David S. Yeager, Christopher J. Bryan, Margarett Clapper, Susannah Chandhok, Johannes C. Eichstaedt, Cameron Hecht, Jeremy Jamieson, Molly Johnson, Michaela Jones, Desmond Krettek-Cobb, Leslie Lai, Nirel Jones-Mitchell, Desmond C. Ong, Carol S. Dweck, James J. Gross, and James W. Pennebaker. Using large language models in psy... 2023. doi: 10.1038/s44159-023-00241-5

[51] Jocelyn Brickman, Mehak Gupta, and Joshua R. Oltmanns. Large language models for psychological assessment: A comprehensive overview. Advances in Methods and Practices in Psychological Science, 2025. doi: 10.1177/25152459251343582

[52] Alex Goddard and Alex Gillespie. The repeated adjustment of measurement protocols (RAMP) method for developing high-validity text classifiers. Psychological Methods, 2025. doi: 10.1037/met0000787

[53] Suhaib Abdurahman, Alireza Salkhordeh Ziabari, Alexander K. Moore, Daniel M. Bartels, and Morteza Dehghani. A primer for evaluating large language models in social-science research. Advances in Methods and Practices in Psychological Science, 2025. doi: 10.1177/25152459251325174

[54] Lukas Birkenmaier, Claudia Wagner, and Clemens Lechner. ValiText: A unified validation framework for computational text-based measures of social constructs. arXiv preprint, 2023. URL https://arxiv.org/abs/2307.02863

[55] Cristian Espinal Maya. Measuring what cannot be surveyed: LLMs as instruments for latent cognitive variables in labor economics. arXiv preprint arXiv:2604.02403, 2026. doi: 10.48550/arXiv.2604.02403. URL https://arxiv.org/abs/2604.02403

[56] Chenyu Hou, Gaoxia Zhu, Lishan Zheng, Xiaoshan Huang, Tianlong Zhong, Hanxiang Li, Han Du, and Chin Lee Ker. Prompt-based and fine-tuned GPT models for context-dependent and -independent deductive coding in social annotation. In Proceedings of the 14th Learning Analytics and Knowledge Conference, pages 518–528, 2024

[57] Sandrine Chausson, Marion Fourcade, David J. Harding, Björn Ross, and Grégory Renard. The insight-inference loop: Efficient text classification via natural language inference and threshold-tuning. Sociological Methods & Research, 55(2):568–615, 2026. doi: 10.1177/00491241251326819

[58] Zhen Xu, Vedant Khatri, Yijun Dai, Xiner Liu, Siyan Li, Xuanming Zhang, and Renzhe Yu. Enhancing LLM-based data annotation with error decomposition. In Proceedings of the International Conference on Learning Analytics and Knowledge (LAK '26). ACM, 2026. doi: 10.1145/3785022.3785070. URL https://arxiv.org/abs/2601.11920

[59] Kathy Charmaz. Constructing Grounded Theory. Introducing Qualitative Methods. Sage, London, 2nd edition, 2014

[60] Claas Beger, Ryan Yi, Shuhao Fu, Kaleda Denton, Arseny Moskvichev, Sarah Tsai, Sivasankaran Rajamanickam, and Melanie Mitchell. Do AI models perform human-like abstract reasoning across modalities? arXiv preprint arXiv:2510.02125, 2025. doi: 10.48550/arXiv.2510.02125

[61] Naoki Egami, Musashi Hinck, Brandon M. Stewart, and Hanying Wei. Using imperfect surrogates for downstream inference: Design-based supervised learning for social science applications of large language models. In Advances in Neural Information Processing Systems, volume 36, 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/hash/d862f7f54...