
arXiv: 2606.28574 · v1 · pith:O4UEV6OCnew · submitted 2026-06-26 · 💻 cs.CL · cs.AI · cs.CY

Correct codes for the wrong reasons? Validating LLMs as measurement instruments for theoretical constructs

Pith reviewed 2026-06-30 00:37 UTC · model grok-4.3

classification 💻 cs.CL cs.AI cs.CY
keywords: construct validity, LLM coding, grain calibration, measurement instruments, theoretical constructs, reliability versus validity, natural language processing, social science measurement

The pith

An LLM may agree with human coders on a construct yet still fail to measure it according to its defining theory.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

When an LLM agrees with a human annotator on how to code a construct in text, that shows reliability but says nothing about whether the LLM is actually using the construct as the theory defines it. The model might arrive at the right code through some unrelated pattern. Grain calibration fixes this by splitting the construct into smaller, testable parts at the clause level, checking each part against the text with direct evidence, and then applying a clear rule to combine those checks. This makes the reasoning visible and tied to the theory rather than hidden in the model's output. A sympathetic reader would care because it turns validation into a check on whether the instrument truly measures what the theory intends, not just whether it matches human labels.

Core claim

Current methods validate LLMs as measurement instruments only by their agreement with human annotators, which establishes reliability but leaves construct validity unexamined. An LLM may produce the correct code through a correlate that satisfies none of the theory's requirements. Grain calibration closes this gap by decomposing the construct into clause-level components, testing each with extractive evidence from the text, and combining the results via an explicit rule derived from the theory. Because the rule is stated explicitly, the process itself becomes evidence of whether the instrument runs on the specified construct.
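The gap between agreement and validity can be illustrated with a toy example. The sketch below is not from the paper: the construct (a felt obligation to reciprocate), the six sample texts, their human labels, and the `shortcut_coder` are all hypothetical. It shows a coder that keys on a correlate (gratitude words) reaching perfect chance-corrected agreement on a sample, while coding a contrast text in a way the assumed theory would reject.

```python
from collections import Counter


def cohens_kappa(a, b):
    """Chance-corrected agreement between two label sequences."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum((count_a[label] / n) * (count_b[label] / n)
                   for label in set(a) | set(b))
    return (observed - expected) / (1 - expected)


def shortcut_coder(text):
    """Hypothetical theory-naive coder: fires on gratitude words only."""
    return "thank" in text.lower()


# Hypothetical sample: (text, human code for "felt obligation to reciprocate").
SAMPLE = [
    ("Thank you for the loan, I owe you one.", True),
    ("Thanks for covering my shift, I will repay the favor.", True),
    ("I am thankful and I should return the help.", True),
    ("The meeting starts at noon.", False),
    ("He moved to a new city.", False),
    ("The report is due on Friday.", False),
]

human = [gold for _, gold in SAMPLE]
model = [shortcut_coder(text) for text, _ in SAMPLE]
kappa = cohens_kappa(human, model)
print("kappa on sample:", kappa)

# Contrast text: gratitude is expressed, obligation is explicitly denied.
CONTRAST = "Thanks, but I owe you nothing."
print("shortcut codes contrast text as:", shortcut_coder(CONTRAST))
```

In this toy sample the correlate and the construct always co-occur, so agreement alone cannot separate them; only a text that pulls them apart exposes the shortcut.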

What carries the argument

Grain calibration: a method that decomposes a construct into clause-level components, tests each against the text with extractive evidence, and combines results through an explicit, theory-derived rule.
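The three steps named above (decompose, test with extractive evidence, combine by an explicit rule) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the abstract gives no code, the two components and their cue phrases are hypothetical, `cue_finder` is a toy stand-in for an LLM call that returns a quoted span, and the all-of rule is one assumed example of a theory-derived rule.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Component:
    """One clause-level requirement of the construct (hypothetical)."""
    name: str
    find_evidence: Callable[[str], Optional[str]]


@dataclass(frozen=True)
class ComponentResult:
    name: str
    satisfied: bool
    evidence: Optional[str]


def cue_finder(cues: Sequence[str]) -> Callable[[str], Optional[str]]:
    """Toy stand-in for an LLM call: return the first cue phrase found."""
    def find(clause: str) -> Optional[str]:
        lowered = clause.lower()
        for cue in cues:
            start = lowered.find(cue)
            if start != -1:
                return clause[start:start + len(cue)]
        return None
    return find


def check_component(component: Component, clause: str) -> ComponentResult:
    span = component.find_evidence(clause)
    # Extractive check: accept evidence only if it is quoted verbatim.
    ok = span is not None and span in clause
    return ComponentResult(component.name, ok, span if ok else None)


def all_of_rule(results: Sequence[ComponentResult]) -> bool:
    """Explicit combination rule (assumed): every component must hold."""
    return all(r.satisfied for r in results)


def grain_calibrate(clause, components, rule):
    """Return the code plus the per-component trace that produced it."""
    results = [check_component(c, clause) for c in components]
    return rule(results), results


# Hypothetical decomposition of a "felt obligation to reciprocate" construct.
COMPONENTS = [
    Component("benefit_received", cue_finder(["helped me", "gave me"])),
    Component("felt_obligation", cue_finder(["i owe", "i should repay"])),
]

code, results = grain_calibrate(
    "She helped me move, so I owe her a favor.", COMPONENTS, all_of_rule)
for r in results:
    print(r.name, r.satisfied, repr(r.evidence))
print("coded:", code)
```

Returning the per-component trace alongside the code is the design choice that matters here: the trace, not the final label, is what a reader can inspect against the theory.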

If this is right

  • Validation can distinguish theory-aligned measurement from correct codes reached via wrong reasons.
  • When a code is wrong, it identifies whether a component was missed or an adjacent construct was mistaken for it.
  • The explicit rule provides evidence about the measurement process rather than only the final output.
  • This applies to constructs that admit clause-level decomposition and evidence extraction.
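One way to read the second point above is as a small diagnostic over the component trace. The sketch below is an editorial guess at how such a diagnosis could be organized, not a procedure taken from the paper: the function `diagnose`, the component names, and the adjacent construct "gratitude" are hypothetical, and an all-of combination rule is assumed.

```python
def diagnose(target_results, adjacent_results, predicted, gold):
    """Classify a coding outcome from component-level results.

    target_results: dict mapping component name -> bool for the target construct.
    adjacent_results: dict mapping adjacent construct name -> dict of its
        component results. Both structures are hypothetical.
    """
    if predicted == gold:
        return "correct"
    if gold and not predicted:
        # False negative under an all-of rule: some component was not found.
        missed = [name for name, ok in target_results.items() if not ok]
        return "missed component(s): " + ", ".join(missed)
    # False positive: check whether a neighbouring construct fully matches.
    for name, components in adjacent_results.items():
        if components and all(components.values()):
            return "adjacent construct mistaken for target: " + name
    return "unexplained false positive"


print(diagnose({"benefit_received": True, "felt_obligation": False},
               {}, predicted=False, gold=True))
print(diagnose({"benefit_received": True, "felt_obligation": True},
               {"gratitude": {"thanks_expressed": True}},
               predicted=True, gold=False))
```

The point of the sketch is only that an explicit rule leaves a trace from which an error can be assigned a type; an opaque single-pass code leaves nothing comparable to inspect.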

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Large-scale theoretical analysis could become feasible if grain calibration provides a transparent layer for automated coding at lower cost than full human annotation.
  • Social science theories may need more precise clause-level specifications to support this form of validation.
  • The method could be used to compare how different LLMs align with the same theoretical construct across datasets.
  • Similar decomposition and rule-based checks might apply to other AI systems used for measurement tasks.

Load-bearing premise

That a construct can be decomposed into clause-level components whose independent tests, when combined by an explicit rule, fully capture the demands the construct's theory makes without introducing new mismatches or biases.

What would settle it

A case where an LLM passes grain calibration on all components and the rule but produces codes that violate a core prediction of the theory on new texts, or where experts cannot agree on a clause-level decomposition that matches the theory.

Original abstract

When a large language model (LLM) codes a construct in text as a human annotator would, that agreement makes the LLM a reliable coder. Yet reliability leaves construct validity untouched. The instrument may be theory-naive, reaching the code through a correlate that meets none of the demands the construct's theory makes, and no current method tells that apart from genuine measurement. We propose grain calibration as a method that closes the gap. It decomposes a construct into clause-level components, tests each against the text with extractive evidence, and combines the results through an explicit, theory-derived rule. Because the rule is stated rather than lodged in one opaque pass, its structure is evidence about the process rather than the output. It shows which components settled a code, and, when the code is wrong, whether a component was missed or an adjacent construct mistaken for it. Validation shifts from scoring an instrument's outputs against an annotator to showing that the instrument runs on the construct its theory specifies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper argues that agreement between LLMs and human annotators on coding theoretical constructs establishes only reliability, not construct validity, because the LLM may reach the code via a correlate rather than the theory-specified process. It proposes 'grain calibration' as a solution: decompose the construct into clause-level components, test each component against the text using extractive evidence, and combine the results via an explicit, theory-derived rule. This makes the decision process inspectable, revealing which components drove the code and whether errors stem from missed components or confusion with adjacent constructs. Validation thereby shifts from output agreement to evidence that the instrument operates on the intended construct.

Significance. If operationalized, the approach would address a genuine gap in using LLMs for measurement in fields that rely on text-coded constructs, by providing process-level rather than purely correlational validation. It explicitly credits the shift from scoring outputs against annotators to demonstrating theory-aligned operation. No machine-checked proofs or reproducible code are present, but the proposal is falsifiable in principle through tests of alternative decompositions.

major comments (1)
  1. [Abstract] Abstract, paragraph on grain calibration: the central claim requires that clause-level decomposition plus an explicit combination rule fully reproduces the theory's demands without new mismatches or biases, yet the manuscript provides no procedure for deriving or validating the decomposition from the theory itself. If two theory-informed decompositions of the same construct produce different component sets or rules, the method cannot adjudicate which (if either) measures the intended construct; this is load-bearing for the claim that grain calibration closes the validity gap.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for this precise observation on the abstract. We address the point directly below.

Point-by-point responses
  1. Referee: [Abstract] Abstract, paragraph on grain calibration: the central claim requires that clause-level decomposition plus an explicit combination rule fully reproduces the theory's demands without new mismatches or biases, yet the manuscript provides no procedure for deriving or validating the decomposition from the theory itself. If two theory-informed decompositions of the same construct produce different component sets or rules, the method cannot adjudicate which (if either) measures the intended construct; this is load-bearing for the claim that grain calibration closes the validity gap.

    Authors: Grain calibration treats decomposition as a prior theoretical step, not an output of the method. Researchers derive clause-level components and the combination rule directly from the construct's formal definition in the source theory, exactly as they do when operationalizing any construct for manual coding. The method then supplies extractive tests and an explicit rule so that an LLM's application of those components can be inspected. Different decompositions are adjudicated by their fidelity to the original theory, not by grain calibration itself; the calibration procedure simply makes visible whether the LLM follows the chosen components or substitutes a correlate. This division of labor is standard in measurement theory and does not weaken the claim that grain calibration moves validation from output agreement to process alignment once the components are specified. revision: no
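The disagreement between referee and authors can be made concrete with a toy comparison. The sketch below is illustrative only: both decompositions, their cue phrases, and the texts are hypothetical, and simple cue matching stands in for LLM-based component tests. It lists the texts on which two decompositions of the same construct yield different codes, which are the cases that, on the authors' account, must be settled by appeal to the source theory rather than by the calibration procedure.

```python
def code_with(decomposition, text):
    """Apply an all-of rule: every component needs at least one cue in the text.

    decomposition: dict mapping component name -> list of lowercase cue phrases.
    """
    lowered = text.lower()
    return all(any(cue in lowered for cue in cues)
               for cues in decomposition.values())


def disagreements(dec_a, dec_b, texts):
    """Texts that the two decompositions code differently."""
    return [t for t in texts if code_with(dec_a, t) != code_with(dec_b, t)]


# Two hypothetical readings of the same construct.
DEC_A = {"benefit_received": ["helped me"], "felt_obligation": ["i owe"]}
DEC_B = {"benefit_received": ["helped me"]}

TEXTS = [
    "She helped me move, so I owe her a favor.",
    "She helped me move last week.",
    "The meeting starts at noon.",
]

contested = disagreements(DEC_A, DEC_B, TEXTS)
print(contested)
```

Nothing in this comparison says which decomposition is faithful to the theory; it only locates where the choice changes the measurement.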

Circularity Check

0 steps flagged

No circularity in derivation chain

Full rationale

The paper proposes grain calibration as a methodological framework for validating LLM coding of theoretical constructs. It describes decomposition into clause-level components, extractive testing, and explicit combination rules without any equations, fitted parameters, predictions derived from inputs, or load-bearing self-citations. The central claim concerns a shift in validation approach and does not reduce any result to its own inputs by construction; the method is presented as external to model internals and independent of the constructs being measured.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

Abstract-only; the proposal rests on the domain assumption that constructs admit clause-level decomposition and explicit rule combination without loss of theoretical content.

axioms (2)
  • domain assumption: Theoretical constructs can be decomposed into clause-level components that can be tested independently against text.
    Core premise of grain calibration described in the abstract.
  • domain assumption: An explicit theory-derived rule can combine component results without introducing new construct mismatches.
    Required for the method to provide evidence about the measurement process.
invented entities (1)
  • grain calibration (no independent evidence)
    purpose: Method to validate LLM coding of constructs via explicit component testing.
    Newly proposed procedure in the abstract.

pith-pipeline@v0.9.1-grok · 5699 in / 1237 out tokens · 32752 ms · 2026-06-30T00:37:11.924014+00:00 · methodology

