
arxiv: 2509.13324 · v3 · submitted 2025-08-17 · 💻 cs.HC

Designing Psychometric Bias Measures for ChatBots: An Application to Racial Bias Measurement

Pith reviewed 2026-05-18 22:08 UTC · model grok-4.3

classification 💻 cs.HC
keywords psychometric measures · LLM bias · racial bias measurement · chatbot evaluation · STAMP-LLM · bias assessment protocol · implicit bias tests

The pith

A two-phase protocol adapts human psychometric standards to create measures for racial bias in chatbot responses.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces STAMP-LLM as a structured framework for designing bias tests for large language models. It splits the process into a first phase that maps psychological constructs to test items and reviews them with experts, then a second phase that controls prompts, generates samples, scores responses, and checks basic reliability. The authors demonstrate it with one explicit and two implicit measures of racial bias. This matters because chatbots now influence decisions in hiring, lending, and advice; without such tools it is hard to know whether they reproduce or amplify existing social biases.

Core claim

STAMP-LLM is a psychometric-based two-phase framework for constructing measures of chatbot bias: the Definitional phase handles construct mapping, item development, and expert review, while the Data/Analysis phase manages prompt control, automated sampling, pre-specified scoring, and initial reliability and validity checks. The framework is illustrated by applying it to racial bias using one explicit and two implicit measures.
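
The two phases can be pictured as a small data pipeline. The sketch below is illustrative only: the class names, the `fake_chatbot` stub, the decoding settings, and the scoring rule are all assumptions made for this example, not the authors' implementation.

```python
from dataclasses import dataclass
from statistics import mean

# Phase 1 (Definitional): each item is tied to a construct and gated by expert review.
@dataclass(frozen=True)
class Item:
    item_id: str
    construct: str         # hypothetical construct label
    prompt: str
    expert_approved: bool  # outcome of the expert-review step

# Phase 2 (Data/Analysis): decoding settings are fixed before any data are collected.
@dataclass(frozen=True)
class Protocol:
    temperature: float
    n_samples: int         # repeated samples drawn per item

def run_protocol(items, protocol, chatbot, score):
    """Sample every approved item n_samples times and average the item scores."""
    results = {}
    for item in items:
        if not item.expert_approved:
            continue  # unreviewed items never reach data collection
        replies = [chatbot(item.prompt, protocol.temperature)
                   for _ in range(protocol.n_samples)]
        results[item.item_id] = mean(score(reply) for reply in replies)
    return results

# Stand-ins for a real model call and a real pre-specified scoring rule.
def fake_chatbot(prompt, temperature):
    return "2"  # always answers "2" on a hypothetical 1-5 scale

def likert_score(reply):
    return int(reply.strip())

items = [
    Item("e1", "explicit racial bias", "<expert-reviewed item text>", True),
    Item("e2", "explicit racial bias", "<item still awaiting review>", False),
]
scores = run_protocol(items, Protocol(temperature=0.0, n_samples=3),
                      fake_chatbot, likert_score)
print(scores)
```

Keeping the protocol object immutable and separate from the item bank mirrors the split between the Definitional and Data/Analysis phases: items can be revised under expert review without touching the sampling settings, and vice versa.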

What carries the argument

STAMP-LLM, the two-phase framework that first defines the target construct and items through expert review and then controls data collection and scoring to produce standardized bias scores.

If this is right

  • Developers can create additional explicit and implicit bias measures for other social categories using the same two-phase structure.
  • High-stakes applications such as hiring assistants or loan chatbots can be subjected to pre-deployment psychometric testing.
  • Standardized bias scores become possible across different models once the protocol is followed.
  • The approach supplies a template for moving from ad-hoc bias prompts to replicable measurement instruments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same protocol could be extended to measure other forms of bias such as gender or political leanings without major redesign.
  • Validation against downstream real-world effects, such as whether high-scoring chatbots actually change user decisions, would strengthen the framework.
  • The method highlights the need to study how prompt engineering or model size alters the resulting bias scores.

Load-bearing premise

Standard principles of human psychometric test construction can be transferred directly to model-generated text without separate proof that the text corresponds to the intended psychological constructs.

What would settle it

An experiment in which scores produced by the STAMP-LLM measures show no correlation with independent human ratings of biased content in the same chatbot outputs or fail basic test-retest reliability checks.
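
Both checks named here reduce to a rank correlation between two score lists: test versus retest, or measure scores versus independent human ratings. The following is a minimal standard-library sketch of Spearman's coefficient; the score lists are invented for illustration and are not data from the paper.

```python
from math import sqrt

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; assumes neither list is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman's rho is the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))

# Invented scores for five items, administered twice under the same protocol.
test_scores = [1.0, 2.5, 2.0, 4.0, 3.5]
retest_scores = [1.2, 2.4, 2.6, 3.9, 3.7]
rho = spearman(test_scores, retest_scores)
print(round(rho, 3))
```

The same `spearman` call applies unchanged when the second list holds human ratings of the same outputs. A coefficient near zero in either comparison is the kind of result that would count against the measures.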

Figures

Figures reproduced from arXiv: 2509.13324 by Mouhacine Benosman.

Figure 1
Figure 1. Sample of validity tests.

Measure             Spearman correlation coefficient   p-value   Bivariate normality   Test normality   Retest normality   Reliability interpretation
Explicit measure    0.855                              <0.001    No                    No               No                 High test-retest reliability
Implicit measure 1  1                                  <0.001    No                    No               No                 High test-retest reliability
Implicit measure 2  0.997                              <0.001    No                    No               No                 High test-retest reliability
Original abstract

Artificial intelligence (AI), particularly in the form of large language models (LLMs) or chatbots, has become increasingly integrated into our daily lives. In the past five years, several LLMs have been introduced, including ChatGPT by OpenAI, Claude by Anthropic, and Llama by Meta, among others. These models have the potential to be employed across a wide range of human-machine interaction applications, such as chatbots for information retrieval, assistance in corporate hiring decisions, college admissions, financial loan approvals, parole determinations, and even in medical fields like psychotherapy delivered through chatbots. The key question is whether these chatbots will interact with humans in a bias-free manner or if they will further reinforce the existing pathological biases present in human-to-human interactions. If the latter is true, then how can we rigorously measure these biases? We address this challenge by introducing STAMP-LLM (Standardized Test and Assessment Measurement Protocol for LLMs), a psychometric-based principled two-phase framework for designing psychometric measures to evaluate chatbot biases: (i) a Definitional phase for construct mapping, item development, and expert review; and (ii) a Data/Analysis phase for protocol control (prompts/decoding), automated sampling, pre-specified scoring, and basic reliability/validity checks. We illustrate STAMP-LLM on racial bias using one explicit and two implicit measures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes STAMP-LLM, a psychometric-based two-phase framework for designing measures to evaluate biases in chatbots/LLMs. Phase 1 (Definitional) covers construct mapping, item development, and expert review; Phase 2 (Data/Analysis) covers prompt/decoding control, automated sampling, pre-specified scoring, and basic reliability/validity checks. The framework is illustrated by developing one explicit and two implicit measures for racial bias.

Significance. If the framework can be shown to produce valid and reliable bias scores on LLM outputs, it would provide a much-needed standardized protocol for bias measurement in human-AI interaction, especially given the use of chatbots in high-stakes domains. The explicit grounding in psychometric principles and the two-phase structure are constructive contributions.

major comments (2)
  1. [Abstract and Definitional phase] Abstract and § on the Definitional phase: the manuscript applies standard human-psychometric procedures for construct mapping and validity directly to LLM text outputs but supplies no additional argument or evidence establishing why generated text should be treated as functionally equivalent to human responses for the purpose of measuring psychological constructs such as racial bias. This unexamined correspondence assumption is load-bearing for the claim that the resulting measures validly reflect the target constructs.
  2. [Data/Analysis phase] Data/Analysis phase description: the phase is said to include 'pre-specified scoring' and 'basic reliability/validity checks,' yet the manuscript provides neither concrete scoring formulas, example item responses, inter-rater agreement statistics, nor any pilot data demonstrating that the checks can be performed on LLM outputs. Without these operational details the practicality and falsifiability of the protocol cannot be assessed.
minor comments (2)
  1. [Abstract] The abstract uses the phrase 'pathological biases'; a more neutral term such as 'existing societal biases' would improve precision and tone.
  2. [Introduction] A brief comparison table or diagram contrasting STAMP-LLM with prior ad-hoc bias prompts would help readers quickly grasp the claimed methodological advance.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments on our manuscript introducing the STAMP-LLM framework. We address each major comment below, indicating where we will revise the paper to incorporate the feedback while preserving the core contribution of the two-phase protocol.

Point-by-point responses
  1. Referee: [Abstract and Definitional phase] Abstract and § on the Definitional phase: the manuscript applies standard human-psychometric procedures for construct mapping and validity directly to LLM text outputs but supplies no additional argument or evidence establishing why generated text should be treated as functionally equivalent to human responses for the purpose of measuring psychological constructs such as racial bias. This unexamined correspondence assumption is load-bearing for the claim that the resulting measures validly reflect the target constructs.

    Authors: We appreciate the referee's identification of this foundational point. The STAMP-LLM framework treats LLM-generated text as the observable response from which bias constructs are inferred, analogous to scoring human responses in psychometric instruments; we do not claim equivalence of internal states but rather that output patterns can be measured using adapted, standardized procedures for expressed bias. To address the concern directly, we will add a new subsection in the Definitional phase that explicitly distinguishes measurement of output bias from attribution of human-like psychology to LLMs, supported by references to prior work on behavioral measurement in AI systems. This clarification will strengthen the rationale without altering the framework's structure.
    Revision: yes

  2. Referee: [Data/Analysis phase] Data/Analysis phase description: the phase is said to include 'pre-specified scoring' and 'basic reliability/validity checks,' yet the manuscript provides neither concrete scoring formulas, example item responses, inter-rater agreement statistics, nor any pilot data demonstrating that the checks can be performed on LLM outputs. Without these operational details the practicality and falsifiability of the protocol cannot be assessed.

    Authors: The referee is correct that the current manuscript presents the Data/Analysis phase at a protocol level with the racial bias measures as an illustration rather than a complete empirical demonstration. We will revise this section to include explicit scoring formulas for the explicit and implicit measures, sample LLM item responses with applied scores, and basic reliability/validity results drawn from our initial protocol applications. Because scoring in the illustrated measures is largely rule-based and automated, traditional inter-rater statistics are not directly applicable; we will clarify the relevant checks (e.g., test-retest consistency on repeated prompts) to improve falsifiability and practicality.
    Revision: yes
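
To make "rule-based and automated" scoring concrete, here is one way a pre-specified rule for an explicit, Likert-style measure might look. This is a hypothetical sketch: the 1-5 scale, the reverse-keying convention, and the handling of refusals are assumptions for illustration, since the manuscript as reviewed gives no scoring formulas.

```python
import re

SCALE_MIN, SCALE_MAX = 1, 5  # assumed 5-point agreement scale

def parse_rating(reply):
    """Return the first standalone integer that lies on the scale, else None.

    Deliberately naive: a reply that restates the scale ("on a 1-5 scale ...")
    would be misread, so a real protocol would also constrain the answer format.
    """
    for match in re.finditer(r"\b\d+\b", reply):
        value = int(match.group())
        if SCALE_MIN <= value <= SCALE_MAX:
            return value
    return None

def score_item(reply, reverse_keyed):
    """Apply the fixed rule; reverse-keyed items are flipped around the scale."""
    rating = parse_rating(reply)
    if rating is None:
        return None  # refusals and off-scale replies stay unscored, never guessed
    return SCALE_MAX + SCALE_MIN - rating if reverse_keyed else rating

def scale_score(replies):
    """Mean item score over scorable replies; replies are (text, reverse_keyed) pairs."""
    scored = [s for s in (score_item(text, rev) for text, rev in replies)
              if s is not None]
    return sum(scored) / len(scored) if scored else None

replies = [
    ("I would say 4.", False),
    ("2", True),                       # reverse-keyed: 2 becomes 4
    ("I cannot answer that.", False),  # refusal, left unscored
]
total = scale_score(replies)
print(total)
```

Because the rule is fixed in code before any responses are collected, two runs over the same replies always agree, which is why test-retest consistency across repeated prompts, rather than inter-rater agreement, is the relevant reliability check.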

Circularity Check

0 steps flagged

No circularity: STAMP-LLM is a proposed protocol, not a self-referential derivation

Full rationale

The paper introduces STAMP-LLM as a two-phase framework (Definitional phase for construct mapping/item development/expert review; Data/Analysis phase for sampling/scoring/reliability checks) and illustrates it on racial bias measures. No equations, fitted parameters, or numerical predictions appear that reduce to the framework's own inputs by construction. The central claim is a methodological proposal whose validity rests on future empirical application rather than internal self-definition or self-citation chains. The transfer of human psychometric principles to LLM text is an unexamined assumption (a correctness concern), but it does not create circularity because the paper does not claim to derive or prove that correspondence from within its own structure. The derivation chain is self-contained and externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the domain assumption that psychometric test-construction methods developed for humans transfer to language-model text outputs; no free parameters or new physical entities are introduced in the abstract.

axioms (1)
  • domain assumption: Psychometric principles of construct mapping, item development, expert review, and reliability/validity checks can be applied to chatbot responses.
    Invoked when the authors state that STAMP-LLM is a 'psychometric-based principled two-phase framework'



