Designing Psychometric Bias Measures for ChatBots: An Application to Racial Bias Measurement

Mouhacine Benosman

arxiv: 2509.13324 · v3 · submitted 2025-08-17 · 💻 cs.HC


Pith reviewed 2026-05-18 22:08 UTC · model grok-4.3

classification 💻 cs.HC

keywords psychometric measures · LLM bias · racial bias measurement · chatbot evaluation · STAMP-LLM · bias assessment protocol · implicit bias tests


The pith

A two-phase protocol adapts human psychometric standards to create measures for racial bias in chatbot responses.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces STAMP-LLM as a structured framework for designing bias tests for large language models. It splits the process into a first phase that maps psychological constructs to test items and reviews them with experts, then a second phase that controls prompts, generates samples, scores responses, and checks basic reliability. The authors demonstrate it with one explicit and two implicit measures of racial bias. A reader would care because chatbots now influence decisions in hiring, loans, and advice, so without such tools it is hard to know whether they reproduce or amplify existing social biases.

Core claim

STAMP-LLM is a psychometric-based two-phase framework for constructing measures of chatbot bias: the Definitional phase handles construct mapping, item development, and expert review, while the Data/Analysis phase manages prompt control, automated sampling, pre-specified scoring, and initial reliability and validity checks. The framework is illustrated by applying it to racial bias using one explicit and two implicit measures.
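One way to picture the two phases is as a pair of records: the Definitional phase produces a reviewed item set, and the Data/Analysis phase consumes it under fixed prompt and decoding settings. The sketch below is illustrative only. The class names (`MeasureSpec`, `DecodingConfig`, `RunPlan`), their fields, and the rule that unreviewed items block data collection are assumptions made for this example, not structures taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DecodingConfig:
    """Decoding settings held fixed across runs (hypothetical fields)."""
    temperature: float = 0.0
    max_tokens: int = 64
    seed: int = 0


@dataclass
class MeasureSpec:
    """Assumed output of the Definitional phase: construct, items, review status."""
    construct: str
    items: List[str]
    expert_reviewed: bool = False


@dataclass
class RunPlan:
    """Assumed input to the Data/Analysis phase: template, decoding, sample count."""
    spec: MeasureSpec
    prompt_template: str  # must contain the placeholder {item}
    decoding: DecodingConfig
    samples_per_item: int

    def prompts(self) -> List[str]:
        """Expand every item into `samples_per_item` identical prompts."""
        if not self.spec.expert_reviewed:
            # Assumption for this sketch: sampling is gated on expert review.
            raise ValueError("items must pass expert review before data collection")
        return [
            self.prompt_template.format(item=item)
            for item in self.spec.items
            for _ in range(self.samples_per_item)
        ]
```

Freezing the decoding settings in one object makes it harder for two runs of the same measure to differ silently, which is the point of the protocol-control step.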

What carries the argument

STAMP-LLM, the two-phase framework that first defines the target construct and items through expert review and then controls data collection and scoring to produce standardized bias scores.

If this is right

Developers can create additional explicit and implicit bias measures for other social categories using the same two-phase structure.
High-stakes applications such as hiring assistants or loan chatbots can be subjected to pre-deployment psychometric testing.
Standardized bias scores become possible across different models once the protocol is followed.
The approach supplies a template for moving from ad-hoc bias prompts to replicable measurement instruments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same protocol could be extended to measure other forms of bias such as gender or political leanings without major redesign.
Validation against downstream real-world effects, such as whether high-scoring chatbots actually change user decisions, would strengthen the framework.
The method highlights the need to study how prompt engineering or model size alters the resulting bias scores.

Load-bearing premise

Standard principles of human psychometric test construction can be transferred directly to model-generated text without separate proof that the text corresponds to the intended psychological constructs.

What would settle it

An experiment in which scores produced by the STAMP-LLM measures show no correlation with independent human ratings of biased content in the same chatbot outputs or fail basic test-retest reliability checks.

Figures

Figures reproduced from arXiv: 2509.13324 by Mouhacine Benosman.

**Figure 1.** Sample of validity tests.

| Measure | Spearman correlation coefficient | p-value | Bivariate normality | Test normality | Retest normality | Reliability interpretation |
|---|---|---|---|---|---|---|
| Explicit measure | 0.855 | <0.001 | No | No | No | High test-retest reliability |
| Implicit measure 1 | 1 | <0.001 | No | No | No | High test-retest reliability |
| Implicit measure 2 | 0.997 | <0.001 | No | No | No | High test-retest reliability |
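Figure 1 reports Spearman coefficients between test and retest scores, presumably because the normality checks fail and a rank-based statistic does not require normal data. As a minimal sketch of how such a coefficient can be computed, the code below ranks both score lists (averaging ranks for ties) and takes the Pearson correlation of the ranks. The function names and sample numbers are illustrative; this is not the author's analysis code, and it does not compute the p-value.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(test: Sequence[float], retest: Sequence[float]) -> float:
    """Spearman rank correlation between test and retest scores."""
    if len(test) != len(retest) or len(test) < 2:
        raise ValueError("need two equal-length sequences with at least 2 scores")
    return pearson(average_ranks(test), average_ranks(retest))


# Made-up scores for illustration only; not data from the paper.
rho = spearman([3.0, 4.5, 2.0, 5.0], [3.2, 4.4, 2.1, 4.9])
```

A coefficient near 1 means the retest preserves the ordering of the original scores, which is what the table labels high test-retest reliability.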

Original abstract

Artificial intelligence (AI), particularly in the form of large language models (LLMs) or chatbots, has become increasingly integrated into our daily lives. In the past five years, several LLMs have been introduced, including ChatGPT by OpenAI, Claude by Anthropic, and Llama by Meta, among others. These models have the potential to be employed across a wide range of human-machine interaction applications, such as chatbots for information retrieval, assistance in corporate hiring decisions, college admissions, financial loan approvals, parole determinations, and even in medical fields like psychotherapy delivered through chatbots. The key question is whether these chatbots will interact with humans in a bias-free manner or if they will further reinforce the existing pathological biases present in human-to-human interactions. If the latter is true, then how can we rigorously measure these biases? We address this challenge by introducing STAMP-LLM (Standardized Test and Assessment Measurement Protocol for LLMs), a psychometric-based principled two-phase framework for designing psychometric measures to evaluate chatbot biases: (i) a Definitional phase for construct mapping, item development, and expert review; and (ii) a Data/Analysis phase for protocol control (prompts/decoding), automated sampling, pre-specified scoring, and basic reliability/validity checks. We illustrate STAMP-LLM on racial bias using one explicit and two implicit measures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

STAMP-LLM packages standard psychometric steps into a named two-phase protocol for chatbot bias measurement but offers no validation data or justification for treating model text like human responses.


The paper introduces STAMP-LLM as a two-phase protocol for designing bias measures for chatbots based on psychometric principles. It focuses on racial bias with explicit and implicit measures as an illustration.

What is new is the explicit packaging of these steps under a single name and structure. The definitional phase covers construct mapping, item development, and expert review. The data and analysis phase handles prompt control, automated sampling, scoring, and basic checks for reliability and validity. This could help move bias evaluation away from one-off prompts toward something more repeatable.

The paper does a solid job outlining the steps in a logical sequence and referencing standard practices from test construction. That organization is useful for anyone trying to build auditing tools.

The soft spots are more significant. No empirical results are presented, so we have no numbers on how well the measures work or any inter-rater reliability. The central assumption that human psychometric validity concepts apply directly to LLM text outputs is not justified or tested. Without that, the framework's claims about measuring bias rest on an unexamined transfer.

This paper is for people in HCI or AI ethics who are developing standardized methods for bias assessment in deployed systems. It would be most valuable to readers who need a blueprint rather than ready-to-use metrics. It shows honest engagement with the measurement problem, so it deserves peer review to address the validation gaps and strengthen the justification.

Referee Report

2 major / 2 minor

Summary. The paper proposes STAMP-LLM, a psychometric-based two-phase framework for designing measures to evaluate biases in chatbots/LLMs. Phase 1 (Definitional) covers construct mapping, item development, and expert review; Phase 2 (Data/Analysis) covers prompt/decoding control, automated sampling, pre-specified scoring, and basic reliability/validity checks. The framework is illustrated by developing one explicit and two implicit measures for racial bias.

Significance. If the framework can be shown to produce valid and reliable bias scores on LLM outputs, it would provide a much-needed standardized protocol for bias measurement in human-AI interaction, especially given the use of chatbots in high-stakes domains. The explicit grounding in psychometric principles and the two-phase structure are constructive contributions.

major comments (2)

[Abstract and Definitional phase] Abstract and § on the Definitional phase: the manuscript applies standard human-psychometric procedures for construct mapping and validity directly to LLM text outputs but supplies no additional argument or evidence establishing why generated text should be treated as functionally equivalent to human responses for the purpose of measuring psychological constructs such as racial bias. This unexamined correspondence assumption is load-bearing for the claim that the resulting measures validly reflect the target constructs.
[Data/Analysis phase] Data/Analysis phase description: the phase is said to include 'pre-specified scoring' and 'basic reliability/validity checks,' yet the manuscript provides neither concrete scoring formulas, example item responses, inter-rater agreement statistics, nor any pilot data demonstrating that the checks can be performed on LLM outputs. Without these operational details the practicality and falsifiability of the protocol cannot be assessed.

minor comments (2)

[Abstract] The abstract uses the phrase 'pathological biases'; a more neutral term such as 'existing societal biases' would improve precision and tone.
[Introduction] A brief comparison table or diagram contrasting STAMP-LLM with prior ad-hoc bias prompts would help readers quickly grasp the claimed methodological advance.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments on our manuscript introducing the STAMP-LLM framework. We address each major comment below, indicating where we will revise the paper to incorporate the feedback while preserving the core contribution of the two-phase protocol.


Referee: [Abstract and Definitional phase] Abstract and § on the Definitional phase: the manuscript applies standard human-psychometric procedures for construct mapping and validity directly to LLM text outputs but supplies no additional argument or evidence establishing why generated text should be treated as functionally equivalent to human responses for the purpose of measuring psychological constructs such as racial bias. This unexamined correspondence assumption is load-bearing for the claim that the resulting measures validly reflect the target constructs.

Authors: We appreciate the referee's identification of this foundational point. The STAMP-LLM framework treats LLM-generated text as the observable response from which bias constructs are inferred, analogous to scoring human responses in psychometric instruments; we do not claim equivalence of internal states but rather that output patterns can be measured using adapted, standardized procedures for expressed bias. To address the concern directly, we will add a new subsection in the Definitional phase that explicitly distinguishes measurement of output bias from attribution of human-like psychology to LLMs, supported by references to prior work on behavioral measurement in AI systems. This clarification will strengthen the rationale without altering the framework's structure.
revision: yes

Referee: [Data/Analysis phase] Data/Analysis phase description: the phase is said to include 'pre-specified scoring' and 'basic reliability/validity checks,' yet the manuscript provides neither concrete scoring formulas, example item responses, inter-rater agreement statistics, nor any pilot data demonstrating that the checks can be performed on LLM outputs. Without these operational details the practicality and falsifiability of the protocol cannot be assessed.

Authors: The referee is correct that the current manuscript presents the Data/Analysis phase at a protocol level with the racial bias measures as an illustration rather than a complete empirical demonstration. We will revise this section to include explicit scoring formulas for the explicit and implicit measures, sample LLM item responses with applied scores, and basic reliability/validity results drawn from our initial protocol applications. Because scoring in the illustrated measures is largely rule-based and automated, traditional inter-rater statistics are not directly applicable; we will clarify the relevant checks (e.g., test-retest consistency on repeated prompts) to improve falsifiability and practicality.
revision: yes
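To make "rule-based and automated" scoring concrete, here is a minimal sketch of what a pre-specified scoring rule could look like for Likert-style items. Everything in it is an assumption made for illustration: the 5-point scale, the convention of taking the first in-range integer in the response, the reverse-keying rule, and the function names. The paper's actual scoring formulas are not given in this review.

```python
import re
from typing import Iterable, Optional, Tuple

LIKERT_MIN, LIKERT_MAX = 1, 5  # assumed 5-point scale, not taken from the paper


def parse_likert(response: str) -> Optional[int]:
    """Return the first whole number within the scale range, or None if absent."""
    for match in re.finditer(r"(?<!\d)(\d+)(?!\d)", response):
        value = int(match.group(1))
        if LIKERT_MIN <= value <= LIKERT_MAX:
            return value
    return None


def score_item(response: str, reverse_keyed: bool = False) -> Optional[int]:
    """Score one response; reverse-keyed items are flipped around the scale midpoint."""
    value = parse_likert(response)
    if value is None:
        return None  # refusal or unparseable answer stays unscored
    return LIKERT_MIN + LIKERT_MAX - value if reverse_keyed else value


def scale_score(responses: Iterable[Tuple[str, bool]]) -> float:
    """Mean item score over (response, reverse_keyed) pairs, skipping unscored items."""
    scores = [score_item(text, keyed) for text, keyed in responses]
    valid = [s for s in scores if s is not None]
    if not valid:
        raise ValueError("no scorable responses")
    return sum(valid) / len(valid)
```

Because a rule like this is deterministic, two scorers applying it to the same transcript cannot disagree, which is consistent with the authors' point that inter-rater statistics are not directly applicable. Variation then comes from the model's repeated responses, so test-retest consistency becomes the relevant check.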

Circularity Check

0 steps flagged

No circularity: STAMP-LLM is a proposed protocol, not a self-referential derivation


The paper introduces STAMP-LLM as a two-phase framework (Definitional phase for construct mapping/item development/expert review; Data/Analysis phase for sampling/scoring/reliability checks) and illustrates it on racial bias measures. No equations, fitted parameters, or numerical predictions appear that reduce to the framework's own inputs by construction. The central claim is a methodological proposal whose validity rests on future empirical application rather than internal self-definition or self-citation chains. The transfer of human psychometric principles to LLM text is an unexamined assumption (a correctness concern), but it does not create circularity because the paper does not claim to derive or prove that correspondence from within its own structure. The derivation chain is self-contained and externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework rests on the domain assumption that psychometric test-construction methods developed for humans transfer to language-model text outputs; no free parameters or new physical entities are introduced in the abstract.

axioms (1)

Domain assumption: Psychometric principles of construct mapping, item development, expert review, and reliability/validity checks can be applied to chatbot responses.
Invoked when the authors state that STAMP-LLM is a 'psychometric-based principled two-phase framework'.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: We address this challenge by introducing STAMP-LLM ... a psychometric-based principled two-phase framework for designing psychometric measures to evaluate chatbot biases: (i) a Definitional phase for construct mapping, item development, and expert review; and (ii) a Data/Analysis phase for protocol control ...

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: We build on the definition of algorithmic bias ... to define chatbot racial bias as ‘systematic and repeatable errors in a chatbot’s responses...’

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

