pith. machine review for the scientific record.

arxiv: 2603.23682 · v2 · submitted 2026-03-24 · 💻 cs.HC · cs.AI

Recognition: no theorem link

Assessment Design in the AI Era: A Method for Identifying Items Functioning Differentially for Humans and Chatbots

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 00:13 UTC · model grok-4.3

classification 💻 cs.HC cs.AI
keywords differential item functioning · assessment design · large language models · AI in education · psychometrics · test validity · chatbot evaluation · educational data mining

The pith

A statistical method adapted from bias detection can flag test questions where humans and chatbots systematically differ in performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces an approach that applies differential item functioning analysis, long used to spot bias across human groups, to compare response patterns between students and leading large language models on the same items. By combining this with negative control checks and discrimination analysis, the method identifies specific questions where chatbots over- or under-perform humans in consistent ways. Experts then examine those flagged items to describe the task features, such as certain conceptual demands in chemistry, that drive the differences. This matters because it moves assessment design from broad benchmarks toward targeted adjustments that protect validity when AI tools are available. The work demonstrates the approach on a high school chemistry diagnostic and a university entrance exam across six chatbots.

Core claim

By treating humans and LLMs as comparison groups in a differential item functioning framework, the method locates items with statistically significant response differences, uses negative controls to confirm those differences are not artifacts, and links the flagged items to task dimensions such as reasoning type or knowledge demand through expert review.

What carries the argument

Differential Item Functioning (DIF) analysis, which statistically tests whether an item functions differently across groups after controlling for overall ability, here applied to human versus chatbot response distributions and paired with negative control items.
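
Below is a minimal sketch of that MH-DIF step in Python, assuming a pooled 0/1 correctness matrix with a boolean human/chatbot indicator. The function name, data shapes, quantile binning, and toy data are all illustrative; the paper's trimming rules and effect-size classification are not reproduced here.

```python
import numpy as np
from statsmodels.stats.contingency_tables import StratifiedTable

def mh_dif(responses, is_chatbot, item, n_bins=4):
    """Mantel-Haenszel DIF for one item, stratified by rest-score quantile bins.

    responses: (n_respondents, n_items) 0/1 matrix; is_chatbot: boolean vector.
    """
    x = responses[:, item]
    rest = responses.sum(axis=1) - x  # rest score: total score minus the studied item
    cuts = np.quantile(rest, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.digitize(rest, cuts)    # ability strata, matching on rest score
    tables = []
    for b in range(n_bins):
        m = bins == b
        h, c = x[m & ~is_chatbot], x[m & is_chatbot]
        if len(h) and len(c):         # skip strata missing either group
            tables.append(np.array([[h.sum(), c.sum()],
                                    [len(h) - h.sum(), len(c) - c.sum()]]))
    st = StratifiedTable(tables, shift_zeros=True)
    res = st.test_null_odds(correction=True)  # Cochran-Mantel-Haenszel chi-square
    return st.oddsratio_pooled, res.pvalue    # alpha_MH and its p-value

# Toy usage: 250 'students' and 50 'chatbot runs' answering 20 items.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(300, 20))
grp = np.r_[np.zeros(250, bool), np.ones(50, bool)]
for j in range(X.shape[1]):
    odds, p = mh_dif(X, grp, j)
    if p < 0.05:
        print(f"item {j}: alpha_MH={odds:.2f}, p={p:.4f}")
```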

If this is right

  • Test designers can prioritize or de-emphasize item types based on which dimensions produce large human-AI gaps.
  • DIF results supply evidence for claims about assessment fairness when AI assistance is possible.
  • The same pipeline can be repeated on new instruments or updated model versions to track shifts in capability divergence.
  • Subject-matter experts gain a data-driven starting point for revising items that are vulnerable to AI misuse.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Periodic re-application of the method could serve as an early-warning system for when new models close or widen specific performance gaps.
  • The approach might extend to mixed human-AI response datasets, such as students using chatbots during testing, to quantify score inflation risks.
  • Combining DIF flags with item-level difficulty estimates could support adaptive test construction that balances human and AI challenge levels in real time.

Load-bearing premise

Standard DIF procedures and negative controls carry over to LLM responses without major distortion from prompt wording, training data overlap, or the non-human shape of model output distributions.

What would settle it

Re-administering the same instruments to the same chatbots with varied prompts and finding that the set of DIF-flagged items changes substantially, or that expert reviews of those items yield no consistent task-dimension patterns.
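
One way to make "changes substantially" concrete is to measure the overlap of flagged-item sets across prompt variants. A minimal sketch, with hypothetical flag lists:

```python
def flag_overlap(flags_a, flags_b):
    """Jaccard overlap between two sets of DIF-flagged item indices."""
    a, b = set(flags_a), set(flags_b)
    return len(a & b) / len(a | b) if (a | b) else 1.0

# Hypothetical flags from two prompt variants of the same instrument:
print(flag_overlap([2, 5, 11, 17], [2, 5, 9, 17]))  # 0.6: moderate stability
```

A low overlap across reasonable prompt variants would suggest the flags are prompt artifacts rather than stable human-chatbot differences.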

Figures

Figures reproduced from arXiv: 2603.23682 by Alona Strugatski, Giora Alexandron, Licol Zeinfeld, Ron Blonder, Shelley Rap, Ziva Bar-Dov.

Figure 1
Figure 1: Implemented methodological framework for human–LLM DIF analysis. view at source ↗
Figure 2
Figure 2: Human–Chatbot ability distribution. (From the surrounding preprocessing text: with X_ij ∈ {0, 1} denoting correctness for respondent i on item j, the ability proxy for respondent i is the rest score R_{i,−j}, the sum of correct answers across all items except item j.) view at source ↗
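
The rest score in the Figure 2 caption is also the anchor for the paper's item-total correlation discrimination check. A minimal sketch of the corrected ITC under the same hypothetical 0/1 response matrix:

```python
import numpy as np

def corrected_itc(responses):
    """Corrected item-total correlation: each item vs. its rest score R_{i,-j}."""
    total = responses.sum(axis=1)
    itcs = []
    for j in range(responses.shape[1]):
        rest = total - responses[:, j]  # exclude item j from the total
        itcs.append(np.corrcoef(responses[:, j], rest)[0, 1])
    return np.array(itcs)  # low or negative values flag inconsistent items
```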
read the original abstract

The rapid adoption of large language models (LLMs) in education raises profound challenges for assessment design. To adapt assessments to the presence of LLM-based tools, it is crucial to characterize the strengths and weaknesses of LLMs in a generalizable, valid and reliable manner. However, current LLM evaluations often rely on descriptive statistics derived from benchmarks, and little research applies theory-grounded measurement methods to characterize LLM capabilities relative to human learners in ways that directly support assessment design. Here, by combining educational data mining and psychometric theory, we introduce a statistically principled approach for identifying items on which humans and LLMs show systematic response differences, pinpointing where assessments may be most vulnerable to AI misuse, and which task dimensions make problems particularly easy or difficult for generative AI. The method is based on Differential Item Functioning (DIF) analysis -- traditionally used to detect bias across demographic groups -- together with negative control analysis and item-total correlation discrimination analysis. It is evaluated on responses from human learners and six leading chatbots (ChatGPT-4o & 5.2, Gemini 1.5 & 3 Pro, Claude 3.5 & 4.5 Sonnet) to two instruments: a high school chemistry diagnostic test and a university entrance exam. Subject-matter experts then analyzed DIF-flagged items to characterize task dimensions associated with chatbot over- or under-performance. Results show that DIF-informed analytics provide a robust framework for understanding where LLM and human capabilities diverge, and highlight their value for improving the design of valid, reliable, and fair assessment in the AI era.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces a method combining Differential Item Functioning (DIF) analysis with negative control and item-total correlation analyses to identify items on which human learners and LLMs exhibit systematic differences in responses. This is applied to data from a high school chemistry diagnostic test and a university entrance exam, involving responses from humans and six leading chatbots (ChatGPT-4o & 5.2, Gemini 1.5 & 3 Pro, Claude 3.5 & 4.5 Sonnet). Subject-matter experts then characterize the task dimensions associated with over- or under-performance by LLMs on DIF-flagged items. The central claim is that this DIF-informed approach provides a robust framework for understanding divergences in capabilities and improving assessment design in the AI era.

Significance. If the empirical results support the robustness of the method, it would represent a useful application of established psychometric techniques to a new domain, potentially aiding educators in designing assessments that are more resistant to AI misuse while highlighting specific strengths and weaknesses of current LLMs. The use of negative controls and expert analysis adds to its practical value for assessment adaptation.

major comments (2)
  1. [Evaluation] The evaluation is limited to six fixed chatbots on two instruments without addressing how results might change under varied prompts or different model versions, which is critical given the prompt sensitivity of LLMs. This undermines the generalizability claim for the method in realistic assessment scenarios. (Evaluation section)
  2. [Methods] The manuscript applies standard DIF assumptions to LLM response patterns without additional validation for non-human characteristics such as lack of guessing parameters or potential data leakage; a concrete test for this transferability is needed to support the central claim that flagged items inform assessment vulnerabilities. (Methods section)
minor comments (2)
  1. [Abstract] The abstract lacks any quantitative results, effect sizes, or statistical details from the DIF analysis, which would help readers assess the strength of the findings immediately.
  2. [Notation] Ensure consistent use of terminology for the chatbots across the text to avoid confusion between versions like 4o and 5.2.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The comments highlight important considerations for generalizability and methodological assumptions when extending DIF analysis to LLM responses. We address each point below and outline targeted revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Evaluation] The evaluation is limited to six fixed chatbots on two instruments without addressing how results might change under varied prompts or different model versions, which is critical given the prompt sensitivity of LLMs. This undermines the generalizability claim for the method in realistic assessment scenarios. (Evaluation section)

    Authors: We agree that prompt sensitivity and model-version variability represent a key limitation for broad generalizability claims. Our evaluation deliberately used six leading models with standard (non-adversarial) prompts to demonstrate the method on current real-world tools. To address the concern, we will revise the Evaluation and Discussion sections to explicitly discuss prompt sensitivity as a limitation, include a recommendation that the DIF procedure be reapplied under varied prompts in practice, and note that the method itself is prompt- and model-agnostic. These additions will clarify the scope without requiring new experiments. revision: partial

  2. Referee: [Methods] The manuscript applies standard DIF assumptions to LLM response patterns without additional validation for non-human characteristics such as lack of guessing parameters or potential data leakage; a concrete test for this transferability is needed to support the central claim that flagged items inform assessment vulnerabilities. (Methods section)

    Authors: We acknowledge that standard DIF procedures were developed for human respondents and that LLMs lack guessing parameters and may exhibit data leakage. Our approach already incorporates negative-control items and item-total correlation checks to reduce reliance on parametric assumptions. To provide the requested concrete validation, we will add a short robustness subsection describing a simulation-based test: generating synthetic response matrices that mimic LLM traits (zero guessing, high consistency, possible leakage on certain items) and confirming that the DIF procedure still flags items consistent with known capability differences. This will directly support the transferability claim. revision: yes
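
A toy version of that simulation, with every generating parameter invented here for illustration: plant one item on which the synthetic "chatbots" underperform, give the chatbot group LLM-like traits (no guessing, shared response probabilities), and check that an MH-DIF pass recovers the planted item.

```python
import numpy as np
from statsmodels.stats.contingency_tables import StratifiedTable

rng = np.random.default_rng(1)
n_h, n_c, n_items, planted = 400, 60, 15, 7

diff = rng.normal(0, 1, n_items)                   # item difficulties
theta = rng.normal(0, 1, n_h)                      # human abilities
p_h = 0.2 + 0.8 / (1 + np.exp(-(theta[:, None] - diff)))  # humans can guess
humans = (rng.random((n_h, n_items)) < p_h).astype(int)

p_c = np.tile(1 / (1 + np.exp(-(1.5 - diff))), (n_c, 1))  # strong, no guessing
p_c[:, planted] = 0.1                              # planted chatbot weakness
chatbots = (rng.random((n_c, n_items)) < p_c).astype(int)

X = np.vstack([humans, chatbots])
grp = np.r_[np.zeros(n_h, bool), np.ones(n_c, bool)]

for j in range(n_items):                           # MH-DIF per item
    rest = X.sum(axis=1) - X[:, j]
    bins = np.digitize(rest, np.quantile(rest, [0.25, 0.5, 0.75]))
    tabs = []
    for b in range(4):
        m = bins == b
        h, c = X[m & ~grp, j], X[m & grp, j]
        if len(h) and len(c):
            tabs.append(np.array([[h.sum(), c.sum()],
                                  [len(h) - h.sum(), len(c) - c.sum()]]))
    res = StratifiedTable(tabs, shift_zeros=True).test_null_odds()
    if res.pvalue < 0.05:
        print(f"flagged item {j} (planted: {j == planted}), p={res.pvalue:.4f}")
```

If the planted item is flagged and unplanted flags stay near the nominal false-positive rate, that supports transferring the procedure to LLM-shaped response patterns.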

Circularity Check

0 steps flagged

Standard DIF applied to new human-LLM response data; no reduction to fitted inputs

full rationale

The paper applies pre-existing psychometric tools (DIF analysis, negative controls, item-total correlations) to response data collected from humans and six fixed chatbots on two instruments. No equations, derivations, or self-citations are shown that reduce the flagged items or task-dimension characterizations to parameters fitted from the same data by construction. The method treats LLM responses as an additional group for standard DIF detection rather than introducing a self-referential framework. This matches the default expectation of no significant circularity for an application paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the transferability of DIF assumptions to LLM responses and the validity of negative controls for this new comparison; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: DIF analysis assumptions hold when one group consists of LLM-generated responses rather than human demographic subgroups
    Traditional DIF is for human groups; the extension to AI is stated without additional justification in the abstract.

pith-pipeline@v0.9.0 · 5609 in / 1123 out tokens · 47057 ms · 2026-05-15T00:13:24.718357+00:00 · methodology


Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · 1 internal anchor

  1. [1]

INTRODUCTION The rapid adoption of generative AI (GenAI) tools in education has created both opportunities and risks. While these systems, particularly chatbots such as ChatGPT, can provide personalized explanations, feedback, and support for learners, their growing use also poses a profound threat. (Footnote 1: In this paper, we use GenAI to refer to generative A...)

  2. [2]

PSYCHOMETRIC PRELIMINARIES 2.1 Item–Total Correlation and Rest Score. Item–total correlation (ITC) is defined as the correlation between the score on a single item and the rest score for that item, which is the aggregated performance across all the other items in the test (also named 'corrected ITC'). It assesses the consistency of an item with the rest of the...

  3. [3]

    artificial students

METHODOLOGY This section describes how the psychometric measures above were applied to design a statistical method that identifies items that function differently for humans and chatbots (RQ1), and also the process through which DIF items were subjected to qualitative analysis by subject matter experts to characterize human–chatbot DIF behavior (RQ2). 3...

  4. [4]

EXPERIMENTS AND RESULTS This section applies the pipeline outlined in Section 3 and illustrated in Fig. 1. The computational part was applied to the psychometric and chemistry data. The qualitative part was applied to the chemistry dataset, and is described in Subsection 4.5. We then conclude with the resulting procedure for detecting human:GenAI DIF ite...

  5. [5]

MH-DIF. For each item, we: (a) stratified the trimmed responses by rest score into up to four quantile bins (or fewer when necessary); (b) applied difMH to compute the MH statistics retrieved below; (c) retrieved the common odds ratio α_MH, the chi-square statistic, and its p-value; (d) flagged items as DIF if p < .05 and the effect size category (defined in Section 2...

  6. [6]

LR-DIF. For each item, we: (a) fit three nested LR models: (i) an ability-only model, (ii) a uniform DIF model adding group membership as a predictor, (iii) a non-uniform DIF model adding the ability × group interaction; (b) computed likelihood-ratio test p-values for uniform (p_uniform) and non-uniform (p_nonuniform) DIF; (c) calculated McFadden's ΔR²...

  7. [7]

DISCUSSION The LR-based DIF approach developed and piloted in this research provides a method and conceptual framework for identifying assessment items that show differential behavior between humans and chatbots. In two assessment contexts and on leading chatbots, we demonstrated that the method provides reliable and stable results. A key observatio...

  8. [8]

ACKNOWLEDGMENTS This work was supported by the Knell Family Institute for Artificial Intelligence, Israel. The authors thank the National Institute for Testing and Evaluation for providing access to psychometric exam data.

  9. [9]

N. Akbari. The AI cheating crisis: Education needs its anti-doping movement, 2024. Retrieved from https://www.edweek.org/technology/opinion-the-ai-cheating-crisis-education-needs-its-anti-doping-movement/2024/02

  10. [10]

T. Barnes. The q-matrix method: Mining student response data for knowledge. In American Association for Artificial Intelligence 2005 Educational Data Mining Workshop, pages 1–8. AAAI Press, Pittsburgh, PA, USA, 2005.

  11. [11]

B. Borges et al. Could ChatGPT get an engineering degree? Evaluating higher education vulnerability to AI assistants. Proceedings of the National Academy of Sciences, 121(49), 2024.

  12. [12]

D. R. E. Cotton, P. A. Cotton, and J. R. Shipton. Chatting and cheating: Ensuring academic integrity in the era of ChatGPT. Innovations in Education and Teaching International, 61(2):228–239, 2024.

  13. [13]

R. J. De Ayala. The Theory and Practice of Item Response Theory. Guilford Publications, 2013.

  14. [14]

J. M. Echterhoff, Y. Liu, A. Alessa, J. McAuley, and Z. He. Cognitive bias in decision-making with LLMs. In Findings of the Association for Computational Linguistics: EMNLP 2024. Association for Computational Linguistics, 2024.

  15. [15]

A. C. Eggers, G. Tuñón, and A. Dafoe. Placebo tests for causal inference. Am. J. Polit. Sci., 68(3):1106–1121, 2024.

  16. [16]

M. G. Jodoin and M. J. Gierl. Evaluating type I error and power rates using an effect size measure with the logistic regression procedure for DIF detection. Applied Measurement in Education, 14(4):329–349, 2001.

  17. [17]

M. Khalil and E. Er. Will ChatGPT get you caught? Rethinking of plagiarism detection. In International Conference on Human-Computer Interaction, pages 475–487. Springer, 2023.

  18. [18]

    K. R. Koedinger, E. A. McLaughlin, and J. C. Stamper. Automated student model improvement. International Educational Data Mining Society, 2012

  19. [19]

N. Liu, S. Sonkar, and R. Baraniuk. Do LLMs make mistakes like students? Exploring natural alignments between language models and human error patterns. In International Conference on Artificial Intelligence in Education, pages 364–377. Springer, 2025.

  20. [20]

P. Lu et al. Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering. Advances in Neural Information Processing Systems, 35:2507–2521, 2022.

  21. [21]

D. Magis, S. Beland, F. Tuerlinckx, and P. De Boeck. A general framework and an R package for the detection of dichotomous differential item functioning. Behavior Research Methods, 42(3):847–862, 2010.

  22. [22]

N. Mantel and W. Haenszel. Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the National Cancer Institute, 22(4):719–748, 1959.

  23. [23]

P. Martinková, A. Drabinová, Y.-L. Liaw, E. A. Sanders, J. L. McFarland, and R. M. Price. Checking equity: Why differential item functioning analysis should be a routine part of developing conceptual assessments. CBE Life Sci. Educ., 16(2), 2017.

  24. [24]

A. Prothero. New data reveal how many students are using AI to cheat, 2024.

  25. [25]

H. J. Rogers and H. Swaminathan. A comparison of logistic regression and Mantel-Haenszel procedures for detecting differential item functioning. Applied Psychological Measurement, 17(2):105–116, 1993.

  26. [26]

A. Shete. Item analysis: An evaluation of multiple choice questions in physiology examination. J. of Contemporary Med. Educ., 2015.

  27. [27]

B. Sorenson and K. Hanson. Identifying generative artificial intelligence chatbot use on multiple-choice, general chemistry exams using Rasch analysis. Journal of Chemical Education, 101(8):3216–3223, 2024.

  28. [28]

A. Strugatski and G. Alexandron. Applying IRT to distinguish between human and generative AI responses to multiple-choice assessments. In Proceedings of the 15th International Learning Analytics and Knowledge Conference (LAK 2025), pages 817–823. ACM, 2025.

  29. [29]

T. Susnjak and T. R. McIntosh. ChatGPT: The end of online exam integrity? Education Sciences, 14(6), 2024.

  30. [30]

K. D. Wang, E. Burkholder, C. Wieman, S. Salehi, and N. Haber. Examining the potential and pitfalls of ChatGPT in science and engineering problem-solving. In Frontiers in Education, volume 8, page 1330486. Frontiers Media SA, 2024.

  31. [31]

X. Wang, Z. Hu, P. Lu, Y. Zhu, J. Zhang, S. Subramaniam, A. R. Loomba, S. Zhang, Y. Sun, and W. Wang. SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models. arXiv preprint arXiv:2307.10635, 2023.

  32. [32]

E. Yacobson, Y. Schleifer, Z. Bar-Dov, S. Rap, R. Blonder, and G. Alexandron. Benchmarking LLMs vs. high school students on standard chemistry exams: Insights for science education. OSF Preprints, 2025.

  33. [33]

L. Yan, L. Sha, L. Zhao, Y. Li, R. Martinez-Maldonado, G. Chen, X. Li, Y. Jin, and D. Gašević. Practical and ethical challenges of large language models in education: A systematic scoping review. Br. J. Educ. Technol., 55(1):90–112, 2024.

  34. [34]

    R. Zwick. A review of ets differential item functioning assessment procedures: Flagging rules, minimum sample size requirements, and criterion refinement, 2012