Automated Coding of Communication Data Using ChatGPT: Consistency Across Subgroups

Emily Kerzabi; Jiangang Hao; Patrick Kyllonen; Wenju Cui

arxiv: 2510.20584 · v3 · pith:OHJBLP7Knew · submitted 2025-10-23 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-21 20:17 UTC · model grok-4.3

keywords: ChatGPT · automated coding · communication data · subgroup consistency · collaborative problem-solving · LLM fairness · large-scale assessment

The pith

ChatGPT codes communication data as consistently as human raters do across gender and racial/ethnic groups.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests whether ChatGPT can code communication data from collaborative tasks without introducing inconsistencies across demographic subgroups such as gender and race. Earlier results showed that ChatGPT reaches human-level accuracy when given explicit coding rubrics, yet the fairness of those outputs across groups had not been checked. The authors adapt three consistency checks from automated scoring research and apply them to data from three types of collaborative problem-solving tasks. They report that the AI outputs align with human ratings in the same way across the examined groups. A reader would care because this pattern, if reliable, removes a major barrier to scaling up assessments of collaboration skills that currently demand heavy manual effort.

Core claim

Using a standard collaborative problem-solving coding framework and data from three collaborative task types, the study applies three checks adapted from the automated scoring literature to evaluate ChatGPT-based coding. The results indicate that ChatGPT-based coding performs consistently across gender and racial/ethnic groups, in the same way that human raters do.

What carries the argument

Three checks for evaluating subgroup consistency in LLM-based coding, adapted from automated scoring literature, which test whether performance differences appear by demographic group.
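
The source does not spell out the three checks, so the following is only a minimal sketch of one check of this general kind: computing human-model agreement (Cohen's kappa) separately for each subgroup so the values can be compared. The function names, the record layout, and the toy labels are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter, defaultdict


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: share of items where both raters chose the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of the two raters' marginal label rates.
    freq_a, freq_b = Counter(a), Counter(b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def kappa_by_group(records):
    """records: iterable of (group, human_code, model_code) tuples (assumed layout)."""
    by_group = defaultdict(lambda: ([], []))
    for group, human, model in records:
        by_group[group][0].append(human)
        by_group[group][1].append(model)
    return {g: cohen_kappa(h, m) for g, (h, m) in by_group.items()}


# Hypothetical toy data: group label, human code, model code.
toy_records = [
    ("group_1", "share", "share"),
    ("group_1", "negotiate", "negotiate"),
    ("group_1", "share", "negotiate"),
    ("group_2", "share", "share"),
    ("group_2", "negotiate", "negotiate"),
    ("group_2", "negotiate", "share"),
]
print(kappa_by_group(toy_records))
```

A large gap between the per-group kappa values would be a signal worth investigating; with small subgroups, the values are noisy and should be read alongside their sample sizes.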

If this is right

ChatGPT coding becomes usable for large-scale assessments of collaboration and communication.
Manual coding labor can be reduced while keeping fairness across demographic subgroups.
The consistency supports equitable evaluation in educational and team settings without group-specific biases.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same checks could be tried on other large language models or different coding frameworks to test broader applicability.
If the pattern holds, researchers might analyze much larger communication datasets without adding demographic bias.
Extending the method to additional task types or real-world settings would provide a stronger test of its limits.

Load-bearing premise

The three adapted checks from automated scoring literature are sufficient to detect any meaningful subgroup inconsistencies in the LLM coding outputs, and the data from the three collaborative task types are representative of broader communication data.

What would settle it

Applying the same three checks to ChatGPT outputs on a fresh set of collaborative communication data and finding a statistically significant difference in accuracy or consistency for any gender or racial/ethnic subgroup compared with human raters.
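
As a rough illustration of such a test, the sketch below compares human-model agreement rates in two subgroups with a two-sided two-proportion z-test. The counts are hypothetical and the function name is an assumption. The test also treats utterances as independent, which chat data from the same team usually violates. It is not the procedure used in the paper.

```python
import math


def two_proportion_z_test(agree_a, n_a, agree_b, n_b):
    """Two-sided z-test for a difference between two agreement rates.

    agree_x: number of utterances where the model code matched the human code
    n_x:     number of utterances in that subgroup
    Returns (z, p_value).
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("both subgroups need at least one utterance")
    p_a, p_b = agree_a / n_a, agree_b / n_b
    # Pooled rate under the null hypothesis of equal agreement in both groups.
    pooled = (agree_a + agree_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # both rates are exactly 0 or exactly 1
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 90/100 agreements in one subgroup, 70/100 in the other.
z, p = two_proportion_z_test(90, 100, 70, 100)
print(f"z = {z:.3f}, p = {p:.5f}")
```

Because utterances cluster within participants and teams, a resampling approach at the team level would likely be a safer choice for real data than this simple test.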

Original abstract

Assessing communication and collaboration at scale depends on a labor-intensive task of coding communication data into categories according to different frameworks. Prior research has established that ChatGPT can be directly instructed with coding rubrics to code the communication data and achieves accuracy comparable to human raters. However, whether the coding from ChatGPT or similar AI technology perform consistently across different demographic groups, such as gender and race, remains unclear. To address this gap, we introduce three checks for evaluating subgroup consistency in LLM-based coding by adapting an existing framework from the automated scoring literature. Using a typical collaborative problem-solving coding framework and data from three types of collaborative tasks, we examine ChatGPT-based coding performance across gender and racial/ethnic groups. Our results show that ChatGPT-based coding perform consistently in the same way as human raters across gender or racial/ethnic groups, demonstrating the possibility of its use in large-scale assessments of collaboration and communication.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

ChatGPT coding matches human consistency across groups in this setup, but the three borrowed checks may miss generative-model biases.

The key takeaway is that the authors took three subgroup-consistency checks from automated scoring, applied them to ChatGPT coding of communication data from collaborative tasks, and found performance that lines up with human raters across gender and racial/ethnic groups. That result is the main thing a colleague should know up front. They address a practical gap: earlier studies showed ChatGPT can reach human-level accuracy on coding rubrics, but left open whether the outputs stay fair when broken down by demographics. Using data from three types of collaborative tasks gives a concrete test bed for that question. The adaptation itself is straightforward and directly useful for anyone scaling up assessments in education or workforce settings. The work is honest about its scope and sticks to empirical comparison rather than claiming a new method.

The soft spots are mostly about missing detail and scope. The abstract gives no sample sizes, no exact agreement metrics, and no description of the statistical tests or exclusion rules, so the strength of the consistency claim is hard to judge from the summary alone. More critically, the three checks were designed for fixed, supervised scorers; a prompted generative model like ChatGPT can still produce differential rates through prompt sensitivity or training-data patterns even when aggregate matches look good. The three task types may not cover enough variation in communication style or subgroup distribution to rule that out.

This paper is for people working on automated scoring and fairness in educational or workforce assessments. A reader already thinking about LLM use in psychometrics would get a useful data point and a ready framework to build on. It deserves a serious referee because the question is real and the approach is reasonable, even if the methods section will need tightening and perhaps additional checks for prompt effects.

Referee Report

3 major / 2 minor

Summary. The manuscript claims that ChatGPT can be instructed with coding rubrics to code communication data from collaborative tasks with accuracy comparable to human raters, and that three checks adapted from automated scoring literature demonstrate consistent performance across gender and racial/ethnic subgroups using data from three collaborative task types.

Significance. If the central claim holds after addressing reporting gaps, the work would support scalable, demographically fair use of LLMs for assessing collaboration and communication in educational and social-science contexts, reducing reliance on labor-intensive human coding.

major comments (3)

[Abstract] The claim of consistent performance across subgroups is presented without any reported sample sizes, specific agreement metrics (e.g., Cohen’s kappa or percent agreement), statistical tests, or exclusion criteria, preventing evaluation of whether the evidence actually supports the central claim.
[Methods, description of the three adapted checks] The paper does not justify why checks developed for supervised automated scoring systems are sufficient to detect LLM-specific issues such as prompt sensitivity or differential coding rates that could arise even when aggregate accuracy matches human raters.
[Results, subgroup analysis] Without information on the actual subgroup distributions within each of the three task types or evidence that communication content is balanced across groups, it remains unclear whether the data are representative enough to rule out meaningful inconsistencies.

minor comments (2)

Provide the exact wording of the prompts used with ChatGPT and any temperature or sampling parameters.
Include a table or figure that directly compares human and ChatGPT codes broken down by gender and racial/ethnic categories.
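
One possible way to assemble the breakdown requested in the second minor comment is sketched below: for each subgroup, the share of utterances the model assigned to each code minus the share the human rater assigned to it. The record layout, the function name, and the code labels are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def code_rate_gaps(records):
    """records: iterable of (group, human_code, model_code) tuples (assumed layout).

    Returns {group: {code: model_rate - human_rate}}, where each rate is the
    share of that group's utterances assigned to the code.
    """
    human = defaultdict(Counter)
    model = defaultdict(Counter)
    for group, h, m in records:
        human[group][h] += 1
        model[group][m] += 1
    gaps = {}
    for group in human:
        n = sum(human[group].values())  # utterances in this subgroup
        codes = sorted(set(human[group]) | set(model[group]))
        gaps[group] = {c: (model[group][c] - human[group][c]) / n for c in codes}
    return gaps


# Hypothetical toy data: group label, human code, model code.
toy_records = [
    ("group_1", "share", "share"),
    ("group_1", "share", "negotiate"),
    ("group_1", "negotiate", "negotiate"),
    ("group_1", "negotiate", "negotiate"),
    ("group_2", "share", "share"),
    ("group_2", "negotiate", "negotiate"),
]
for group, row in code_rate_gaps(toy_records).items():
    print(group, row)
```

Gaps near zero in every subgroup would suggest the model uses each code about as often as the human rater does; a gap that appears in only one subgroup is the kind of differential coding rate the referee asks about.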

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments on our manuscript. We have addressed each major comment point by point below, with plans to revise the manuscript to improve reporting and justification where the concerns are valid.

Referee: [Abstract] The claim of consistent performance across subgroups is presented without any reported sample sizes, specific agreement metrics (e.g., Cohen’s kappa or percent agreement), statistical tests, or exclusion criteria, preventing evaluation of whether the evidence actually supports the central claim.

Authors: We agree that the abstract would be strengthened by including these quantitative details. In the revised version, we will add the sample sizes for each of the three task types and subgroups, report the specific agreement metrics (Cohen’s kappa and percent agreement), reference the statistical tests used, and clarify exclusion criteria. These details will also be expanded in the results and methods sections to allow full evaluation of the evidence.
revision: yes

Referee: [Methods, description of the three adapted checks] The paper does not justify why checks developed for supervised automated scoring systems are sufficient to detect LLM-specific issues such as prompt sensitivity or differential coding rates that could arise even when aggregate accuracy matches human raters.

Authors: We acknowledge that additional justification is warranted for adapting these checks to LLM coding. The checks evaluate whether coding performance mirrors human raters across subgroups, which directly tests for inconsistencies. In revision, we will expand the methods to explicitly discuss how the checks can identify LLM-specific concerns, including prompt sensitivity via robustness checks with varied instructions and examination of differential coding rates across groups. This will clarify their sufficiency for the current application.
revision: yes

Referee: [Results, subgroup analysis] Without information on the actual subgroup distributions within each of the three task types or evidence that communication content is balanced across groups, it remains unclear whether the data are representative enough to rule out meaningful inconsistencies.

Authors: We agree that providing these details is essential for assessing representativeness. The tasks involved standardized collaborative activities with participants from diverse backgrounds. In the revised manuscript, we will include tables detailing subgroup distributions (gender and racial/ethnic) and utterance counts for each task type. We will also add discussion of communication content balance, drawing on the task designs and any available content analysis to support that the data allow meaningful checks for inconsistencies.
revision: yes

Circularity Check

0 steps flagged

Empirical subgroup consistency checks contain no circular derivation

The paper adapts three checks from existing automated scoring literature and applies them to direct empirical comparisons between ChatGPT-generated codes and independent human rater codes on communication data from three collaborative task types. No equations, predictions, or first-principles results are claimed; the central finding is an observed match in subgroup performance patterns. This is a standard empirical validation study whose results are not forced by any self-definition, fitted input renamed as prediction, or self-citation chain. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the validity of the pre-existing collaborative problem-solving coding framework and the sufficiency of the three adapted consistency checks; no free parameters, invented entities, or new axioms are introduced.

axioms (1)

domain assumption: The collaborative problem-solving coding framework and rubrics are valid for the tasks studied.
Paper relies on this established framework to define the coding categories being evaluated for consistency.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "Our results show that ChatGPT-based coding perform consistently in the same way as human raters across gender or racial/ethnic groups"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "we introduce three checks for evaluating subgroup consistency in LLM-based coding by adapting an existing framework from the automated scoring literature"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.