Automated Coding of Communication Data Using ChatGPT: Consistency Across Subgroups
Pith reviewed 2026-05-21 20:17 UTC · model grok-4.3
The pith
ChatGPT codes communication data as consistently as human raters across gender and racial/ethnic groups.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using a standard collaborative problem-solving coding framework and data from three collaborative task types, the study applies three checks adapted from the automated scoring literature to evaluate ChatGPT-based coding. The results show that ChatGPT-based coding performs as consistently as human raters across gender and racial/ethnic groups.
What carries the argument
Three checks for evaluating subgroup consistency in LLM-based coding, adapted from automated scoring literature, which test whether performance differences appear by demographic group.
If this is right
- ChatGPT coding becomes usable for large-scale assessments of collaboration and communication.
- Manual coding labor can be reduced while keeping fairness across demographic subgroups.
- The consistency supports equitable evaluation in educational and team settings without group-specific biases.
Where Pith is reading between the lines
- The same checks could be tried on other large language models or different coding frameworks to test broader applicability.
- If the pattern holds, researchers might analyze much larger communication datasets without adding demographic bias.
- Extending the method to additional task types or real-world settings would provide a stronger test of its limits.
Load-bearing premise
The three adapted checks from automated scoring literature are sufficient to detect any meaningful subgroup inconsistencies in the LLM coding outputs, and the data from the three collaborative task types are representative of broader communication data.
What would settle it
Applying the same three checks to ChatGPT outputs on a fresh set of collaborative communication data and finding a statistically significant difference in accuracy or consistency for any gender or racial/ethnic subgroup compared with human raters.
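One way such a subgroup comparison could be run is sketched below. This is not one of the paper's three checks, whose details are not given here. It is a generic permutation test on the gap in human-ChatGPT agreement rates between two subgroups. The function name, the 0/1 agreement flags, and the toy data are all hypothetical, and shuffling utterances independently ignores the fact that real utterances are nested within speakers and teams.

```python
import random

def agreement_gap_permutation_test(agree_a, agree_b, n_perm=10_000, seed=0):
    """Two-sided permutation test for a gap in agreement rates.

    agree_a, agree_b: lists of 0/1 flags, one per utterance, where 1 means
    the ChatGPT code matched the human code (hypothetical input format).
    Returns (observed gap in agreement rate, estimated p-value).
    """
    rng = random.Random(seed)
    n_a, n_b = len(agree_a), len(agree_b)
    observed = sum(agree_a) / n_a - sum(agree_b) / n_b
    pooled = list(agree_a) + list(agree_b)
    extreme = 0
    for _ in range(n_perm):
        # Reassign utterances to the two groups at random.
        rng.shuffle(pooled)
        gap = sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b
        # Small tolerance guards against floating-point ties.
        if abs(gap) >= abs(observed) - 1e-12:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (extreme + 1) / (n_perm + 1)

# Toy flags, invented for illustration only.
group_a = [1, 1, 1, 0, 1, 1, 0, 1]
group_b = [1, 0, 1, 1, 0, 1, 1, 1]
gap, p_value = agreement_gap_permutation_test(group_a, group_b, n_perm=2000)
print(f"gap={gap:.3f}, p={p_value:.3f}")
```

A small p-value on fresh data would count against the consistency claim. A large one is only weak support, since a test like this has little power when a subgroup is small.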
Original abstract
Assessing communication and collaboration at scale depends on a labor-intensive task of coding communication data into categories according to different frameworks. Prior research has established that ChatGPT can be directly instructed with coding rubrics to code the communication data and achieves accuracy comparable to human raters. However, whether the coding from ChatGPT or similar AI technology perform consistently across different demographic groups, such as gender and race, remains unclear. To address this gap, we introduce three checks for evaluating subgroup consistency in LLM-based coding by adapting an existing framework from the automated scoring literature. Using a typical collaborative problem-solving coding framework and data from three types of collaborative tasks, we examine ChatGPT-based coding performance across gender and racial/ethnic groups. Our results show that ChatGPT-based coding perform consistently in the same way as human raters across gender or racial/ethnic groups, demonstrating the possibility of its use in large-scale assessments of collaboration and communication.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that ChatGPT can be instructed with coding rubrics to code communication data from collaborative tasks with accuracy comparable to human raters, and that three checks adapted from automated scoring literature demonstrate consistent performance across gender and racial/ethnic subgroups using data from three collaborative task types.
Significance. If the central claim holds after addressing reporting gaps, the work would support scalable, demographically fair use of LLMs for assessing collaboration and communication in educational and social-science contexts, reducing reliance on labor-intensive human coding.
major comments (3)
- [Abstract] Abstract: the claim of consistent performance across subgroups is presented without any reported sample sizes, specific agreement metrics (e.g., Cohen’s kappa or percent agreement), statistical tests, or exclusion criteria, preventing evaluation of whether the evidence actually supports the central claim.
- [Methods] Methods (description of the three adapted checks): the paper does not justify why checks developed for supervised automated scoring systems are sufficient to detect LLM-specific issues such as prompt sensitivity or differential coding rates that could arise even when aggregate accuracy matches human raters.
- [Results] Results (subgroup analysis): without information on the actual subgroup distributions within each of the three task types or evidence that communication content is balanced across groups, it remains unclear whether the data are representative enough to rule out meaningful inconsistencies.
minor comments (2)
- Provide the exact wording of the prompts used with ChatGPT and any temperature or sampling parameters.
- Include a table or figure that directly compares human and ChatGPT codes broken down by gender and racial/ethnic categories.
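The per-subgroup breakdown requested in the comments above could be computed along these lines. This is a minimal sketch that assumes each utterance is stored as a (group, human code, ChatGPT code) tuple. The record layout, function names, and code labels are illustrative and are not taken from the paper.

```python
import math
from collections import Counter, defaultdict

def percent_agreement(human, model):
    """Share of items on which the two coders assign the same code."""
    return sum(h == m for h, m in zip(human, model)) / len(human)

def cohen_kappa(human, model):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(human)
    p_o = percent_agreement(human, model)
    h_counts, m_counts = Counter(human), Counter(model)
    # Chance agreement from the product of each coder's marginal proportions.
    p_e = sum(h_counts[c] * m_counts[c] for c in h_counts) / (n * n)
    if p_e == 1.0:
        return math.nan  # kappa is undefined when both coders use one single code
    return (p_o - p_e) / (1 - p_e)

def agreement_by_group(records):
    """records: iterable of (group, human_code, model_code) tuples (assumed layout)."""
    by_group = defaultdict(lambda: ([], []))
    for group, human_code, model_code in records:
        by_group[group][0].append(human_code)
        by_group[group][1].append(model_code)
    return {
        g: {
            "n": len(h),
            "percent_agreement": percent_agreement(h, m),
            "kappa": cohen_kappa(h, m),
        }
        for g, (h, m) in by_group.items()
    }

# Toy records with made-up group and code labels, for illustration only.
toy = [
    ("group_1", "share", "share"),
    ("group_1", "negotiate", "share"),
    ("group_2", "share", "share"),
    ("group_2", "negotiate", "negotiate"),
]
for group, stats in agreement_by_group(toy).items():
    print(group, stats)
```

Reporting n alongside each metric matters because kappa is unstable in small subgroups, which is part of the referee's concern about subgroup distributions.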
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments on our manuscript. We have addressed each major comment point by point below, with plans to revise the manuscript to improve reporting and justification where the concerns are valid.
- Referee: [Abstract] Abstract: the claim of consistent performance across subgroups is presented without any reported sample sizes, specific agreement metrics (e.g., Cohen’s kappa or percent agreement), statistical tests, or exclusion criteria, preventing evaluation of whether the evidence actually supports the central claim.
  Authors: We agree that the abstract would be strengthened by including these quantitative details. In the revised version, we will add the sample sizes for each of the three task types and subgroups, report the specific agreement metrics (Cohen’s kappa and percent agreement), reference the statistical tests used, and clarify exclusion criteria. These details will also be expanded in the results and methods sections to allow full evaluation of the evidence.
  revision: yes
- Referee: [Methods] Methods (description of the three adapted checks): the paper does not justify why checks developed for supervised automated scoring systems are sufficient to detect LLM-specific issues such as prompt sensitivity or differential coding rates that could arise even when aggregate accuracy matches human raters.
  Authors: We acknowledge that additional justification is warranted for adapting these checks to LLM coding. The checks evaluate whether coding performance mirrors human raters across subgroups, which directly tests for inconsistencies. In revision, we will expand the methods to explicitly discuss how the checks can identify LLM-specific concerns, including prompt sensitivity via robustness checks with varied instructions and examination of differential coding rates across groups. This will clarify their sufficiency for the current application.
  revision: yes
- Referee: [Results] Results (subgroup analysis): without information on the actual subgroup distributions within each of the three task types or evidence that communication content is balanced across groups, it remains unclear whether the data are representative enough to rule out meaningful inconsistencies.
  Authors: We agree that providing these details is essential for assessing representativeness. The tasks involved standardized collaborative activities with participants from diverse backgrounds. In the revised manuscript, we will include tables detailing subgroup distributions (gender and racial/ethnic) and utterance counts for each task type. We will also add discussion of communication content balance, drawing on the task designs and any available content analysis to support that the data allow meaningful checks for inconsistencies.
  revision: yes
Circularity Check
Empirical subgroup consistency checks contain no circular derivation
The paper adapts three checks from existing automated scoring literature and applies them to direct empirical comparisons between ChatGPT-generated codes and independent human rater codes on communication data from three collaborative task types. No equations, predictions, or first-principles results are claimed; the central finding is an observed match in subgroup performance patterns. This is a standard empirical validation study whose results are not forced by any self-definition, fitted input renamed as prediction, or self-citation chain. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The collaborative problem-solving coding framework and rubrics are valid for the tasks studied.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  "Our results show that ChatGPT-based coding perform consistently in the same way as human raters across gender or racial/ethnic groups"
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  "we introduce three checks for evaluating subgroup consistency in LLM-based coding by adapting an existing framework from the automated scoring literature"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.