Can Large Language Models Reliably Code Qualitative Humanitarian Data? A Benchmark Study Against Human Expert Adjudication
Pith reviewed 2026-06-26 05:17 UTC · model grok-4.3
The pith
Multiple large language models achieve deductive coding reliability on humanitarian transcripts comparable to experienced human coders when using structured prompts and reasoning-enabled setups.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Multiple LLMs can perform deductive coding at reliability levels comparable to experienced human coders, especially when structured prompts and reasoning-enabled configurations are used. At the same time, aggregate reliability metrics alone are insufficient for deployment decisions. Models varied in recognizing needs expressed indirectly, needs outside predefined categories, and protection-relevant concerns such as physical safety and discrimination.
What carries the argument
The benchmark comparison of 46 LLMs against human expert adjudication on 150 high-fidelity synthetic humanitarian transcripts, using inter-rater reliability testing with Krippendorff's alpha together with discrepancy analysis and humanitarian-specific qualitative criteria.
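To make the reliability metric concrete, here is a minimal sketch of Krippendorff's alpha for nominal codes, computed from a coincidence matrix. It is not the authors' implementation: the function name `krippendorff_alpha_nominal` and the input layout (one list of codes per transcript segment, `None` for a missing code) are assumptions made for illustration.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal codes.

    units: list of lists. Each inner list holds the codes that the coders
    assigned to one unit; None marks a missing code. (Hypothetical layout.)
    """
    coincidences = Counter()
    for ratings in units:
        values = [v for v in ratings if v is not None]
        m = len(values)
        if m < 2:
            continue  # a unit with fewer than two codes cannot be paired
        # Every ordered pair of codes from different coders in the same unit
        # contributes 1 / (m - 1) to the coincidence matrix.
        for i, j in permutations(range(m), 2):
            coincidences[(values[i], values[j])] += 1.0 / (m - 1)

    totals = Counter()
    for (c, _k), weight in coincidences.items():
        totals[c] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one unit coded by two coders")

    # Observed and expected disagreement (both scaled by n, which cancels).
    observed = sum(w for (c, k), w in coincidences.items() if c != k)
    expected = sum(
        totals[c] * totals[k] for c in totals for k in totals if c != k
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one code is used")
    return 1.0 - observed / expected
```

A value of 1.0 means perfect agreement and 0.0 means agreement no better than chance. To compare a model with the human gold standard, each unit would hold the gold code and the model's code for the same segment.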
If this is right
- LLMs can materially expand humanitarian analytical capacity when paired with structured codebooks and reasoning-enabled models.
- Appropriate deployment requires attention to theme-specific performance and tiered oversight focused on categories where miscoding would have the greatest programmatic consequences.
- Open-weights models deployed on self-hosted infrastructure provide a viable path for combining analytical scalability with stronger data governance.
- LLMs should not be treated as substitutes for human judgment in sensitive humanitarian coding tasks.
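The tiered-oversight recommendation above can be sketched as a simple routing rule. Everything here is hypothetical: the function `review_tier`, the tier labels, and the theme names are illustrative, and the 0.80 and 0.667 cut-offs are the conventional Krippendorff thresholds rather than values taken from the paper.

```python
def review_tier(theme, alpha, high_stakes_themes, reliable=0.80, tentative=0.667):
    """Pick a human-review tier for one coding theme.

    theme: name of the codebook theme (hypothetical labels).
    alpha: model-vs-gold Krippendorff's alpha measured for that theme.
    high_stakes_themes: themes where miscoding has the greatest
        programmatic consequences, e.g. protection concerns.
    """
    # High-stakes themes always get full review, whatever the aggregate score.
    if theme in high_stakes_themes or alpha < tentative:
        return "full_human_review"
    if alpha < reliable:
        return "sampled_human_review"
    return "spot_check"


if __name__ == "__main__":
    protection = {"physical_safety", "discrimination"}
    for theme, alpha in [("physical_safety", 0.93), ("shelter", 0.91), ("food", 0.72)]:
        print(theme, review_tier(theme, alpha, protection))
```

The design choice worth noting is that the high-stakes check comes first, so a strong aggregate score never lowers oversight for protection-relevant themes.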
Where Pith is reading between the lines
- The same structured-prompt and reasoning configuration approach could be tested on qualitative coding tasks in adjacent high-stakes domains such as legal intake or medical case notes.
- Performance gaps on indirect communication suggest that targeted fine-tuning or retrieval-augmented prompts on real examples of non-standard phrasing might close remaining differences with human coders.
- If synthetic transcripts under-represent certain cultural or linguistic patterns, real-world reliability could drop unless the evaluation is repeated on diverse field data.
Load-bearing premise
The 150 high-fidelity synthetic humanitarian transcripts sufficiently capture the variability, indirect communication, and complex needs hierarchies present in real affected-population data.
What would settle it
Apply the same LLMs, prompts, and evaluation protocol to a fresh collection of real field-collected humanitarian transcripts and check whether Krippendorff's alpha stays at human-comparable levels while discrepancy patterns on indirect and protection themes remain stable.
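One way to judge whether alpha "stays at human-comparable levels" on field data is to attach an uncertainty interval to it. The sketch below is an assumption about how such a check could be run, not a procedure from the paper: it resamples units with replacement and reports a percentile bootstrap interval for two-coder nominal alpha. The names `nominal_alpha` and `bootstrap_alpha_ci` are hypothetical.

```python
import random
from collections import Counter


def nominal_alpha(pairs):
    """Krippendorff's alpha for two coders, nominal codes, no missing values.

    pairs: list of (gold_code, model_code) tuples, one per unit.
    Returns None when alpha is undefined (only one code in use).
    """
    coincidences = Counter()
    for a, b in pairs:
        coincidences[(a, b)] += 1
        coincidences[(b, a)] += 1
    totals = Counter()
    for (c, _k), weight in coincidences.items():
        totals[c] += weight
    n = sum(totals.values())
    if n < 2:
        return None
    disagree = sum(w for (c, k), w in coincidences.items() if c != k)
    expected = (n * n - sum(v * v for v in totals.values())) / (n - 1)
    return None if expected == 0 else 1.0 - disagree / expected


def bootstrap_alpha_ci(pairs, draws=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for alpha, resampling whole units."""
    rng = random.Random(seed)
    stats = []
    for _ in range(draws):
        sample = [pairs[rng.randrange(len(pairs))] for _ in pairs]
        value = nominal_alpha(sample)
        if value is not None:  # skip degenerate resamples
            stats.append(value)
    if not stats:
        raise ValueError("no resample produced a defined alpha")
    stats.sort()
    lo_idx = int((1 - level) / 2 * (len(stats) - 1))
    hi_idx = int((1 + level) / 2 * (len(stats) - 1))
    return stats[lo_idx], stats[hi_idx]
```

Running this separately on the synthetic set and on a field-collected set would show whether the two intervals overlap, which is a rough but transparent first check before any formal comparison.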
Original abstract
Data from affected populations are crucial for informing humanitarian response, but their value depends on timely and consistent interpretation of nuanced accounts of need. Humanitarian organizations often lack the staff, time, and specialist expertise required to analyze this information at scale. Large language models (LLMs) may expand this capacity, but their reliability for coding qualitative humanitarian data has not been directly established. This benchmark study compares 46 LLMs to a human Gold Standard using 150 high-fidelity synthetic humanitarian transcripts. Evaluation combined inter-rater reliability testing with Krippendorff's alpha, discrepancy analysis distinguishing correct, near-correct, and incorrect codes, and qualitative assessment across humanitarian-specific criteria including discrimination, complex needs hierarchies, and non-standard communication styles. The authors find that multiple LLMs can perform deductive coding at reliability levels comparable to experienced human coders, especially when structured prompts and reasoning-enabled configurations are used. At the same time, aggregate reliability metrics alone are insufficient for deployment decisions. Models varied in recognizing needs expressed indirectly, needs outside predefined categories, and protection-relevant concerns such as physical safety and discrimination. These findings suggest that LLMs can materially expand humanitarian analytical capacity, but not as substitutes for human judgment. Appropriate use requires structured codebooks, reasoning-enabled models, attention to theme-specific performance, and tiered oversight focused on categories where miscoding would have the greatest programmatic consequences. For sensitive humanitarian data, open-weights models deployed on self-hosted infrastructure may offer a viable path for combining analytical scalability with stronger data governance.
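The point that aggregate metrics can hide theme-level weaknesses can be illustrated with a per-theme discrepancy tally. This is a sketch under stated assumptions: the summary here does not give the paper's rule for what counts as "near-correct", so the `near_pairs` argument (a set of code pairs treated as close) and the function name `discrepancy_by_theme` are hypothetical.

```python
from collections import Counter, defaultdict


def discrepancy_by_theme(records, near_pairs):
    """Tally correct / near_correct / incorrect outcomes per gold theme.

    records: iterable of (gold_code, model_code) tuples.
    near_pairs: set of frozensets naming code pairs considered close
        (an illustrative stand-in for the paper's near-correct rule).
    """
    table = defaultdict(Counter)
    for gold, model in records:
        if model == gold:
            outcome = "correct"
        elif frozenset((gold, model)) in near_pairs:
            outcome = "near_correct"
        else:
            outcome = "incorrect"
        table[gold][outcome] += 1
    return {theme: dict(counts) for theme, counts in table.items()}


if __name__ == "__main__":
    near = {frozenset(("food", "nutrition"))}
    records = [
        ("food", "food"),
        ("food", "nutrition"),
        ("physical_safety", "shelter"),
        ("physical_safety", "physical_safety"),
    ]
    print(discrepancy_by_theme(records, near))
```

Reading the tally per theme rather than in total makes it visible when a model does well overall but misses a protection-relevant category.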
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a benchmark study comparing 46 LLMs to a human gold standard on deductive coding of 150 high-fidelity synthetic humanitarian transcripts. Evaluation uses Krippendorff's alpha for inter-rater reliability, discrepancy analysis (correct/near-correct/incorrect), and qualitative checks on humanitarian criteria such as indirect expression, complex needs hierarchies, and protection concerns. The central claim is that multiple LLMs (especially with structured prompts and reasoning-enabled settings) reach reliability levels comparable to experienced human coders, while aggregate metrics alone are insufficient and theme-specific performance varies; the authors recommend structured codebooks, model selection, and tiered human oversight rather than substitution.
Significance. If the comparability result holds beyond the synthetic setting, the work could materially increase analytical capacity in resource-constrained humanitarian organizations. The combination of statistical reliability testing with domain-specific qualitative assessment offers a practical evaluation template. Explicit attention to theme-specific risks (e.g., protection concerns) and the call for tiered oversight provide actionable deployment guidance. The scale of models tested (46) and transparent disaggregation of performance strengthen the contribution as an empirical benchmark.
major comments (1)
- [Methods] Methods section (evaluation design): The benchmark and all reported Krippendorff alphas, discrepancy breakdowns, and theme-specific findings rest exclusively on 150 synthetic transcripts. No real-data hold-out, parallel annotation of actual affected-population transcripts, or distributional comparison on dimensions such as indirect communication or protection-relevant concerns is reported. Because the paper frames the results as relevant to 'humanitarian response' and 'affected populations,' this untested distributional match is load-bearing for the headline claim of human-comparable reliability in the target setting.
minor comments (1)
- [Abstract] Abstract: The opening sentence could explicitly note that all quantitative results derive from synthetic transcripts to prevent readers from overgeneralizing the comparability finding.
Simulated Author's Rebuttal
We thank the referee for their constructive review and for identifying a key aspect of our evaluation design. We respond to the major comment below and have incorporated clarifications into the revised manuscript where appropriate.
Point-by-point responses
- Referee: [Methods] Methods section (evaluation design): The benchmark and all reported Krippendorff alphas, discrepancy breakdowns, and theme-specific findings rest exclusively on 150 synthetic transcripts. No real-data hold-out, parallel annotation of actual affected-population transcripts, or distributional comparison on dimensions such as indirect communication or protection-relevant concerns is reported. Because the paper frames the results as relevant to 'humanitarian response' and 'affected populations,' this untested distributional match is load-bearing for the headline claim of human-comparable reliability in the target setting.
Authors: We acknowledge that the evaluation relies exclusively on the 150 high-fidelity synthetic transcripts and that no parallel real-data annotation or explicit distributional comparison is provided. This design choice stems from the ethical and legal constraints surrounding real humanitarian transcripts, which routinely contain sensitive protection-related information (e.g., physical safety risks, discrimination) that cannot be released or used for open benchmarking without violating data-protection standards and informed-consent requirements. The synthetic transcripts were constructed with direct input from humanitarian domain experts specifically to replicate the linguistic and thematic features highlighted in the referee comment (indirect expression, complex needs hierarchies, non-standard communication, and protection concerns), so that the benchmark could isolate model performance on precisely those dimensions. We agree that the absence of a real-data hold-out constitutes a limitation for direct extrapolation to operational settings. In the revised manuscript we have added an explicit subsection in the Discussion that (a) states the ethical rationale for synthetic data, (b) summarizes the expert-driven construction process used to approximate real distributions, and (c) outlines the conditions under which future real-world validation would be feasible. These additions make the scope and generalizability of the claims more transparent without altering the reported results or the core methodological contribution.
Revision: partial
Circularity Check
No circularity: pure empirical benchmark against external gold standard
Full rationale
The paper is an empirical benchmark study that directly measures LLM performance against a human Gold Standard on 150 synthetic transcripts. No derivations, equations, fitted parameters renamed as predictions, or self-citation chains appear in the methods or results. All reliability metrics (Krippendorff's alpha, discrepancy analysis) are computed from observed agreement with external human adjudication, with no reduction to inputs by construction. The synthetic data choice affects generalizability but does not create circularity in the reported claims.
Axiom & Free-Parameter Ledger
axioms (2)
- [standard math] Krippendorff's alpha is an appropriate measure for comparing LLM and human coding reliability in this multi-rater setting
- [domain assumption] High-fidelity synthetic transcripts capture the key characteristics of real humanitarian data including indirect expression and non-standard communication
Appendix 1 (excerpt): Krippendorff's alpha and discrepancy analysis results across all models
- Google, Gemini-3.1-pro (reasoning-medium): Gold Standard 92.60%, Relevant 6.60%, Mentioned 0.00%, Incorrect 0.90%, Invalid 0.00%, Relevance Score 95.90%, Krippendorff's alpha 0.922
- Google, Gemini-3.1-pro: Gold Standard 92.40%, Relevant 6.30%, Mentioned 0.00%, Incorrect 1.30%, Invalid 0.00%, Relevance Score 95...