Can Large Language Models Reliably Code Qualitative Humanitarian Data? A Benchmark Study Against Human Expert Adjudication

Ella Boone; Jerome Marston; Patrick Vinck; Phuong N Pham; Salomé Garnier; Tino Kreutzer

arxiv: 2606.26541 · v1 · pith:C3MX2TKWnew · submitted 2026-06-25 · 💻 cs.LG · cs.CY

Jerome Marston, Tino Kreutzer, Salomé Garnier, Ella Boone, Phuong N Pham, Patrick Vinck

Pith reviewed 2026-06-26 05:17 UTC · model grok-4.3

keywords: large language models · qualitative coding · humanitarian data · deductive coding · inter-rater reliability · Krippendorff's alpha · synthetic transcripts

The pith

Multiple large language models achieve deductive coding reliability on humanitarian transcripts comparable to experienced human coders when using structured prompts and reasoning-enabled setups.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large language models can reliably interpret and code qualitative data from affected populations in humanitarian contexts. It compares 46 models against a human gold standard on 150 synthetic transcripts, measuring agreement with Krippendorff's alpha, analyzing coding discrepancies, and checking performance on criteria such as indirect expression and complex needs hierarchies. A sympathetic reader would care because humanitarian organizations often lack staff and time to analyze nuanced accounts of need at scale. If the central claim holds, LLMs could expand analytical capacity while still requiring human oversight rather than acting as full substitutes. The study stresses that overall reliability numbers are not enough on their own for real-world use and points to variation across specific themes like protection concerns.

Core claim

Multiple LLMs can perform deductive coding at reliability levels comparable to experienced human coders, especially when structured prompts and reasoning-enabled configurations are used. At the same time, aggregate reliability metrics alone are insufficient for deployment decisions. Models varied in recognizing needs expressed indirectly, needs outside predefined categories, and protection-relevant concerns such as physical safety and discrimination.
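The source does not reproduce the paper's prompts, so the following is only a minimal sketch of what a structured, codebook-driven prompt could look like. The codebook entries, the `OTHER` catch-all code, the function name `build_coding_prompt`, and the JSON reply shape are all assumptions made for illustration.

```python
import json

# Hypothetical codebook: these codes and definitions are illustrative only
# and are not taken from the paper.
EXAMPLE_CODEBOOK = {
    "SHELTER": "Needs related to housing, tents, or protection from weather.",
    "PHYSICAL_SAFETY": "Threats of violence or harm, including indirect mentions.",
    "OTHER": "A need that fits none of the codes above; describe it briefly.",
}


def build_coding_prompt(codebook: dict[str, str], transcript: str) -> str:
    """Assemble a structured deductive-coding prompt from a codebook."""
    lines = [
        "Task: assign codes from the codebook to the transcript below.",
        "Use only the listed codes. Needs may be expressed indirectly.",
        "",
        "Codebook:",
    ]
    for code, definition in codebook.items():
        lines.append(f"- {code}: {definition}")
    # Ask for machine-readable output so replies can be parsed and scored.
    schema = {"codes": ["<CODE>", "..."], "rationale": "<one sentence>"}
    lines += [
        "",
        "Transcript:",
        transcript.strip(),
        "",
        "Reply with JSON only, in this shape:",
        json.dumps(schema),
    ]
    return "\n".join(lines)
```

Keeping the codebook as data rather than free text means the same definitions can be shown to human coders and to each model, which is one plausible way to hold the comparison constant.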

What carries the argument

The benchmark comparison of 46 LLMs against human expert adjudication on 150 high-fidelity synthetic humanitarian transcripts, using inter-rater reliability testing with Krippendorff's alpha together with discrepancy analysis and humanitarian-specific qualitative criteria.
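For readers unfamiliar with the statistic, here is a minimal sketch of Krippendorff's alpha for nominal codes, computed from a coincidence matrix. It assumes each rater assigns at most one code per unit; the source does not say which variant of alpha the paper used or how multi-code transcripts were handled, and the function name is hypothetical.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    `units` is a list of lists: each inner list holds the codes assigned to
    one unit by the raters who coded it (missing ratings are simply omitted).
    """
    coincidences = Counter()
    for codes in units:
        m = len(codes)
        if m < 2:
            continue  # a unit coded by fewer than two raters forms no pairs
        # Every ordered pair of ratings within a unit, weighted by 1/(m - 1).
        for a, b in permutations(codes, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals n_c and the grand total n of pairable ratings.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least two pairable ratings")

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one code is used")
    return 1.0 - observed / expected


# Toy example (made-up codes): one disagreement in four units.
toy_units = [["food", "food"], ["food", "water"],
             ["water", "water"], ["water", "water"]]
toy_alpha = krippendorff_alpha_nominal(toy_units)  # equals 8/15
```

In a two-rater comparison such as one model against the gold standard, each inner list would hold exactly two codes: the model's and the adjudicated human code.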

If this is right

LLMs can materially expand humanitarian analytical capacity when paired with structured codebooks and reasoning-enabled models.
Appropriate deployment requires attention to theme-specific performance and tiered oversight focused on categories where miscoding would have the greatest programmatic consequences.
Open-weights models deployed on self-hosted infrastructure provide a viable path for combining analytical scalability with stronger data governance.
LLMs should not be treated as substitutes for human judgment in sensitive humanitarian coding tasks.
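The tiered-oversight idea in the points above can be sketched as a simple routing rule. This is an illustration under stated assumptions, not the paper's procedure: the tier names, the theme labels, and the per-theme alpha values are hypothetical, and the 0.667 and 0.80 cut-offs are the conventional Krippendorff thresholds rather than values reported in the source.

```python
def assign_review_tier(theme, alpha, high_stakes_themes,
                       tentative_floor=0.667, reliable_floor=0.80):
    """Pick a human-oversight tier for one coding theme.

    High-stakes themes always get full review, regardless of reliability,
    because miscoding them has the greatest programmatic consequences.
    """
    if theme in high_stakes_themes or alpha < tentative_floor:
        return "full_human_review"
    if alpha < reliable_floor:
        return "sampled_human_review"
    return "spot_check"


# Hypothetical per-theme reliability figures, for illustration only.
per_theme_alpha = {"physical_safety": 0.91, "shelter": 0.88, "livelihoods": 0.72}
high_stakes = {"physical_safety", "discrimination"}

tiers = {
    theme: assign_review_tier(theme, alpha, high_stakes)
    for theme, alpha in per_theme_alpha.items()
}
```

Routing on theme before reliability reflects the point that a strong aggregate score does not by itself justify light oversight of protection-relevant categories.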

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same structured-prompt and reasoning configuration approach could be tested on qualitative coding tasks in adjacent high-stakes domains such as legal intake or medical case notes.
Performance gaps on indirect communication suggest that targeted fine-tuning or retrieval-augmented prompts on real examples of non-standard phrasing might close remaining differences with human coders.
If synthetic transcripts under-represent certain cultural or linguistic patterns, real-world reliability could drop unless the evaluation is repeated on diverse field data.

Load-bearing premise

The 150 high-fidelity synthetic humanitarian transcripts sufficiently capture the variability, indirect communication, and complex needs hierarchies present in real affected-population data.

What would settle it

Apply the same LLMs, prompts, and evaluation protocol to a fresh collection of real field-collected humanitarian transcripts and check whether Krippendorff's alpha stays at human-comparable levels while discrepancy patterns on indirect and protection themes remain stable.

Original abstract

Data from affected populations are crucial for informing humanitarian response, but their value depends on timely and consistent interpretation of nuanced accounts of need. Humanitarian organizations often lack the staff, time, and specialist expertise required to analyze this information at scale. Large language models (LLMs) may expand this capacity, but their reliability for coding qualitative humanitarian data has not been directly established. This benchmark study compares 46 LLMs to a human Gold Standard using 150 high-fidelity synthetic humanitarian transcripts. Evaluation combined inter-rater reliability testing with Krippendorff's alpha, discrepancy analysis distinguishing correct, near-correct, and incorrect codes, and qualitative assessment across humanitarian-specific criteria including discrimination, complex needs hierarchies, and non-standard communication styles. The authors find that multiple LLMs can perform deductive coding at reliability levels comparable to experienced human coders, especially when structured prompts and reasoning-enabled configurations are used. At the same time, aggregate reliability metrics alone are insufficient for deployment decisions. Models varied in recognizing needs expressed indirectly, needs outside predefined categories, and protection-relevant concerns such as physical safety and discrimination. These findings suggest that LLMs can materially expand humanitarian analytical capacity, but not as substitutes for human judgment. Appropriate use requires structured codebooks, reasoning-enabled models, attention to theme-specific performance, and tiered oversight focused on categories where miscoding would have the greatest programmatic consequences. For sensitive humanitarian data, open-weights models deployed on self-hosted infrastructure may offer a viable path for combining analytical scalability with stronger data governance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

LLMs reach human-comparable coding reliability on these synthetics with structured prompts, but the real-data gap is the main limit.

The main thing to know is that several LLMs hit reliability levels close to experienced human coders on deductive coding of 150 synthetic humanitarian transcripts, especially with structured prompts and reasoning setups, but the results stay conditional on how well those synthetics represent actual field data.

The paper does solid empirical work by testing 46 models against a human gold standard. It uses Krippendorff's alpha for overall reliability, breaks errors into correct/near-correct/incorrect categories, and adds qualitative checks on humanitarian angles like indirect expression, needs outside the codebook, and protection concerns such as discrimination or safety. That combination gives more actionable detail than accuracy alone, and the practical takeaway on prompt design and model choice is useful.
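A discrepancy breakdown of this kind can be tallied in a few lines. The sketch below assumes a hypothetical rule for "near-correct" (the model's code appears in a predefined set of acceptable neighbours of the gold code); the source does not state how the paper defines that category, and the codes shown are made up.

```python
from collections import Counter

LABELS = ("correct", "near-correct", "incorrect")


def classify_code(model_code, gold_code, near_matches):
    """Label one model code against the gold-standard code."""
    if model_code == gold_code:
        return "correct"
    # Assumption: near_matches maps a gold code to codes judged close to it.
    if model_code in near_matches.get(gold_code, set()):
        return "near-correct"
    return "incorrect"


def discrepancy_shares(pairs, near_matches):
    """Return the share of each label over (model_code, gold_code) pairs."""
    counts = Counter(classify_code(m, g, near_matches) for m, g in pairs)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no coded pairs to summarise")
    return {label: counts[label] / total for label in LABELS}


# Made-up example: two exact matches, one near match, one miss.
example_near = {"shelter": {"housing_repair"}}
example_pairs = [
    ("shelter", "shelter"),
    ("food", "food"),
    ("housing_repair", "shelter"),
    ("food", "physical_safety"),
]
example_shares = discrepancy_shares(example_pairs, example_near)
```

Reporting the three shares alongside alpha shows whether disagreements are mostly adjacent-category slips or outright misses, which a single reliability coefficient cannot distinguish.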

The soft spot is the exclusive use of synthetic transcripts. The design treats these as stand-ins for real affected-population accounts with their indirect styles and complex hierarchies, yet no parallel real-data annotation or hold-out set is reported. If the synthetics under-represent that variability, the comparability claim does not transfer to the deployment setting the authors have in mind. The paper flags the need for theme-specific oversight, which is fair, but it does not close the generalizability question.

This is worth a serious referee for humanitarian practitioners and applied NLP researchers who need benchmarks on this task. The evaluation is careful and the limits are stated plainly, so review can focus on whether real transcripts change the picture.

Referee Report

1 major / 1 minor

Summary. The manuscript presents a benchmark study comparing 46 LLMs to a human gold standard on deductive coding of 150 high-fidelity synthetic humanitarian transcripts. Evaluation uses Krippendorff's alpha for inter-rater reliability, discrepancy analysis (correct/near-correct/incorrect), and qualitative checks on humanitarian criteria such as indirect expression, complex needs hierarchies, and protection concerns. The central claim is that multiple LLMs (especially with structured prompts and reasoning-enabled settings) reach reliability levels comparable to experienced human coders, while aggregate metrics alone are insufficient and theme-specific performance varies; the authors recommend structured codebooks, model selection, and tiered human oversight rather than substitution.

Significance. If the comparability result holds beyond the synthetic setting, the work could materially increase analytical capacity in resource-constrained humanitarian organizations. The combination of statistical reliability testing with domain-specific qualitative assessment offers a practical evaluation template. Explicit attention to theme-specific risks (e.g., protection concerns) and the call for tiered oversight provide actionable deployment guidance. The scale of models tested (46) and transparent disaggregation of performance strengthen the contribution as an empirical benchmark.

major comments (1)

[Methods] Methods section (evaluation design): The benchmark and all reported Krippendorff alphas, discrepancy breakdowns, and theme-specific findings rest exclusively on 150 synthetic transcripts. No real-data hold-out, parallel annotation of actual affected-population transcripts, or distributional comparison on dimensions such as indirect communication or protection-relevant concerns is reported. Because the paper frames the results as relevant to 'humanitarian response' and 'affected populations,' this untested distributional match is load-bearing for the headline claim of human-comparable reliability in the target setting.

minor comments (1)

[Abstract] Abstract: The opening sentence could explicitly note that all quantitative results derive from synthetic transcripts to prevent readers from overgeneralizing the comparability finding.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive review and for identifying a key aspect of our evaluation design. We respond to the major comment below and have incorporated clarifications into the revised manuscript where appropriate.

Referee: [Methods] Methods section (evaluation design): The benchmark and all reported Krippendorff alphas, discrepancy breakdowns, and theme-specific findings rest exclusively on 150 synthetic transcripts. No real-data hold-out, parallel annotation of actual affected-population transcripts, or distributional comparison on dimensions such as indirect communication or protection-relevant concerns is reported. Because the paper frames the results as relevant to 'humanitarian response' and 'affected populations,' this untested distributional match is load-bearing for the headline claim of human-comparable reliability in the target setting.

Authors: We acknowledge that the evaluation relies exclusively on the 150 high-fidelity synthetic transcripts and that no parallel real-data annotation or explicit distributional comparison is provided. This design choice stems from the ethical and legal constraints surrounding real humanitarian transcripts, which routinely contain sensitive protection-related information (e.g., physical safety risks, discrimination) that cannot be released or used for open benchmarking without violating data-protection standards and informed-consent requirements. The synthetic transcripts were constructed with direct input from humanitarian domain experts specifically to replicate the linguistic and thematic features highlighted in the referee comment (indirect expression, complex needs hierarchies, non-standard communication, and protection concerns) so that the benchmark could isolate model performance on precisely those dimensions. We agree that the absence of a real-data hold-out constitutes a limitation for direct extrapolation to operational settings. In the revised manuscript we have added an explicit subsection in the Discussion that (a) states the ethical rationale for synthetic data, (b) summarizes the expert-driven construction process used to approximate real distributions, and (c) outlines the conditions under which future real-world validation would be feasible. These additions make the scope and generalizability of the claims more transparent without altering the reported results or the core methodological contribution.

Revision: partial

Circularity Check

0 steps flagged

No circularity: pure empirical benchmark against external gold standard

The paper is an empirical benchmark study that directly measures LLM performance against a human Gold Standard on 150 synthetic transcripts. No derivations, equations, fitted parameters renamed as predictions, or self-citation chains appear in the methods or results. All reliability metrics (Krippendorff's alpha, discrepancy analysis) are computed from observed agreement with external human adjudication, with no reduction to inputs by construction. The synthetic data choice affects generalizability but does not create circularity in the reported claims.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that synthetic transcripts are representative and that Krippendorff's alpha plus discrepancy categories are sufficient to establish deployment readiness.

axioms (2)

[standard math] Krippendorff's alpha is an appropriate measure for comparing LLM and human coding reliability in this multi-rater setting. Invoked for the primary quantitative evaluation against the human gold standard.
[domain assumption] High-fidelity synthetic transcripts capture the key characteristics of real humanitarian data, including indirect expression and non-standard communication. The study design uses 150 synthetic transcripts as the sole test set.

pith-pipeline@v0.9.1-grok · 5823 in / 1352 out tokens · 30686 ms · 2026-06-26T05:17:24.742165+00:00 · methodology

Reference graph

Works this paper leans on

7 extracted references · 6 canonical work pages

[1]

http://dx.doi.org/10.3362/9781780448626 32 Alqazlan, L., Fang, Z., Castelle, M., & Procter, R

ACAPS (2014) Humanitarian Needs Assessment: The Good Enough Guide, The Assessment Capacities Project (ACAPS), Emergency Capacity Building Project (ECB) and Practical Action Publishing, Rugby, UK. http://dx.doi.org/10.3362/9781780448626 32 Alqazlan, L., Fang, Z., Castelle, M., & Procter, R. (2025). A novel, human-in-the-loop computational grounded theory f...

work page doi:10.3362/9781780448626 2014
[2]

https://doi.org/10.1186/s12910- 025-01189-2 Krippendorff, K. (2019). Content analysis: an introduction to its methodology (4th ed.). SAGE. Li, Z., Dohan, D., & Abramson, C. M. (2021). Qualitative coding in the computational era: A hybrid approach to improve reliability and reduce effort for coding ethnographic interviews. Socius,

work page doi:10.1186/s12910- 2019
[3]

https://doi.org/10.1177/23780231211062345 Lin, H., & Zhang, Y . (2025). Navigating the Risks of Using Large Language Models for Text Annotation in Social Science Research. Social Science Computer Review, 44(3), 403-427. https://doi.org/10.1177/08944393251366243 33 Maitland, C., & Xu, Y . (2015). A social informatics analysis of refugee mobile phone use: A...

work page doi:10.1177/23780231211062345 2025
[4]

arXiv preprint https://doi.org/10.48550/arXiv.2511.14528 Rodríguez Triana, M

LLM-Assisted Thematic Analysis: Opportunities, Limitations, and Recommendations. arXiv preprint https://doi.org/10.48550/arXiv.2511.14528 Rodríguez Triana, M. J., Saban, M., Asensio-Pérez, J. I., Prieto, L. P., Haba-Ortuo, I., Villa- Torrano, C., & Gillet, D. (2025, September). Code-Aware LLM Prompting in Deductive Qualitative Analysis: A Study in Multi-f...

work page doi:10.48550/arxiv.2511.14528 2025
[5]

The Sphere Handbook: Humanitarian Charter and Minimum Standards in Humanitarian Response, fourth edition, Geneva, Switzerland,

https://www.researchgate.net/publication/395189459_Code- Aware_LLM_Prompting_in_Deductive_Qualitative_Analysis_A_Study_in_Multi- framework_Analysis_of_Learning_Designs Sphere Association. The Sphere Handbook: Humanitarian Charter and Minimum Standards in Humanitarian Response, fourth edition, Geneva, Switzerland,

arXiv
[6]

Vera Liao, Rania Abdelghani, and Pierre-Yves Oudeyer

www.spherestandards.org/handbook Xiao, Z., Yuan, X., Liao, Q. V ., Abdelghani, R., and Oudeyer, P.-Y . (2023). Supporting qualitative analysis with large language models: combining codebook with GPT-3 for deductive coding. 28th International Conference on Intelligent User Interfaces, 75–78. https://dl.acm.org/doi/pdf/10.1145/3581754.3584136 Zhang, H., Wu,...

work page doi:10.1145/3581754.3584136 2023
[7]

What needs do you or your household have in this camp that our assistance has not met so far?

https://doi.org/10.54941/ahfe1006232

Appendix 1. Krippendorff's Alpha and Discrepancy Analysis Results across All Models

Provider | Model* | Gold Standard | Relevant | Mentioned | Incorrect | Invalid | Relevance Score | K's alpha
Google | Gemini-3.1-pro (reasoning-medium) | 92.60% | 6.60% | 0.00% | 0.90% | 0.00% | 95.90% | 0.922
Google | Gemini-3.1-pro | 92.40% | 6.30% | 0.00% | 1.30% | 0.00% | 95...

work page doi:10.54941/ahfe1006232
