Evaluating Digital Inclusiveness of Digital Agri-Food Tools Using Large Language Models: A Comparative Analysis Between Human and AI-Based Evaluations
Pith reviewed 2026-05-15 13:50 UTC · model grok-4.3
The pith
Large language models can approximate expert human judgments when assessing digital inclusiveness of agricultural tools.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LLMs can generate evaluative outputs that approximate expert judgment in some dimensions of the Multidimensional Digital Inclusiveness Index, although reliability varies across models and contexts, providing early evidence for integrating generative AI into inclusive digital development monitoring.
What carries the argument
The Multidimensional Digital Inclusiveness Index (MDII) is the established human-led framework that supplies the benchmark scores against which the four LLMs (Grok, Gemini, GPT-4o, and GPT-5) are compared for alignment and bias.
Load-bearing premise
Prior human expert evaluations using the MDII framework serve as an accurate and unbiased ground truth for direct comparison with LLM outputs.
What would settle it
Consistent misalignment between LLM and human MDII scores across multiple tools and dimensions would show that the approximation does not hold reliably.
Original abstract
Ensuring digital inclusiveness is a critical priority in agri-food systems, particularly in the Global South, where digital divides persist. The Multidimensional Digital Inclusiveness Index (MDII) offers a comprehensive, human-led framework to assess how inclusive digital agricultural tools (agritools) are. However, the current evaluation process is resource intensive, often requiring months to complete. This study explores whether large language models (LLMs) can support a rapid, AI-enabled assessment of digital inclusiveness, complementing the MDII's existing workflow. Using a comparative analysis, the research benchmarks the performance of four LLMs (Grok, Gemini, GPT-4o, and GPT-5) against prior expert-led evaluations. The study investigates model alignment with human scores, sensitivity to temperature settings, and potential sources of bias. Findings suggest that LLMs can generate evaluative outputs that approximate expert judgment in some dimensions, though reliability varies across models and contexts. This exploratory work provides early evidence for the integration of GenAI into inclusive digital development monitoring, with implications for scaling evaluations in time-sensitive or resource-constrained environments.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a comparative analysis using four LLMs (Grok, Gemini, GPT-4o, GPT-5) to evaluate digital inclusiveness of agri-food tools via the Multidimensional Digital Inclusiveness Index (MDII) framework. It benchmarks LLM outputs against prior human expert evaluations, examining alignment with human scores, temperature sensitivity, and bias sources, with the central claim that LLMs can approximate expert judgments in some dimensions though reliability varies.
Significance. If validated, the approach could enable faster, more scalable MDII assessments in resource-constrained Global South contexts, shortening evaluations that currently take months. The exploratory design provides early evidence for GenAI integration into digital development monitoring, with potential for time-sensitive applications. Strengths include the direct comparison to independent human baselines and a focus on practical deployment issues such as temperature effects.
major comments (3)
- [Methods] No inter-rater reliability statistics (e.g., Cohen’s kappa, ICC, or Fleiss’ kappa) are reported for the human expert MDII scores serving as ground truth. This undermines attribution of LLM-human differences to model performance rather than baseline human variability, which is load-bearing for the approximation claim.
- [Results] The quantitative protocol for declaring LLM approximation to human scores is not specified (e.g., no Pearson/Spearman correlations, mean absolute error per MDII dimension, or explicit scoring rubric alignment). Without these metrics, the abstract’s claim of approximation in “some dimensions” cannot be evaluated for statistical or practical significance.
- [Discussion] Sensitivity analyses for temperature settings and bias sources are described qualitatively but lack tabulated per-model, per-dimension error breakdowns or statistical tests comparing LLM outputs to the human reference, limiting reproducibility and strength of the reliability-variation conclusion.
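The inter-rater statistic named in the first major comment can be illustrated with a small sketch. The code below computes Cohen's kappa for two raters in plain Python. The expert scores and the 1-to-3 scale are hypothetical placeholders, since the source study's per-expert data are not available; nothing here reflects the paper's actual data or method.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning categorical labels to the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_chance = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_chance == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        return float("nan")
    return (p_observed - p_chance) / (1.0 - p_chance)


# Hypothetical scores from two experts on six tools (1 = low, 3 = high).
expert_1 = [1, 2, 3, 3, 2, 1]
expert_2 = [1, 2, 3, 2, 2, 1]
kappa = cohens_kappa(expert_1, expert_2)
print(f"Cohen's kappa: {kappa:.2f}")  # prints: Cohen's kappa: 0.75
```

Unweighted kappa treats the labels as nominal, so a one-point and a two-point disagreement count the same. For ordinal MDII-style scores, a weighted kappa or an ICC, both of which the referee also mentions, may be the more suitable choice.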
minor comments (2)
- [Abstract] Key quantitative results (e.g., correlation values or agreement rates) should be included to allow readers to assess the strength of the approximation findings without reading the full text.
- The manuscript would benefit from a table summarizing model-specific alignment metrics across MDII dimensions to improve clarity of the comparative results.
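As a rough illustration of the alignment metrics the report asks for, the following sketch computes Pearson correlation, Spearman correlation, and mean absolute error per dimension and prints them as a small table. The dimension names (`dimension_A`, `dimension_B`), the scores, and the 0-to-1 scale are all assumptions made for the example; they are not taken from the paper or from the MDII.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation; NaN when either series is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation as the Pearson correlation of tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def mean_absolute_error(x, y):
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


# Hypothetical scores for five tools on two placeholder dimensions.
human = {"dimension_A": [0.2, 0.4, 0.6, 0.8, 0.9],
         "dimension_B": [0.5, 0.3, 0.7, 0.6, 0.4]}
llm = {"dimension_A": [0.3, 0.4, 0.5, 0.9, 0.8],
       "dimension_B": [0.6, 0.6, 0.5, 0.4, 0.7]}

print(f"{'dimension':<14}{'pearson':>9}{'spearman':>10}{'mae':>7}")
for name in human:
    h, m = human[name], llm[name]
    print(f"{name:<14}{pearson(h, m):>9.2f}{spearman(h, m):>10.2f}"
          f"{mean_absolute_error(h, m):>7.2f}")
```

Reporting a rank-based measure alongside an error measure is a deliberate choice here: a model can order tools correctly while being systematically too generous or too strict, and only the error column would show that.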
Simulated Author's Rebuttal
We thank the referee for these constructive comments, which help clarify the presentation of our exploratory study. We address each major point below and will revise the manuscript to incorporate the suggested improvements where feasible.
- Referee: [Methods] No inter-rater reliability statistics (e.g., Cohen’s kappa, ICC, or Fleiss’ kappa) are reported for the human expert MDII scores serving as ground truth. This undermines attribution of LLM-human differences to model performance rather than baseline human variability, which is load-bearing for the approximation claim.
  Authors: The human expert MDII scores are taken from a previously published expert evaluation study that did not report inter-rater reliability metrics, and the raw per-expert scoring data are not available to us. In the revised Methods section we will explicitly state this limitation, describe the source study’s expert consensus process, and discuss its implications for interpreting LLM alignment results, including the possibility of baseline variability.
  revision: partial
- Referee: [Results] The quantitative protocol for declaring LLM approximation to human scores is not specified (e.g., no Pearson/Spearman correlations, mean absolute error per MDII dimension, or explicit scoring rubric alignment). Without these metrics, the abstract’s claim of approximation in “some dimensions” cannot be evaluated for statistical or practical significance.
  Authors: We agree that explicit quantitative metrics are needed. The revised Results section will report Pearson and Spearman correlations as well as mean absolute error for each MDII dimension between LLM and human scores. We will also define and apply clear criteria for “approximation” (e.g., correlation thresholds) and include these statistics with confidence intervals to allow readers to assess the strength of the alignment claims.
  revision: yes
- Referee: [Discussion] Sensitivity analyses for temperature settings and bias sources are described qualitatively but lack tabulated per-model, per-dimension error breakdowns or statistical tests comparing LLM outputs to the human reference, limiting reproducibility and strength of the reliability-variation conclusion.
  Authors: We will expand the Discussion with new tables providing per-model and per-dimension error breakdowns (absolute differences from human scores) for each temperature setting. We will also add statistical comparisons (e.g., paired tests or correlation significance tests) between LLM outputs and the human reference to quantify reliability variations and improve reproducibility.
  revision: yes
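One possible form for the "paired tests" mentioned in the last response is an exact sign-flip permutation test on paired absolute errors. This is offered only as a sketch of one option, not as the test the authors intend to use. The temperature values (0.0 and 1.0) and the error figures are hypothetical.

```python
from itertools import product


def paired_sign_flip_p_value(errors_a, errors_b):
    """Exact two-sided sign-flip permutation test on paired differences."""
    if len(errors_a) != len(errors_b) or not errors_a:
        raise ValueError("inputs must be paired and non-empty")
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    observed = abs(sum(diffs))
    # Under the null hypothesis each difference is equally likely to carry
    # either sign, so enumerate all 2**n sign assignments (small n only).
    extreme = 0
    for signs in product((1, -1), repeat=len(diffs)):
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # tolerance for float rounding
            extreme += 1
    return extreme / 2 ** len(diffs)


# Hypothetical absolute errors |LLM - human| for one model on six dimensions,
# at two assumed temperature settings.
abs_error_temp_0_0 = [0.10, 0.05, 0.20, 0.15, 0.10, 0.05]
abs_error_temp_1_0 = [0.20, 0.15, 0.25, 0.30, 0.10, 0.20]

p_value = paired_sign_flip_p_value(abs_error_temp_0_0, abs_error_temp_1_0)
print(f"two-sided p-value: {p_value:.3f}")
```

The enumeration grows as 2**n, so this exact version only suits a handful of paired observations; with more dimensions or tools, a sampled permutation test or a standard paired test from a statistics library would be the practical route.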
Circularity Check
No significant circularity; prior human MDII evaluations used as external benchmark
The paper benchmarks LLM outputs against prior independent expert-led MDII evaluations on digital inclusiveness of agritools. No derivation chain, equation, or definition reduces LLM performance metrics to the human scores by construction, nor does any fitted parameter get relabeled as a prediction. The central comparison treats the human scores as fixed external reference data. While the manuscript does not report inter-rater reliability statistics for the human evaluations or explicit alignment protocols, this constitutes a methodological limitation rather than circularity. Self-citations to the MDII framework are present but not load-bearing for the LLM-approximation claim, which remains independently testable against the cited prior data.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The Multidimensional Digital Inclusiveness Index (MDII) offers a valid and comprehensive human-led framework for assessing digital inclusiveness of agritools.