
arxiv: 2604.03252 · v1 · submitted 2026-03-11 · 💻 cs.CY · cs.CL

Evaluating Digital Inclusiveness of Digital Agri-Food Tools Using Large Language Models: A Comparative Analysis Between Human and AI-Based Evaluations

Pith reviewed 2026-05-15 13:50 UTC · model grok-4.3

keywords digital inclusiveness · large language models · agri-food tools · MDII framework · AI evaluation · comparative analysis

The pith

Large language models can approximate expert human judgments when assessing digital inclusiveness of agricultural tools.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large language models can replicate the slow, expert-driven evaluations of how inclusive digital agri-food tools are for users in regions where digital divides persist. Human assessments under the MDII framework often take months to score tools across multiple dimensions of access and usability. By comparing outputs from four models against existing expert scores, the study measures alignment, temperature effects, and bias patterns. The results indicate that LLMs can match human ratings in some dimensions, but reliability varies with the model and context. This matters because faster evaluations could help expand monitoring where resources are limited.

Core claim

LLMs can generate evaluative outputs that approximate expert judgment in some dimensions of the Multidimensional Digital Inclusiveness Index, although reliability varies across models and contexts, providing early evidence for integrating generative AI into inclusive digital development monitoring.

What carries the argument

The Multidimensional Digital Inclusiveness Index (MDII) is the established human-led framework that supplies the benchmark scores against which the four LLMs (Grok, Gemini, GPT-4o, and GPT-5) are compared for alignment and bias.

Load-bearing premise

Prior human expert evaluations using the MDII framework serve as an accurate and unbiased ground truth for direct comparison with LLM outputs.

What would settle it

Consistent misalignment between LLM and human MDII scores across multiple tools and dimensions would show that the approximation does not hold reliably.

Figures

Figures reproduced from arXiv: 2604.03252 by Carolina Martins, Garcia Mariangel, Githma Pewinya.

Figure 1: Hierarchy of MDII components. Source: Adapted from (Martins, et al., 2024)
Figure 2: Stakeholder groups. Source: Adapted from (Martins, et al., 2024)
Figure 3: AI Rapid Assessment Workflow. Source: Authors.
Figure 4: Context Mapping and data collection process from Evaluators. Source: Authors.
Figure 5: MDII Scores generated by GPT-5 vs. Human-based MDII scores for each tool. Source: Authors.
Figure 6: MDII Scores generated by GPT-5 and GROK vs. Human-based MDII scores for each tool. Source: Authors.

Original abstract

Ensuring digital inclusiveness is a critical priority in agri-food systems, particularly in the Global South, where digital divides persist. The Multidimensional Digital Inclusiveness Index (MDII) offers a comprehensive, human-led framework to assess how inclusive digital agricultural tools (agritools) are. However, the current evaluation process is resource intensive, often requiring months to complete. This study explores whether large language models (LLMs) can support a rapid, AI-enabled assessment of digital inclusiveness, complementing the MDII's existing workflow. Using a comparative analysis, the research benchmarks the performance of four LLMs (Grok, Gemini, GPT-4o, and GPT-5) against prior expert-led evaluations. The study investigates model alignment with human scores, sensitivity to temperature settings, and potential sources of bias. Findings suggest that LLMs can generate evaluative outputs that approximate expert judgment in some dimensions, though reliability varies across models and contexts. This exploratory work provides early evidence for the integration of GenAI into inclusive digital development monitoring, with implications for scaling evaluations in time-sensitive or resource-constrained environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript presents a comparative analysis using four LLMs (Grok, Gemini, GPT-4o, GPT-5) to evaluate digital inclusiveness of agri-food tools via the Multidimensional Digital Inclusiveness Index (MDII) framework. It benchmarks LLM outputs against prior human expert evaluations, examining alignment with human scores, temperature sensitivity, and bias sources, with the central claim that LLMs can approximate expert judgments in some dimensions though reliability varies.

Significance. If validated, the approach could enable faster, more scalable MDII assessments in resource-constrained Global South contexts, shortening evaluations that currently take months. The exploratory design provides early evidence for GenAI integration into digital development monitoring, with potential for time-sensitive applications. Strengths include the direct comparison to independent human baselines and the focus on practical deployment issues such as temperature effects.

major comments (3)
  1. [Methods] No inter-rater reliability statistics (e.g., Cohen’s kappa, ICC, or Fleiss’ kappa) are reported for the human expert MDII scores serving as ground truth. This undermines attribution of LLM-human differences to model performance rather than baseline human variability, which is load-bearing for the approximation claim.
  2. [Results] The quantitative protocol for declaring LLM approximation to human scores is not specified (e.g., no Pearson/Spearman correlations, mean absolute error per MDII dimension, or explicit scoring rubric alignment). Without these metrics, the abstract’s claim of approximation in “some dimensions” cannot be evaluated for statistical or practical significance.
  3. [Discussion] Sensitivity analyses for temperature settings and bias sources are described qualitatively but lack tabulated per-model, per-dimension error breakdowns or statistical tests comparing LLM outputs to the human reference, limiting reproducibility and strength of the reliability-variation conclusion.
minor comments (2)
  1. [Abstract] Key quantitative results (e.g., correlation values or agreement rates) should be included to allow readers to assess the strength of the approximation findings without reading the full text.
  2. The manuscript would benefit from a table summarizing model-specific alignment metrics across MDII dimensions to improve clarity of the comparative results.
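
The alignment metrics named in the comments above are cheap to compute once per-tool scores exist. The following is a minimal sketch, not the authors' protocol. It assumes one human score and one LLM score per tool for each MDII dimension. The dimension names, the score values, and the `alignment_table` helper are hypothetical placeholders, not data from the paper.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    # Raises ZeroDivisionError if either list is constant (correlation undefined).
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def mae(x, y):
    """Mean absolute error between paired scores."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


def alignment_table(human, llm):
    """Per-dimension alignment metrics for one model (hypothetical helper)."""
    return {
        dim: {
            "pearson": pearson(human[dim], llm[dim]),
            "spearman": spearman(human[dim], llm[dim]),
            "mae": mae(human[dim], llm[dim]),
        }
        for dim in human
    }


# Made-up scores for five tools on two placeholder dimensions (illustration only).
HUMAN = {
    "dimension_a": [62.0, 71.0, 55.0, 80.0, 68.0],
    "dimension_b": [40.0, 52.0, 47.0, 66.0, 58.0],
}
LLM = {
    "dimension_a": [60.0, 75.0, 50.0, 78.0, 70.0],
    "dimension_b": [55.0, 45.0, 60.0, 50.0, 62.0],
}

if __name__ == "__main__":
    for dim, row in alignment_table(HUMAN, LLM).items():
        print(dim, {name: round(value, 3) for name, value in row.items()})
```

Reporting rank correlation next to MAE matters here: a model can order tools the same way experts do while still being offset from their scores, and the two numbers separate those cases.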

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for these constructive comments, which help clarify the presentation of our exploratory study. We address each major point below and will revise the manuscript to incorporate the suggested improvements where feasible.

point-by-point responses
  1. Referee: [Methods] No inter-rater reliability statistics (e.g., Cohen’s kappa, ICC, or Fleiss’ kappa) are reported for the human expert MDII scores serving as ground truth. This undermines attribution of LLM-human differences to model performance rather than baseline human variability, which is load-bearing for the approximation claim.

    Authors: The human expert MDII scores are taken from a previously published expert evaluation study that did not report inter-rater reliability metrics, and the raw per-expert scoring data are not available to us. In the revised Methods section we will explicitly state this limitation, describe the source study’s expert consensus process, and discuss its implications for interpreting LLM alignment results, including the possibility of baseline variability.
    Revision: partial

  2. Referee: [Results] The quantitative protocol for declaring LLM approximation to human scores is not specified (e.g., no Pearson/Spearman correlations, mean absolute error per MDII dimension, or explicit scoring rubric alignment). Without these metrics, the abstract’s claim of approximation in “some dimensions” cannot be evaluated for statistical or practical significance.

    Authors: We agree that explicit quantitative metrics are needed. The revised Results section will report Pearson and Spearman correlations as well as mean absolute error for each MDII dimension between LLM and human scores. We will also define and apply clear criteria for “approximation” (e.g., correlation thresholds) and include these statistics with confidence intervals to allow readers to assess the strength of the alignment claims.
    Revision: yes

  3. Referee: [Discussion] Sensitivity analyses for temperature settings and bias sources are described qualitatively but lack tabulated per-model, per-dimension error breakdowns or statistical tests comparing LLM outputs to the human reference, limiting reproducibility and strength of the reliability-variation conclusion.

    Authors: We will expand the Discussion with new tables providing per-model and per-dimension error breakdowns (absolute differences from human scores) for each temperature setting. We will also add statistical comparisons (e.g., paired tests or correlation significance tests) between LLM outputs and the human reference to quantify reliability variations and improve reproducibility.
    Revision: yes
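
The error breakdown and paired comparison promised in the third response could take roughly the following shape. This is a sketch under stated assumptions, not the authors' analysis. The model labels, temperature values, and scores are invented for illustration, and the sign-flip permutation test is one possible choice of paired test that the rebuttal does not itself name.

```python
import random
from statistics import mean


def paired_permutation_pvalue(diffs, n_resamples=10000, seed=0):
    """Two-sided sign-flip permutation test of whether mean(diffs) is zero.

    Under the null hypothesis each paired difference is equally likely to be
    positive or negative, so signs are flipped at random and the resampled
    mean is compared with the observed mean.
    """
    rng = random.Random(seed)
    observed = abs(mean(diffs))
    hits = 0
    for _ in range(n_resamples):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)


def error_breakdown(human, runs):
    """One row per (model, temperature) run (hypothetical helper).

    `human` holds reference scores per tool; `runs` maps a
    (model, temperature) key to that run's scores for the same tools.
    """
    rows = []
    for (model, temperature), scores in sorted(runs.items()):
        diffs = [m - h for h, m in zip(human, scores)]
        rows.append(
            {
                "model": model,
                "temperature": temperature,
                "mae": mean(abs(d) for d in diffs),  # size of the error
                "bias": mean(diffs),  # positive means the LLM scores higher
                "p_value": paired_permutation_pvalue(diffs),
            }
        )
    return rows


# Made-up reference scores and runs for six tools (illustration only).
HUMAN_SCORES = [62.0, 71.0, 55.0, 80.0, 68.0, 49.0]
RUNS = {
    ("model_x", 0.0): [65.0, 74.0, 60.0, 83.0, 70.0, 55.0],
    ("model_x", 1.0): [58.0, 77.0, 51.0, 85.0, 63.0, 52.0],
    ("model_y", 0.0): [61.0, 70.0, 57.0, 79.0, 69.0, 48.0],
}

if __name__ == "__main__":
    for row in error_breakdown(HUMAN_SCORES, RUNS):
        print(row)
```

Keeping `mae` and `bias` in separate columns separates two failure modes: a run can have a large absolute error with near-zero bias (noisy but centered) or a small error that is consistently one-sided.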

Circularity Check

0 steps flagged

No significant circularity; prior human MDII evaluations used as external benchmark

full rationale

The paper benchmarks LLM outputs against prior independent expert-led MDII evaluations on digital inclusiveness of agritools. No derivation chain, equation, or definition reduces LLM performance metrics to the human scores by construction, nor does any fitted parameter get relabeled as a prediction. The central comparison treats the human scores as fixed external reference data. While the manuscript does not report inter-rater reliability statistics for the human evaluations or explicit alignment protocols, this constitutes a methodological limitation rather than circularity. Self-citations to the MDII framework are present but not load-bearing for the LLM-approximation claim, which remains independently testable against the cited prior data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that the MDII framework and prior human evaluations provide a reliable benchmark, with no free parameters or invented entities explicitly introduced in the abstract.

axioms (1)
  • domain assumption: The Multidimensional Digital Inclusiveness Index (MDII) offers a valid and comprehensive human-led framework for assessing digital inclusiveness of agritools.
    The entire comparison depends on treating MDII human evaluations as the reference standard.
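
If per-expert ratings were ever released, the reference-standard assumption could be probed with an agreement statistic such as Cohen's kappa. The sketch below is only illustrative. It assumes two raters who assign each tool to a categorical band, and the band labels and ratings are hypothetical. The source does not say how the MDII experts recorded their scores.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and
    p_e is the agreement expected by chance from each rater's label frequencies.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Made-up inclusiveness bands from two hypothetical experts for six tools.
EXPERT_1 = ["low", "medium", "high", "medium", "low", "high"]
EXPERT_2 = ["low", "medium", "medium", "medium", "low", "high"]

if __name__ == "__main__":
    print(round(cohens_kappa(EXPERT_1, EXPERT_2), 3))
```

A kappa near 1 would support treating the human scores as a stable benchmark. A low value would mean part of any LLM-human gap could reflect disagreement among the experts themselves.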


