pith. machine review for the scientific record.

arxiv: 2604.07883 · v1 · submitted 2026-04-09 · 💻 cs.AI · cs.CL · cs.CY · cs.MA

Recognition: unknown

An Agentic Evaluation Architecture for Historical Bias Detection in Educational Textbooks

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:09 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.CY · cs.MA
keywords agentic AI · bias detection · history textbooks · educational content audit · multi-agent systems · source attribution · pedagogical evaluation

The pith

A multi-agent AI architecture with a source attribution protocol reduces over-penalization when auditing history textbooks for bias.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes an evaluation system built from a multimodal screening agent, a jury of five specialized agents, and a meta-agent that synthesizes verdicts while escalating difficult cases to humans. It tests the system on 270 excerpts from Romanian upper-secondary history textbooks and finds that 83.3 percent are judged pedagogically acceptable with average severity 2.9 out of 7, far milder than the mean of 5.4 produced by direct zero-shot prompting. A blind comparison with 18 human evaluators shows the independent deliberation mode was preferred in 64.8 percent of head-to-head trials. The work matters because manual review of entire textbooks is slow and costly, while single large language models tend to flag too much acceptable content as biased.
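To make the three-stage flow concrete, here is a minimal Python sketch of the control flow described above; the function names, interfaces, and the escalation rule are assumptions for illustration, not the paper's published implementation.

    # A minimal sketch of the three-stage flow, under assumed interfaces:
    # `screen`, each jury `agent`, and `synthesize` stand in for LLM calls
    # that the paper does not publish in this summary.
    def evaluate_excerpt(excerpt, screen, jury, synthesize, escalate, spread_limit=2):
        """Screen an excerpt, collect jury severities (1-7), synthesize a verdict."""
        if not screen(excerpt):                           # stage 1: multimodal screening agent
            return {"flagged": False, "severity": None}
        severities = [agent(excerpt) for agent in jury]   # stage 2: five jury agents
        # Hypothetical escalation rule: wide jury disagreement goes to a human.
        if max(severities) - min(severities) > spread_limit:
            return escalate(excerpt, severities)
        return synthesize(excerpt, severities)            # stage 3: meta-agent verdict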

Core claim

The author claims that an agentic evaluation architecture, built around a heterogeneous jury of five evaluative agents plus a meta-agent for synthesis and supported by a Source Attribution Protocol that separates textbook narrative from quoted historical sources, produces more balanced judgments of pedagogical acceptability than single-model baselines. On the Romanian textbook corpus this yields an 83.3 percent acceptable rate at mean severity 2.9/7 versus 5.4/7 for zero-shot evaluation, and the independent deliberation configuration is chosen by humans in 64.8 percent of blind pairwise comparisons.

What carries the argument

The heterogeneous jury of five evaluative agents together with a meta-agent for verdict synthesis, enabled by the Source Attribution Protocol that prevents misattributing quoted historical sources as textbook narrative.

If this is right

  • Textbooks can be audited at roughly two dollars each rather than requiring full manual review.
  • Agentic deliberation lowers the mean severity score assigned to content and therefore reduces over-flagging of acceptable material.
  • Human evaluators prefer the deliberative configuration over both heuristic variants and zero-shot prompting in blind tests.
  • The architecture can serve as a scalable decision-support tool for ministries or boards that approve educational materials.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same jury-plus-protocol design could be applied to textbooks in other national contexts or non-history subjects where selective framing is a concern.
  • Escalation thresholds could be tuned so that only a small fraction of excerpts require human review, creating a practical hybrid workflow (a sketch of this tuning follows the list).
  • Repeated application across successive editions of a textbook series might reveal whether authors respond to the feedback by changing how they quote sources.
  • Cross-lingual tests would show whether the Source Attribution Protocol continues to work when historical quotations appear in translation.
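On the escalation-threshold point above, one hedged way to picture the tuning: pick the smallest jury-disagreement spread that keeps escalations within a human-review budget. The spread statistic, the 5 percent budget, and all scores below are our assumptions, not the paper's.

    # Hypothetical threshold tuning for the hybrid workflow. All numbers
    # are illustrative.
    def tune_threshold(jury_severities, budget=0.05):
        """jury_severities: one list of per-agent severities (1-7) per excerpt."""
        spreads = [max(s) - min(s) for s in jury_severities]
        for threshold in range(0, 7):   # a spread on a 1-7 scale is at most 6,
            escalated = sum(spread > threshold for spread in spreads)
            if escalated / len(spreads) <= budget:   # so the loop always returns
                return threshold, escalated

    # Toy corpus of per-excerpt jury score vectors.
    print(tune_threshold([[2, 3, 3, 2, 3], [1, 6, 2, 2, 3], [3, 3, 4, 3, 3]]))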

Load-bearing premise

The jury agents and meta-agent can combine their outputs into accurate verdicts on pedagogical acceptability without introducing systematic biases of their own, and the Source Attribution Protocol reliably separates textbook narrative from quoted sources in every case.

What would settle it

A follow-up study in which professional historians independently rate the same 270 excerpts for pedagogical acceptability and the resulting agreement rates with the agentic system are compared against agreement rates with the zero-shot baseline.
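As a sketch of what such a settling study could measure, assuming binary acceptable/unacceptable labels; the statistic (raw percent agreement) and all values below are illustrative assumptions, not data from the paper.

    # Hypothetical agreement comparison: historians, the agentic system, and
    # the zero-shot baseline label the same excerpts acceptable (True) or not.
    def agreement(labels_a, labels_b):
        return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)

    historians = [True, True, False, True, False, True]   # toy ground truth
    agentic    = [True, True, False, True, True,  True]
    zero_shot  = [False, True, False, False, False, True]

    print("agentic vs historians:  ", agreement(historians, agentic))    # 0.83
    print("zero-shot vs historians:", agreement(historians, zero_shot))  # 0.67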

Figures

Figures reproduced from arXiv: 2604.07883 by Adrian-Marius Dumitran, Gabriel Stefan.

Figure 1. The three-stage agentic evaluation architecture: a screening agent, a heterogeneous jury, and a meta-agent for verdict synthesis.
Figure 2. A generated HTML report highlighting extracted historical bias, taxonomy categorization, and assigned severity scores.
Figure 3. Severity distributions for the agentic pipeline (dark gray, µ = 2.9) and zero-shot baseline (light gray, µ = 5.4). Samples differ in size and configuration. The distribution is strongly concentrated around severity 3, with a thin right tail (2 excerpts at severity 5; none at 6 or 7), indicating that the agentic architecture is well calibrated for deployment: the jury layer avoids both under-penalizatio…
Figure 4. Evaluator preferences (N = 54 comparisons). Independent Deliberation was consistently preferred in randomized blind testing across all textbooks.
Original abstract

History textbooks often contain implicit biases, nationalist framing, and selective omissions that are difficult to audit at scale. We propose an agentic evaluation architecture comprising a multimodal screening agent, a heterogeneous jury of five evaluative agents, and a meta-agent for verdict synthesis and human escalation. A central contribution is a Source Attribution Protocol that distinguishes textbook narrative from quoted historical sources, preventing the misattribution that causes systematic false positives in single-model evaluators. In an empirical study on Romanian upper-secondary history textbooks, 83.3% of 270 screened excerpts were classified as pedagogically acceptable (mean severity 2.9/7), versus 5.4/7 under a zero-shot baseline, demonstrating that agentic deliberation mitigates over-penalization. In a blind human evaluation (18 evaluators, 54 comparisons), the Independent Deliberation configuration was preferred in 64.8% of cases over both a heuristic variant and the zero-shot baseline. At approximately $2 per textbook, these results position agentic evaluation architectures as economically viable decision-support tools for educational governance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript presents an agentic evaluation architecture for auditing historical biases in educational textbooks. It comprises a multimodal screening agent, a heterogeneous jury of five evaluative agents, a meta-agent for verdict synthesis and human escalation, and a Source Attribution Protocol to distinguish textbook narrative from quoted historical sources. In an empirical study on 270 excerpts from Romanian upper-secondary history textbooks, the system classifies 83.3% as pedagogically acceptable (mean severity 2.9/7) versus 5.4/7 for a zero-shot baseline. A blind human evaluation (18 evaluators, 54 comparisons) finds the Independent Deliberation configuration preferred in 64.8% of cases. The work positions the approach as an economically viable (~$2 per textbook) decision-support tool for educational governance.

Significance. If the agentic components reliably track expert pedagogical judgment rather than model priors, this architecture could enable scalable, low-cost auditing of biases in history textbooks, a domain where manual review is resource-intensive. The Source Attribution Protocol directly targets a known failure mode in LLM evaluators, and the reported cost-effectiveness plus human preference data strengthen the case for practical utility. The empirical comparison to a zero-shot baseline provides a clear reference point for assessing the value of deliberation.

major comments (3)
  1. [Empirical Evaluation] Empirical Evaluation section: The headline result that 83.3% of excerpts are acceptable (mean severity 2.9/7) versus the baseline hinges on the heterogeneous jury plus meta-agent producing verdicts that align with actual pedagogical acceptability. No inter-agent consistency metrics, prompt details, or controls for evaluator-model biases are reported, leaving open whether the reduction in severity reflects genuine mitigation or collective leniency.
  2. [Source Attribution Protocol] Source Attribution Protocol subsection: This protocol is presented as the key safeguard against misattributing quoted sources and thereby reducing false positives. However, no external validation (e.g., agreement with historian annotations on the 270 excerpts or a held-out subset) is provided, which is load-bearing for attributing the observed improvement over the zero-shot baseline to the protocol rather than other system elements.
  3. [Human Evaluation] Human Evaluation section: The 64.8% preference for Independent Deliberation is based on 18 evaluators and 54 comparisons. While useful for relative ranking of configurations, the study does not report inter-rater agreement with domain experts or accuracy against ground-truth pedagogical labels, so it cannot confirm that lower severity scores correspond to improved correctness rather than evaluator preference for milder outputs.
minor comments (2)
  1. [Abstract] The 1-7 severity scale is referenced repeatedly but never defined (e.g., what distinguishes level 3 from level 4). Adding an explicit rubric or example annotations would improve interpretability of the mean scores.
  2. A summary table listing the five jury agents, their individual evaluation criteria, and the meta-agent's synthesis rules would clarify the architecture and aid reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below with point-by-point responses, indicating planned revisions to improve transparency and rigor.

Point-by-point responses
  1. Referee: [Empirical Evaluation] Empirical Evaluation section: The headline result that 83.3% of excerpts are acceptable (mean severity 2.9/7) versus the baseline hinges on the heterogeneous jury plus meta-agent producing verdicts that align with actual pedagogical acceptability. No inter-agent consistency metrics, prompt details, or controls for evaluator-model biases are reported, leaving open whether the reduction in severity reflects genuine mitigation or collective leniency.

    Authors: We agree that inter-agent consistency metrics, full prompt details, and explicit discussion of controls for model biases should have been included. In the revised manuscript we will add these elements: pairwise and overall agreement rates among the five jury agents (a sketch of such a metric follows these responses), the complete prompt templates in an appendix, and a dedicated paragraph explaining how the heterogeneous personas plus meta-agent synthesis are intended to mitigate individual model priors. The human preference results provide supporting evidence against simple collective leniency, but we will also add an explicit limitations paragraph acknowledging that absolute alignment with expert pedagogical judgment cannot be proven without ground-truth labels. revision: partial

  2. Referee: [Source Attribution Protocol] Source Attribution Protocol subsection: This protocol is presented as the key safeguard against misattributing quoted sources and thereby reducing false positives. However, no external validation (e.g., agreement with historian annotations on the 270 excerpts or a held-out subset) is provided, which is load-bearing for attributing the observed improvement over the zero-shot baseline to the protocol rather than other system elements.

    Authors: We acknowledge that external validation against historian annotations is absent from the current manuscript and represents a genuine limitation for isolating the protocol's contribution. The protocol is implemented as an explicit, auditable instruction within the agent prompts to separate quoted historical sources from textbook narrative (an illustrative instruction of this kind follows these responses). In the revision we will add a limitations section that clearly states this gap and, resources permitting, include preliminary agreement metrics on a small held-out subset of excerpts. We will also clarify in the text that the reported gains are attributable to the full architecture rather than any single component. revision: partial

  3. Referee: [Human Evaluation] Human Evaluation section: The 64.8% preference for Independent Deliberation is based on 18 evaluators and 54 comparisons. While useful for relative ranking of configurations, the study does not report inter-rater agreement with domain experts or accuracy against ground-truth pedagogical labels, so it cannot confirm that lower severity scores correspond to improved correctness rather than evaluator preference for milder outputs.

    Authors: We will report inter-rater agreement statistics (e.g., Fleiss' kappa) among the 18 evaluators in the revised version. The evaluators were drawn from education and history backgrounds to the extent possible within the study constraints. We agree that accuracy against expert ground-truth labels would be stronger evidence; however, obtaining such labels for the full set of excerpts was outside the scope and resources of this work. The preference data is presented strictly as a relative utility measure. We will revise the discussion and conclusion to avoid any implication of proven correctness and to emphasize the relative improvement over the baseline. revision: partial
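A minimal sketch of the pairwise agreement metric promised in response 1, assuming severity scores on the 1-7 scale; the agent names, scores, and the ±1 tolerance are illustrative choices of ours, not the paper's data.

    # Hypothetical pairwise agreement among jury agents.
    from itertools import combinations

    def pairwise_agreement(scores_by_agent, tolerance=1):
        """scores_by_agent: {agent_name: [severity per excerpt]} on a 1-7 scale."""
        rates = {}
        for a, b in combinations(scores_by_agent, 2):
            pairs = list(zip(scores_by_agent[a], scores_by_agent[b]))
            rates[(a, b)] = sum(abs(x - y) <= tolerance for x, y in pairs) / len(pairs)
        return rates

    print(pairwise_agreement({"juror_a": [3, 2, 5], "juror_b": [3, 3, 4], "juror_c": [2, 2, 2]}))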
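Response 2 describes the protocol as an explicit, auditable instruction embedded in the agent prompts; a hedged guess at what such an instruction could look like follows (the wording is ours, not the authors').

    # Illustrative prompt fragment for a source-attribution instruction; the
    # actual protocol text is not published here, so this is an assumption.
    SOURCE_ATTRIBUTION_INSTRUCTION = (
        "Before scoring, label each sentence of the excerpt as NARRATIVE "
        "(the textbook authors' own voice) or QUOTED SOURCE (a historical "
        "document, speech, or testimony reproduced by the textbook). "
        "Assign bias severity only to NARRATIVE sentences; penalize a "
        "biased QUOTED SOURCE only if the surrounding narrative endorses "
        "it uncritically."
    )

    def build_jury_prompt(excerpt: str) -> str:
        return f"{SOURCE_ATTRIBUTION_INSTRUCTION}\n\nExcerpt:\n{excerpt}\n\nSeverity (1-7):"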
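For reference on response 3, a minimal Fleiss' kappa of the kind the rebuttal proposes, over toy counts (three comparisons, six raters each, three configurations); the numbers are not the study's data.

    # Minimal Fleiss' kappa: counts[i][j] = number of raters placing
    # comparison i in configuration j; assumes equal raters per item.
    def fleiss_kappa(counts):
        n_items = len(counts)
        n_raters = sum(counts[0])
        p_j = [sum(row[j] for row in counts) / (n_items * n_raters)
               for j in range(len(counts[0]))]
        P_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
               for row in counts]
        P_bar = sum(P_i) / n_items
        P_e = sum(p * p for p in p_j)
        return (P_bar - P_e) / (1 - P_e)

    print(fleiss_kappa([[5, 1, 0], [1, 5, 0], [0, 1, 5]]))  # ~0.50 on these toy counts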

Circularity Check

0 steps flagged

No circularity: empirical results from external textbook data and independent human evaluation

Full rationale

The paper presents an agentic architecture and reports direct empirical outcomes from screening 270 excerpts drawn from Romanian upper-secondary history textbooks, plus a separate blind human preference study involving 18 evaluators and 54 comparisons. These measurements (83.3% acceptable rate, severity scores, and 64.8% preference) are obtained by applying the system to external source material and collecting external judgments; they are not obtained by fitting parameters to the target quantities and then re-deriving those same quantities, nor by any self-referential equations or self-citation chains that would force the reported numbers. The Source Attribution Protocol and jury synthesis are architectural components whose performance is evaluated against the external baseline and human data rather than being presupposed by the results themselves.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The architecture rests on assumptions about LLM capabilities for source distinction and bias judgment; no free parameters are explicitly fitted according to the abstract, though the severity scoring implies a chosen 1-7 scale.

axioms (2)
  • domain assumption Multimodal agents can process textbook text and images to screen for bias indicators
    Invoked in the multimodal screening agent description.
  • domain assumption Heterogeneous agents produce complementary evaluations that improve overall accuracy when synthesized
    Central to the jury and meta-agent design.

pith-pipeline@v0.9.0 · 5489 in / 1361 out tokens · 54241 ms · 2026-05-10T17:09:29.464600+00:00 · methodology


Reference graph

Works this paper leans on

33 extracted references · 23 canonical work pages · 12 internal anchors

  1. Apple, M.: Official knowledge: Democratic education in a conservative age, third edition. Routledge (2014). https://doi.org/10.4324/9780203814383
  2. Bai, J., Bai, S., Yang, S., et al.: Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond (2023). https://doi.org/10.48550/arXiv.2308.12966
  3. Chan, C.M., Chen, W., Su, Y., et al.: ChatEval: Towards better LLM evaluations via multi-agent debate. In: ICLR (2024). https://doi.org/10.48550/arXiv.2308.07201
  4. Chen, W., Su, Y., Zuo, et al.: AgentVerse: Facilitating multi-agent collaboration and exploring emergent behaviors in agents. In: ICLR (2024). https://doi.org/10.48550/arXiv.2308.10848
  5. Deep Cogito: Cogito v2.1 671B model card. https://huggingface.co/deepcogito (2025)
  6. DeepSeek-AI, et al.: DeepSeek-V3 technical report (2024). https://doi.org/10.48550/arXiv.2412.19437
  7. Du, Y., Li, S., Torralba, A., Tenenbaum, et al.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023). https://arxiv.org/abs/2305.14325
  8. Hsieh, C.P., Simig, D., et al.: RULER: What's the real context size of your long-context language models? In: Proceedings of EMNLP (2024). https://arxiv.org/abs/2404.06654
  9. Ji, Z., Lee, N., Frieske, R., et al.: Survey of hallucination in natural language generation. ACM Computing Surveys 55(12), 1–38 (2023). https://doi.org/10.1145/3571730
  10. Jiang, A.Q., Sablayrolles, A., Roux, A., et al.: Mixtral of experts (2024). https://doi.org/10.48550/arXiv.2401.04088
  11. Joshi, A., Kale, S., Chandel, et al.: Likert scale: Explored and explained. British Journal of Applied Science & Technology 7(4), 396–403 (2015). https://doi.org/10.9734/BJAST/2015/14975
  12. Kasneci, E., Sessler, K., Küchemann, S., et al.: ChatGPT for good? On opportunities and challenges of large language models for education. Learning and Individual Differences 103, 102274 (2023). https://doi.org/10.1016/j.lindif.2023.102274
  13. Kimi Team, Bai, Y., et al.: Kimi K2: Open agentic intelligence (2025). https://doi.org/10.48550/arXiv.2507.20534
  14. Lewis, P., Perez, E., Piktus, A., et al.: Retrieval-augmented generation for knowledge-intensive NLP tasks. In: NeurIPS (2020). https://doi.org/10.48550/arXiv.2005.11401
  15. Liang, P., Bommasani, R., Lee, T., et al.: Holistic evaluation of language models. In: TMLR (2023). https://doi.org/10.48550/arXiv.2211.09110
  16. Liu, N.F., Lin, K., Chen, J., et al.: Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics 12, 157–173 (2024). https://doi.org/10.1162/tacl_a_00638
  17. Liu, Y., Iter, D., Xu, Y., et al.: G-Eval: NLG evaluation using GPT-4 with better human alignment. In: Proceedings of EMNLP, pp. 2511–2522 (2023). https://doi.org/10.18653/v1/2023.emnlp-main.153
  18. Luckin, R., Holmes, W., Griffiths, M., Forcier, L.B.: Intelligence Unleashed: An Argument for AI in Education. Pearson, London (2016)
  19. Lucy, L., Demszky, D., Bromley, P., Jurafsky, D.: Content analysis of textbooks via natural language processing: Findings on gender, race, and ethnicity in Texas US history textbooks. In: AERA Open (2020). https://doi.org/10.1177/2332858420940312
  20. Meta AI, et al.: Llama 4 Maverick. https://ai.meta.com/llama/ (2025)
  21. Ministerul Educației: Press release no. 109/2025 on the formation of working groups for drafting school curricula. https://www.edu.ro/press_rel_109_2025_grupuri_lucru_programe_scolare_inv_liceal (2025), accessed 2025-02-24
  22. OpenAI: GPT-5.2 technical specifications. https://developers.openai.com/api/docs/models/gpt-5.2 (2026)
  23. OpenAI, Agarwal, S., Ahmad, L., Ai, J., Altman, S., et al.: gpt-oss-120b & gpt-oss-20b model card (2025). https://doi.org/10.48550/arXiv.2508.10925
  24. Pingel, F.: UNESCO Guide on Textbook Research and Textbook Revision. UNESCO, Paris (2010)
  25. Preston, C.C., Colman, A.M.: Optimal number of response categories in rating scales: Reliability, validity, discriminating power, and respondent preferences. Acta Psychologica 104(1), 1–15 (2000). https://doi.org/10.1016/S0001-6918(99)00050-5
  26. Röttger, P., Kirk, H.R., Vidgen, B., et al.: XSTest: A test suite for identifying exaggerated safety behaviours in large language models. In: NAACL (2024). https://doi.org/10.48550/arXiv.2308.01263
  27. Stradling, R.: Multiperspectivity in History Teaching: A Guide for Teachers. Council of Europe Publishing (2003)
  28. UNESCO: Guidance for generative AI in education and research. UNESCO Publishing, Paris (2023)
  29. Wu, Q., Bansal, G., Zhang, et al.: AutoGen: Enabling next-gen LLM applications via multi-agent conversation. arXiv preprint arXiv:2308.08155 (2023). https://doi.org/10.48550/arXiv.2308.08155
  30. Xu, Y., Li, M., Cui, L., et al.: LayoutLM: Pre-training of text and layout for document image understanding. In: ACM SIGKDD (2020). https://doi.org/10.1145/3394486.3403172
  31. Zhai, X., Chu, X., Chai, C.S., et al.: A review of artificial intelligence (AI) in education from 2010 to 2020. Complexity 2021, 1–18 (2021). https://doi.org/10.1155/2021/8812542
  32. Zhang, Y., Li, Y., Cui, L., et al.: Siren's song in the AI ocean: A survey on hallucination in large language models. Computational Linguistics (2025). https://doi.org/10.1162/COLI.a.16
  33. Zheng, L., Chiang, W.L., Sheng, Y., Zhuang, et al.: Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In: NeurIPS, vol. 36 (2023). https://arxiv.org/abs/2306.05685