Recognition: unknown
An Agentic Evaluation Architecture for Historical Bias Detection in Educational Textbooks
Pith reviewed 2026-05-10 17:09 UTC · model grok-4.3
The pith
A multi-agent AI architecture with a source attribution protocol reduces over-penalization when auditing history textbooks for bias.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The author claims that an agentic evaluation architecture, built around a heterogeneous jury of five evaluative agents plus a meta-agent for synthesis and supported by a Source Attribution Protocol that separates textbook narrative from quoted historical sources, produces more balanced judgments of pedagogical acceptability than single-model baselines. On the Romanian textbook corpus this yields an 83.3 percent acceptable rate at a mean severity of 2.9/7, versus 5.4/7 under zero-shot evaluation, and the Independent Deliberation configuration is chosen by humans in 64.8 percent of blind pairwise comparisons.
What carries the argument
The heterogeneous jury of five evaluative agents together with a meta-agent for verdict synthesis, enabled by the Source Attribution Protocol that prevents misattributing quoted historical sources as textbook narrative.
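The paper's implementation is not published; the following minimal sketch only illustrates the flow described above, with placeholder persona names, a stubbed `call_llm`, and an invented synthesis rule standing in for the meta-agent.

```python
# A minimal sketch, under assumptions, of the flow described above: five
# persona-conditioned jury agents plus a meta-agent synthesis step. The persona
# names, the stubbed call_llm, and the synthesis rule are placeholders, not the
# authors' implementation.
import re
import statistics

JURY_PERSONAS = [
    "historiographic-accuracy reviewer",
    "multiperspectivity reviewer",
    "framing-and-tone reviewer",
    "omission-and-selection reviewer",
    "pedagogical-suitability reviewer",
]

def call_llm(prompt: str) -> str:
    """Stand-in for the chat-completion call; returns a canned reply so the
    sketch runs end to end."""
    return "severity: 3"

def jury_verdicts(excerpt: str) -> list[int]:
    severities = []
    for persona in JURY_PERSONAS:
        prompt = (f"You are a {persona}. Rate the textbook excerpt below on a "
                  f"1-7 bias-severity scale. Reply as 'severity: <n>'.\n\n{excerpt}")
        reply = call_llm(prompt)
        severities.append(int(re.search(r"severity:\s*(\d)", reply).group(1)))
    return severities

def meta_agent(severities: list[int], acceptable_at_or_below: float = 3.0) -> dict:
    """Toy synthesis rule: mean jury severity plus an acceptability verdict."""
    mean_sev = statistics.mean(severities)
    return {"mean_severity": mean_sev,
            "acceptable": mean_sev <= acceptable_at_or_below,
            "jury_spread": max(severities) - min(severities)}

print(meta_agent(jury_verdicts("The chapter presents the 1918 union as uncontested.")))
```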
If this is right
- Textbooks can be audited at roughly two dollars each rather than requiring full manual review.
- Agentic deliberation lowers the mean severity score assigned to content and therefore reduces over-flagging of acceptable material.
- Human evaluators prefer the deliberative configuration over both heuristic variants and zero-shot prompting in blind tests.
- The architecture can serve as a scalable decision-support tool for ministries or boards that approve educational materials.
Where Pith is reading between the lines
- The same jury-plus-protocol design could be applied to textbooks in other national contexts or non-history subjects where selective framing is a concern.
- Escalation thresholds could be tuned so that only a small fraction of excerpts require human review, creating a practical hybrid workflow; a minimal sketch of such a rule follows this list.
- Repeated application across successive editions of a textbook series might reveal whether authors respond to the feedback by changing how they quote sources.
- Cross-lingual tests would show whether the Source Attribution Protocol continues to work when historical quotations appear in translation.
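An illustrative sketch of the kind of escalation rule meant in the hybrid-workflow point above; the cutoffs and scores below are invented, since the paper does not specify its escalation thresholds.

```python
# Illustrative hybrid escalation rule: route an excerpt to human review when
# mean jury severity or jury disagreement crosses a cutoff. Thresholds and the
# scored excerpts are made up for illustration.

def needs_human_review(mean_severity: float, jury_spread: int,
                       severity_cutoff: float = 5.0, spread_cutoff: int = 3) -> bool:
    return mean_severity >= severity_cutoff or jury_spread >= spread_cutoff

# Example: tune the cutoffs so only a small fraction of excerpts escalate.
scored = [
    {"id": 1, "mean_severity": 2.4, "jury_spread": 1},
    {"id": 2, "mean_severity": 5.6, "jury_spread": 2},
    {"id": 3, "mean_severity": 3.1, "jury_spread": 4},
]
escalated = [e["id"] for e in scored
             if needs_human_review(e["mean_severity"], e["jury_spread"])]
print(f"escalated {len(escalated)}/{len(scored)} excerpts: {escalated}")
```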
Load-bearing premise
The jury agents and meta-agent can combine their outputs into accurate verdicts on pedagogical acceptability without introducing systematic biases of their own, and the Source Attribution Protocol reliably separates textbook narrative from quoted sources in every case.
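To make the premise concrete, a naive sketch of what separating narrative from quoted sources could look like; the actual Source Attribution Protocol operates through prompt instructions rather than this quote-mark heuristic, which is illustration only.

```python
# Naive illustration only: splitting an excerpt into textbook narrative and
# quoted-source spans with a quote-mark heuristic. The real protocol works at
# the prompt level, and its assumed reliability "in every case" is exactly the
# load-bearing premise stated above.
import re

QUOTE_RE = re.compile(r'[„"“]([^"”]+)["”]')

def split_narrative_and_quotes(excerpt: str) -> tuple[str, list[str]]:
    quotes = QUOTE_RE.findall(excerpt)
    narrative = QUOTE_RE.sub("[QUOTED SOURCE]", excerpt)
    return narrative, quotes

excerpt = ('The textbook asserts the decision was inevitable. A 1918 proclamation '
           'is then quoted: "We declare the union of all Romanians."')
narrative, quotes = split_narrative_and_quotes(excerpt)
print(narrative)  # narrative with the quotation masked
print(quotes)     # quoted source text, to be attributed separately
```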
What would settle it
A follow-up study in which professional historians independently rate the same 270 excerpts for pedagogical acceptability and the resulting agreement rates with the agentic system are compared against agreement rates with the zero-shot baseline.
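The comparison such a study would produce reduces to agreement rates between historian labels and each system's verdicts on the same excerpts; a toy sketch on invented labels (Cohen's kappa would be the more standard statistic, but raw agreement keeps the illustration short):

```python
# Toy version of the proposed comparison: raw agreement between historian
# acceptability labels and each system's verdicts. All three label lists are
# invented for illustration.

def agreement(labels_a: list[bool], labels_b: list[bool]) -> float:
    assert len(labels_a) == len(labels_b)
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)

historian = [True, True, False, True, True, False, True, True]   # "acceptable?"
agentic   = [True, True, False, True, True, True,  True, True]
zero_shot = [False, True, False, False, True, True, False, True]

print("agentic vs. historians:  ", agreement(historian, agentic))    # 0.875
print("zero-shot vs. historians:", agreement(historian, zero_shot))  # 0.5
```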
Original abstract
History textbooks often contain implicit biases, nationalist framing, and selective omissions that are difficult to audit at scale. We propose an agentic evaluation architecture comprising a multimodal screening agent, a heterogeneous jury of five evaluative agents, and a meta-agent for verdict synthesis and human escalation. A central contribution is a Source Attribution Protocol that distinguishes textbook narrative from quoted historical sources, preventing the misattribution that causes systematic false positives in single-model evaluators. In an empirical study on Romanian upper-secondary history textbooks, 83.3% of 270 screened excerpts were classified as pedagogically acceptable (mean severity 2.9/7), versus 5.4/7 under a zero-shot baseline, demonstrating that agentic deliberation mitigates over-penalization. In a blind human evaluation (18 evaluators, 54 comparisons), the Independent Deliberation configuration was preferred in 64.8% of cases over both a heuristic variant and the zero-shot baseline. At approximately $2 per textbook, these results position agentic evaluation architectures as economically viable decision-support tools for educational governance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents an agentic evaluation architecture for auditing historical biases in educational textbooks. It comprises a multimodal screening agent, a heterogeneous jury of five evaluative agents, a meta-agent for verdict synthesis and human escalation, and a Source Attribution Protocol to distinguish textbook narrative from quoted historical sources. In an empirical study on 270 excerpts from Romanian upper-secondary history textbooks, the system classifies 83.3% as pedagogically acceptable (mean severity 2.9/7) versus 5.4/7 for a zero-shot baseline. A blind human evaluation (18 evaluators, 54 comparisons) finds the Independent Deliberation configuration preferred in 64.8% of cases. The work positions the approach as an economically viable (~$2 per textbook) decision-support tool for educational governance.
Significance. If the agentic components reliably track expert pedagogical judgment rather than model priors, this architecture could enable scalable, low-cost auditing of biases in history textbooks, a domain where manual review is resource-intensive. The Source Attribution Protocol directly targets a known failure mode in LLM evaluators, and the reported cost-effectiveness plus human preference data strengthen the case for practical utility. The empirical comparison to a zero-shot baseline provides a clear reference point for assessing the value added by deliberation.
major comments (3)
- [Empirical Evaluation] Empirical Evaluation section: The headline result that 83.3% of excerpts are acceptable (mean severity 2.9/7) versus the baseline hinges on the heterogeneous jury plus meta-agent producing verdicts that align with actual pedagogical acceptability. No inter-agent consistency metrics, prompt details, or controls for evaluator-model biases are reported, leaving open whether the reduction in severity reflects genuine mitigation or collective leniency.
- [Source Attribution Protocol] Source Attribution Protocol subsection: This protocol is presented as the key safeguard against misattributing quoted sources and thereby reducing false positives. However, no external validation (e.g., agreement with historian annotations on the 270 excerpts or a held-out subset) is provided, which is load-bearing for attributing the observed improvement over the zero-shot baseline to the protocol rather than other system elements.
- [Human Evaluation] Human Evaluation section: The 64.8% preference for Independent Deliberation is based on 18 evaluators and 54 comparisons. While useful for relative ranking of configurations, the study does not report inter-rater agreement with domain experts or accuracy against ground-truth pedagogical labels, so it cannot confirm that lower severity scores correspond to improved correctness rather than evaluator preference for milder outputs.
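One quick check, not reported in the manuscript, is whether 64.8% of 54 comparisons is distinguishable from chance; the sketch below assumes each comparison is an independent pairwise choice with a 50% chance level, which the paper does not state.

```python
# A quick check one could run (not reported in the manuscript): is 64.8% of 54
# comparisons distinguishable from chance? Assumes independent pairwise choices
# with a 50% chance level; with a three-way chance level of 1/3 the separation
# from chance would be larger.
from math import comb

def binom_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n = 54
k = round(0.648 * n)             # about 35 wins for Independent Deliberation
print(k, binom_tail(k, n, 0.5))  # one-sided p-value against a coin flip
```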
minor comments (2)
- [Abstract] The 1-7 severity scale is referenced repeatedly but never defined (e.g., what distinguishes level 3 from level 4). Adding an explicit rubric or example annotations would improve interpretability of the mean scores.
- A summary table listing the five jury agents, their individual evaluation criteria, and the meta-agent's synthesis rules would clarify the architecture and aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below with point-by-point responses, indicating planned revisions to improve transparency and rigor.
Point-by-point responses
- Referee: [Empirical Evaluation] Empirical Evaluation section: The headline result that 83.3% of excerpts are acceptable (mean severity 2.9/7) versus the baseline hinges on the heterogeneous jury plus meta-agent producing verdicts that align with actual pedagogical acceptability. No inter-agent consistency metrics, prompt details, or controls for evaluator-model biases are reported, leaving open whether the reduction in severity reflects genuine mitigation or collective leniency.
Authors: We agree that inter-agent consistency metrics, full prompt details, and explicit discussion of controls for model biases should have been included. In the revised manuscript we will add these elements: pairwise and overall agreement rates among the five jury agents, the complete prompt templates in an appendix, and a dedicated paragraph explaining how the heterogeneous personas plus meta-agent synthesis are intended to mitigate individual model priors. The human preference results provide supporting evidence against simple collective leniency, but we will also add an explicit limitations paragraph acknowledging that absolute alignment with expert pedagogical judgment cannot be proven without ground-truth labels. revision: partial
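The promised inter-agent consistency metrics could take a form like the following sketch, computed here on an invented verdict matrix rather than the authors' data.

```python
# Sketch of pairwise and overall agreement among the five jury agents on the
# binary acceptable/not-acceptable call. The verdict matrix is illustrative.
from itertools import combinations

# rows = excerpts, columns = the five jury agents; True = "acceptable"
verdicts = [
    [True,  True,  True,  False, True],
    [True,  True,  True,  True,  True],
    [False, False, True,  False, False],
    [True,  False, True,  True,  True],
]

def pairwise_agreement(col_a: int, col_b: int) -> float:
    return sum(row[col_a] == row[col_b] for row in verdicts) / len(verdicts)

per_pair = {pair: pairwise_agreement(*pair) for pair in combinations(range(5), 2)}
overall = sum(per_pair.values()) / len(per_pair)
print(per_pair)
print(f"mean pairwise agreement: {overall:.2f}")
```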
- Referee: [Source Attribution Protocol] Source Attribution Protocol subsection: This protocol is presented as the key safeguard against misattributing quoted sources and thereby reducing false positives. However, no external validation (e.g., agreement with historian annotations on the 270 excerpts or a held-out subset) is provided, which is load-bearing for attributing the observed improvement over the zero-shot baseline to the protocol rather than other system elements.
Authors: We acknowledge that external validation against historian annotations is absent from the current manuscript and represents a genuine limitation for isolating the protocol's contribution. The protocol is implemented as an explicit, auditable instruction within the agent prompts to separate quoted historical sources from textbook narrative. In the revision we will add a limitations section that clearly states this gap and, resources permitting, include preliminary agreement metrics on a small held-out subset of excerpts. We will also clarify in the text that the reported gains are attributable to the full architecture rather than any single component. revision: partial
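A hypothetical rendering of what an "explicit, auditable instruction" version of the protocol could look like inside a jury prompt; the wording and labels below are invented, not the paper's actual protocol text.

```python
# Hypothetical rendering of the "explicit, auditable instruction" idea: the
# attribution rules live as a fixed, versionable block inside every jury
# prompt. The wording and labels are invented, not the paper's protocol text.
SOURCE_ATTRIBUTION_PROTOCOL = """\
Attribution rules:
1. Text inside [QUOTED SOURCE] blocks is a reproduced historical document,
   not the textbook authors' own voice.
2. Evaluate bias, framing, and omissions ONLY in the textbook narrative.
3. A quoted source may itself be biased; penalize the textbook only if the
   narrative endorses that bias or fails to contextualize the quotation.
"""

def build_jury_prompt(persona: str, narrative: str, quotes: list[str]) -> str:
    quoted_block = "\n".join(f"[QUOTED SOURCE] {q}" for q in quotes)
    return (f"You are a {persona}.\n\n{SOURCE_ATTRIBUTION_PROTOCOL}\n"
            f"[TEXTBOOK NARRATIVE]\n{narrative}\n\n{quoted_block}\n\n"
            f"Rate bias severity from 1 to 7 and justify your rating.")

print(build_jury_prompt("multiperspectivity reviewer",
                        "The narrative frames the decree as universally welcomed.",
                        ["We hereby order the requisition of church property."]))
```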
- Referee: [Human Evaluation] Human Evaluation section: The 64.8% preference for Independent Deliberation is based on 18 evaluators and 54 comparisons. While useful for relative ranking of configurations, the study does not report inter-rater agreement with domain experts or accuracy against ground-truth pedagogical labels, so it cannot confirm that lower severity scores correspond to improved correctness rather than evaluator preference for milder outputs.
Authors: We will report inter-rater agreement statistics (e.g., Fleiss' kappa) among the 18 evaluators in the revised version. The evaluators were drawn from education and history backgrounds to the extent possible within the study constraints. We agree that accuracy against expert ground-truth labels would be stronger evidence; however, obtaining such labels for the full set of excerpts was outside the scope and resources of this work. The preference data is presented strictly as a relative utility measure. We will revise the discussion and conclusion to avoid any implication of proven correctness and to emphasize the relative improvement over the baseline. revision: partial
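Fleiss' kappa for the evaluator panel could be computed as in the sketch below; the rating matrix is invented, and the paper's actual layout of 18 evaluators over 54 comparisons may not give equal raters per item, which Fleiss' kappa requires.

```python
# Sketch of the Fleiss' kappa computation promised above, on an invented rating
# matrix (rows = comparisons, columns = which configuration a rater preferred).

def fleiss_kappa(counts: list[list[int]]) -> float:
    """counts[i][j] = number of raters assigning item i to category j;
    every item must have the same total number of raters."""
    n = sum(counts[0])   # raters per item
    N = len(counts)      # items
    k = len(counts[0])   # categories
    p_j = [sum(row[j] for row in counts) / (N * n) for j in range(k)]
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]
    P_bar = sum(P_i) / N
    P_e = sum(p * p for p in p_j)
    return (P_bar - P_e) / (1 - P_e)

# 6 comparisons, 3 raters each; categories = preferred configuration:
# [Independent Deliberation, heuristic variant, zero-shot baseline]
ratings = [
    [3, 0, 0],
    [3, 0, 0],
    [0, 3, 0],
    [2, 1, 0],
    [3, 0, 0],
    [0, 0, 3],
]
print(f"Fleiss' kappa: {fleiss_kappa(ratings):.2f}")
```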
Circularity Check
No circularity: empirical results from external textbook data and independent human evaluation
full rationale
The paper presents an agentic architecture and reports direct empirical outcomes from screening 270 excerpts drawn from Romanian upper-secondary history textbooks, plus a separate blind human preference study involving 18 evaluators and 54 comparisons. These measurements (83.3% acceptable rate, severity scores, and 64.8% preference) are obtained by applying the system to external source material and collecting external judgments; they are not obtained by fitting parameters to the target quantities and then re-deriving those same quantities, nor by any self-referential equations or self-citation chains that would force the reported numbers. The Source Attribution Protocol and jury synthesis are architectural components whose performance is evaluated against the external baseline and human data rather than being presupposed by the results themselves.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Multimodal agents can process textbook text and images to screen for bias indicators
- domain assumption Heterogeneous agents produce complementary evaluations that improve overall accuracy when synthesized
Reference graph
Works this paper leans on
- [1] Apple, M.: Official Knowledge: Democratic Education in a Conservative Age, 3rd edn. Routledge (2014). https://doi.org/10.4324/9780203814383
- [2] Bai, J., Bai, S., Yang, S., et al.: Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond (2023). https://doi.org/10.48550/arXiv.2308.12966
- [3] Chan, C.M., Chen, W., Su, Y., et al.: ChatEval: Towards better LLM evaluations via multi-agent debate. In: ICLR (2024). https://doi.org/10.48550/arXiv.2308.07201
- [4] Chen, W., Su, Y., Zuo, et al.: AgentVerse: Facilitating multi-agent collaboration and exploring emergent behaviors. In: ICLR (2024). https://doi.org/10.48550/arXiv.2308.10848
- [5] Deep Cogito: Cogito v2.1 671B model card. https://huggingface.co/deepcogito (2025)
- [6] DeepSeek-AI, et al.: DeepSeek-V3 technical report (2024). https://doi.org/10.48550/arXiv.2412.19437
- [7] Du, Y., Li, S., Torralba, A., Tenenbaum, et al.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023). https://arxiv.org/abs/2305.14325
- [8] Hsieh, C.P., Simig, D., et al.: RULER: What's the real context size of your long-context language models? In: Proceedings of EMNLP (2024). https://arxiv.org/abs/2404.06654
- [9] Ji, Z., Lee, N., Frieske, R., et al.: Survey of hallucination in natural language generation. ACM Computing Surveys 55(12), 1–38 (2023). https://doi.org/10.1145/3571730
- [10] Jiang, A.Q., Sablayrolles, A., Roux, A., et al.: Mixtral of Experts (2024). https://doi.org/10.48550/arXiv.2401.04088
- [11] Joshi, A., Kale, S., Chandel, et al.: Likert scale: Explored and explained. British Journal of Applied Science & Technology 7(4), 396–403 (2015). https://doi.org/10.9734/BJAST/2015/14975
- [12] Kasneci, E., Sessler, K., Küchemann, S., et al.: ChatGPT for good? On opportunities and challenges of large language models for education. Learning and Individual Differences 103, 102274 (2023). https://doi.org/10.1016/j.lindif.2023.102274
- [13] Kimi Team, Bai, Y., et al.: Kimi K2: Open agentic intelligence (2025). https://doi.org/10.48550/arXiv.2507.20534
- [14] Lewis, P., Perez, E., Piktus, A., et al.: Retrieval-augmented generation for knowledge-intensive NLP tasks. In: NeurIPS (2020). https://doi.org/10.48550/arXiv.2005.11401
- [15] Liang, P., Bommasani, R., Lee, T., et al.: Holistic evaluation of language models. In: TMLR (2023). https://doi.org/10.48550/arXiv.2211.09110
- [16] Liu, N.F., Lin, K., Chen, J., et al.: Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics 12, 157–173 (2024). https://doi.org/10.1162/tacl_a_00638
- [17] Liu, Y., Iter, D., Xu, Y., et al.: G-Eval: NLG evaluation using GPT-4 with better human alignment. In: Proceedings of EMNLP, pp. 2511–2522 (2023). https://doi.org/10.18653/v1/2023.emnlp-main.153
- [18] Luckin, R., Holmes, W., Griffiths, M., Forcier, L.B.: Intelligence Unleashed: An Argument for AI in Education. Pearson, London (2016)
- [19] Lucy, L., Demszky, D., Bromley, P., Jurafsky, D.: Content analysis of textbooks via natural language processing: Findings on gender, race, and ethnicity in Texas US history textbooks. AERA Open (2020). https://doi.org/10.1177/2332858420940312
- [20] Meta AI, et al.: Llama 4 Maverick. https://ai.meta.com/llama/ (2025)
- [21] Ministerul Educației: Press release no. 109/2025 on the establishment of working groups for the development of school curricula. https://www.edu.ro/press_rel_109_2025_grupuri_lucru_programe_scolare_inv_liceal (2025), accessed 2025-02-24
- [22] OpenAI: GPT-5.2 technical specifications. https://developers.openai.com/api/docs/models/gpt-5.2 (2026)
- [23] OpenAI, Agarwal, S., Ahmad, L., Ai, J., Altman, S., et al.: gpt-oss-120b & gpt-oss-20b model card (2025). https://doi.org/10.48550/arXiv.2508.10925
- [24] Pingel, F.: UNESCO Guide on Textbook Research and Textbook Revision. UNESCO, Paris (2010)
- [25] Preston, C.C., Colman, A.M.: Optimal number of response categories in rating scales: Reliability, validity, discriminating power, and respondent preferences. Acta Psychologica 104(1), 1–15 (2000). https://doi.org/10.1016/S0001-6918(99)00050-5
- [26] Röttger, P., Kirk, H.R., Vidgen, B., et al.: XSTest: A test suite for identifying exaggerated safety behaviours in large language models. In: NAACL (2024). https://doi.org/10.48550/arXiv.2308.01263
- [27] Stradling, R.: Multiperspectivity in History Teaching: A Guide for Teachers. Council of Europe Publishing (2003)
- [28] UNESCO: Guidance for generative AI in education and research. UNESCO Publishing, Paris (2023)
- [29] Wu, Q., Bansal, G., Zhang, et al.: AutoGen: Enabling next-gen LLM applications via multi-agent conversation. arXiv preprint arXiv:2308.08155 (2023). https://doi.org/10.48550/arXiv.2308.08155
- [30]
- [31] Zhai, X., Chu, X., Chai, C.S., et al.: A review of artificial intelligence (AI) in education from 2010 to 2020. Complexity 2021, 1–18 (2021). https://doi.org/10.1155/2021/8812542
- [32] Zhang, Y., Li, Y., Cui, L., et al.: Siren's song in the AI ocean: A survey on hallucination in large language models. Computational Linguistics (2025). https://doi.org/10.1162/COLI.a.16
- [33] Zheng, L., Chiang, W.L., Sheng, Y., Zhuang, et al.: Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In: Advances in Neural Information Processing Systems (NeurIPS), vol. 36 (2023). https://arxiv.org/abs/2306.05685
discussion (0)