Recognition: unknown
An Agentic Evaluation Architecture for Historical Bias Detection in Educational Textbooks
Pith reviewed 2026-05-10 17:09 UTC · model grok-4.3
The pith
A multi-agent AI architecture with a source attribution protocol reduces over-penalization when auditing history textbooks for bias.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The author claims that an agentic evaluation architecture, built around a heterogeneous jury of five evaluative agents plus a meta-agent for synthesis and supported by a Source Attribution Protocol that separates textbook narrative from quoted historical sources, produces more balanced judgments of pedagogical acceptability than single-model baselines. On the Romanian textbook corpus this yields an 83.3 percent acceptable rate at a mean severity of 2.9/7, versus 5.4/7 under zero-shot evaluation, and the Independent Deliberation configuration is chosen by humans in 64.8 percent of blind pairwise comparisons.
What carries the argument
The heterogeneous jury of five evaluative agents together with a meta-agent for verdict synthesis, enabled by the Source Attribution Protocol that prevents misattributing quoted historical sources as textbook narrative.
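The paper's implementation is not published; the following minimal sketch only illustrates the flow described above, with placeholder persona names, a stubbed `call_llm`, and an invented synthesis rule standing in for the meta-agent.

```python
# A minimal sketch, under assumptions, of the flow described above: five
# persona-conditioned jury agents plus a meta-agent synthesis step. The persona
# names, the stubbed call_llm, and the synthesis rule are placeholders, not the
# authors' implementation.
import re
import statistics

JURY_PERSONAS = [
    "historiographic-accuracy reviewer",
    "multiperspectivity reviewer",
    "framing-and-tone reviewer",
    "omission-and-selection reviewer",
    "pedagogical-suitability reviewer",
]

def call_llm(prompt: str) -> str:
    """Stand-in for the chat-completion call; returns a canned reply so the
    sketch runs end to end."""
    return "severity: 3"

def jury_verdicts(excerpt: str) -> list[int]:
    severities = []
    for persona in JURY_PERSONAS:
        prompt = (f"You are a {persona}. Rate the textbook excerpt below on a "
                  f"1-7 bias-severity scale. Reply as 'severity: <n>'.\n\n{excerpt}")
        reply = call_llm(prompt)
        severities.append(int(re.search(r"severity:\s*(\d)", reply).group(1)))
    return severities

def meta_agent(severities: list[int], acceptable_at_or_below: float = 3.0) -> dict:
    """Toy synthesis rule: mean jury severity plus an acceptability verdict."""
    mean_sev = statistics.mean(severities)
    return {"mean_severity": mean_sev,
            "acceptable": mean_sev <= acceptable_at_or_below,
            "jury_spread": max(severities) - min(severities)}

print(meta_agent(jury_verdicts("The chapter presents the 1918 union as uncontested.")))
```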
If this is right
- Textbooks can be audited at roughly two dollars each rather than requiring full manual review.
- Agentic deliberation lowers the mean severity score assigned to content and therefore reduces over-flagging of acceptable material.
- Human evaluators prefer the deliberative configuration over both heuristic variants and zero-shot prompting in blind tests.
- The architecture can serve as a scalable decision-support tool for ministries or boards that approve educational materials.
Where Pith is reading between the lines
- The same jury-plus-protocol design could be applied to textbooks in other national contexts or non-history subjects where selective framing is a concern.
- Escalation thresholds could be tuned so that only a small fraction of excerpts require human review, creating a practical hybrid workflow; a minimal sketch of such a rule follows this list.
- Repeated application across successive editions of a textbook series might reveal whether authors respond to the feedback by changing how they quote sources.
- Cross-lingual tests would show whether the Source Attribution Protocol continues to work when historical quotations appear in translation.
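An illustrative sketch of the kind of escalation rule meant in the hybrid-workflow point above; the cutoffs and scores below are invented, since the paper does not specify its escalation thresholds.

```python
# Illustrative hybrid escalation rule: route an excerpt to human review when
# mean jury severity or jury disagreement crosses a cutoff. Thresholds and the
# scored excerpts are made up for illustration.

def needs_human_review(mean_severity: float, jury_spread: int,
                       severity_cutoff: float = 5.0, spread_cutoff: int = 3) -> bool:
    return mean_severity >= severity_cutoff or jury_spread >= spread_cutoff

# Example: tune the cutoffs so only a small fraction of excerpts escalate.
scored = [
    {"id": 1, "mean_severity": 2.4, "jury_spread": 1},
    {"id": 2, "mean_severity": 5.6, "jury_spread": 2},
    {"id": 3, "mean_severity": 3.1, "jury_spread": 4},
]
escalated = [e["id"] for e in scored
             if needs_human_review(e["mean_severity"], e["jury_spread"])]
print(f"escalated {len(escalated)}/{len(scored)} excerpts: {escalated}")
```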
Load-bearing premise
The jury agents and meta-agent can combine their outputs into accurate verdicts on pedagogical acceptability without introducing systematic biases of their own, and the Source Attribution Protocol reliably separates textbook narrative from quoted sources in every case.
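To make the premise concrete, a naive sketch of what separating narrative from quoted sources could look like; the actual Source Attribution Protocol operates through prompt instructions rather than this quote-mark heuristic, which is illustration only.

```python
# Naive illustration only: splitting an excerpt into textbook narrative and
# quoted-source spans with a quote-mark heuristic. The real protocol works at
# the prompt level, and its assumed reliability "in every case" is exactly the
# load-bearing premise stated above.
import re

QUOTE_RE = re.compile(r'[„"“]([^"”]+)["”]')

def split_narrative_and_quotes(excerpt: str) -> tuple[str, list[str]]:
    quotes = QUOTE_RE.findall(excerpt)
    narrative = QUOTE_RE.sub("[QUOTED SOURCE]", excerpt)
    return narrative, quotes

excerpt = ('The textbook asserts the decision was inevitable. A 1918 proclamation '
           'is then quoted: "We declare the union of all Romanians."')
narrative, quotes = split_narrative_and_quotes(excerpt)
print(narrative)  # narrative with the quotation masked
print(quotes)     # quoted source text, to be attributed separately
```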
What would settle it
A follow-up study in which professional historians independently rate the same 270 excerpts for pedagogical acceptability and the resulting agreement rates with the agentic system are compared against agreement rates with the zero-shot baseline.
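The comparison such a study would produce reduces to agreement rates between historian labels and each system's verdicts on the same excerpts; a toy sketch on invented labels (Cohen's kappa would be the more standard statistic, but raw agreement keeps the illustration short):

```python
# Toy version of the proposed comparison: raw agreement between historian
# acceptability labels and each system's verdicts. All three label lists are
# invented for illustration.

def agreement(labels_a: list[bool], labels_b: list[bool]) -> float:
    assert len(labels_a) == len(labels_b)
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)

historian = [True, True, False, True, True, False, True, True]   # "acceptable?"
agentic   = [True, True, False, True, True, True,  True, True]
zero_shot = [False, True, False, False, True, True, False, True]

print("agentic vs. historians:  ", agreement(historian, agentic))    # 0.875
print("zero-shot vs. historians:", agreement(historian, zero_shot))  # 0.5
```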
Original abstract
History textbooks often contain implicit biases, nationalist framing, and selective omissions that are difficult to audit at scale. We propose an agentic evaluation architecture comprising a multimodal screening agent, a heterogeneous jury of five evaluative agents, and a meta-agent for verdict synthesis and human escalation. A central contribution is a Source Attribution Protocol that distinguishes textbook narrative from quoted historical sources, preventing the misattribution that causes systematic false positives in single-model evaluators. In an empirical study on Romanian upper-secondary history textbooks, 83.3% of 270 screened excerpts were classified as pedagogically acceptable (mean severity 2.9/7), versus 5.4/7 under a zero-shot baseline, demonstrating that agentic deliberation mitigates over-penalization. In a blind human evaluation (18 evaluators, 54 comparisons), the Independent Deliberation configuration was preferred in 64.8% of cases over both a heuristic variant and the zero-shot baseline. At approximately $2 per textbook, these results position agentic evaluation architectures as economically viable decision-support tools for educational governance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents an agentic evaluation architecture for auditing historical biases in educational textbooks. It comprises a multimodal screening agent, a heterogeneous jury of five evaluative agents, a meta-agent for verdict synthesis and human escalation, and a Source Attribution Protocol to distinguish textbook narrative from quoted historical sources. In an empirical study on 270 excerpts from Romanian upper-secondary history textbooks, the system classifies 83.3% as pedagogically acceptable (mean severity 2.9/7) versus 5.4/7 for a zero-shot baseline. A blind human evaluation (18 evaluators, 54 comparisons) finds the Independent Deliberation configuration preferred in 64.8% of cases. The work positions the approach as an economically viable (~$2 per textbook) decision-support tool for educational governance.
Significance. If the agentic components reliably track expert pedagogical judgment rather than model priors, this architecture could enable scalable, low-cost auditing of biases in history textbooks, a domain where manual review is resource-intensive. The Source Attribution Protocol directly targets a known failure mode in LLM evaluators, and the reported cost-effectiveness plus human preference data strengthen the case for practical utility. The empirical comparison to a zero-shot baseline provides a clear reference point for assessing the value added by deliberation.
major comments (3)
- [Empirical Evaluation] Empirical Evaluation section: The headline result that 83.3% of excerpts are acceptable (mean severity 2.9/7) versus the baseline hinges on the heterogeneous jury plus meta-agent producing verdicts that align with actual pedagogical acceptability. No inter-agent consistency metrics, prompt details, or controls for evaluator-model biases are reported, leaving open whether the reduction in severity reflects genuine mitigation or collective leniency.
- [Source Attribution Protocol] Source Attribution Protocol subsection: This protocol is presented as the key safeguard against misattributing quoted sources and thereby reducing false positives. However, no external validation (e.g., agreement with historian annotations on the 270 excerpts or a held-out subset) is provided, which is load-bearing for attributing the observed improvement over the zero-shot baseline to the protocol rather than other system elements.
- [Human Evaluation] Human Evaluation section: The 64.8% preference for Independent Deliberation is based on 18 evaluators and 54 comparisons. While useful for relative ranking of configurations, the study does not report inter-rater agreement with domain experts or accuracy against ground-truth pedagogical labels, so it cannot confirm that lower severity scores correspond to improved correctness rather than evaluator preference for milder outputs.
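One quick check, not reported in the manuscript, is whether 64.8% of 54 comparisons is distinguishable from chance; the sketch below assumes each comparison is an independent pairwise choice with a 50% chance level, which the paper does not state.

```python
# A quick check one could run (not reported in the manuscript): is 64.8% of 54
# comparisons distinguishable from chance? Assumes independent pairwise choices
# with a 50% chance level; with a three-way chance level of 1/3 the separation
# from chance would be larger.
from math import comb

def binom_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n = 54
k = round(0.648 * n)             # about 35 wins for Independent Deliberation
print(k, binom_tail(k, n, 0.5))  # one-sided p-value against a coin flip
```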
minor comments (2)
- [Abstract] The 1-7 severity scale is referenced repeatedly but never defined (e.g., what distinguishes level 3 from level 4). Adding an explicit rubric or example annotations would improve interpretability of the mean scores.
- A summary table listing the five jury agents, their individual evaluation criteria, and the meta-agent's synthesis rules would clarify the architecture and aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below with point-by-point responses, indicating planned revisions to improve transparency and rigor.
Point-by-point responses
- Referee: [Empirical Evaluation] Empirical Evaluation section: The headline result that 83.3% of excerpts are acceptable (mean severity 2.9/7) versus the baseline hinges on the heterogeneous jury plus meta-agent producing verdicts that align with actual pedagogical acceptability. No inter-agent consistency metrics, prompt details, or controls for evaluator-model biases are reported, leaving open whether the reduction in severity reflects genuine mitigation or collective leniency.
Authors: We agree that inter-agent consistency metrics, full prompt details, and explicit discussion of controls for model biases should have been included. In the revised manuscript we will add these elements: pairwise and overall agreement rates among the five jury agents, the complete prompt templates in an appendix, and a dedicated paragraph explaining how the heterogeneous personas plus meta-agent synthesis are intended to mitigate individual model priors. The human preference results provide supporting evidence against simple collective leniency, but we will also add an explicit limitations paragraph acknowledging that absolute alignment with expert pedagogical judgment cannot be proven without ground-truth labels. revision: partial
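The promised inter-agent consistency metrics could take a form like the following sketch, computed here on an invented verdict matrix rather than the authors' data.

```python
# Sketch of pairwise and overall agreement among the five jury agents on the
# binary acceptable/not-acceptable call. The verdict matrix is illustrative.
from itertools import combinations

# rows = excerpts, columns = the five jury agents; True = "acceptable"
verdicts = [
    [True,  True,  True,  False, True],
    [True,  True,  True,  True,  True],
    [False, False, True,  False, False],
    [True,  False, True,  True,  True],
]

def pairwise_agreement(col_a: int, col_b: int) -> float:
    return sum(row[col_a] == row[col_b] for row in verdicts) / len(verdicts)

per_pair = {pair: pairwise_agreement(*pair) for pair in combinations(range(5), 2)}
overall = sum(per_pair.values()) / len(per_pair)
print(per_pair)
print(f"mean pairwise agreement: {overall:.2f}")
```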
- Referee: [Source Attribution Protocol] Source Attribution Protocol subsection: This protocol is presented as the key safeguard against misattributing quoted sources and thereby reducing false positives. However, no external validation (e.g., agreement with historian annotations on the 270 excerpts or a held-out subset) is provided, which is load-bearing for attributing the observed improvement over the zero-shot baseline to the protocol rather than other system elements.
Authors: We acknowledge that external validation against historian annotations is absent from the current manuscript and represents a genuine limitation for isolating the protocol's contribution. The protocol is implemented as an explicit, auditable instruction within the agent prompts to separate quoted historical sources from textbook narrative. In the revision we will add a limitations section that clearly states this gap and, resources permitting, include preliminary agreement metrics on a small held-out subset of excerpts. We will also clarify in the text that the reported gains are attributable to the full architecture rather than any single component. revision: partial
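A hypothetical rendering of what an "explicit, auditable instruction" version of the protocol could look like inside a jury prompt; the wording and labels below are invented, not the paper's actual protocol text.

```python
# Hypothetical rendering of the "explicit, auditable instruction" idea: the
# attribution rules live as a fixed, versionable block inside every jury
# prompt. The wording and labels are invented, not the paper's protocol text.
SOURCE_ATTRIBUTION_PROTOCOL = """\
Attribution rules:
1. Text inside [QUOTED SOURCE] blocks is a reproduced historical document,
   not the textbook authors' own voice.
2. Evaluate bias, framing, and omissions ONLY in the textbook narrative.
3. A quoted source may itself be biased; penalize the textbook only if the
   narrative endorses that bias or fails to contextualize the quotation.
"""

def build_jury_prompt(persona: str, narrative: str, quotes: list[str]) -> str:
    quoted_block = "\n".join(f"[QUOTED SOURCE] {q}" for q in quotes)
    return (f"You are a {persona}.\n\n{SOURCE_ATTRIBUTION_PROTOCOL}\n"
            f"[TEXTBOOK NARRATIVE]\n{narrative}\n\n{quoted_block}\n\n"
            f"Rate bias severity from 1 to 7 and justify your rating.")

print(build_jury_prompt("multiperspectivity reviewer",
                        "The narrative frames the decree as universally welcomed.",
                        ["We hereby order the requisition of church property."]))
```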
- Referee: [Human Evaluation] Human Evaluation section: The 64.8% preference for Independent Deliberation is based on 18 evaluators and 54 comparisons. While useful for relative ranking of configurations, the study does not report inter-rater agreement with domain experts or accuracy against ground-truth pedagogical labels, so it cannot confirm that lower severity scores correspond to improved correctness rather than evaluator preference for milder outputs.
Authors: We will report inter-rater agreement statistics (e.g., Fleiss' kappa) among the 18 evaluators in the revised version. The evaluators were drawn from education and history backgrounds to the extent possible within the study constraints. We agree that accuracy against expert ground-truth labels would be stronger evidence; however, obtaining such labels for the full set of excerpts was outside the scope and resources of this work. The preference data is presented strictly as a relative utility measure. We will revise the discussion and conclusion to avoid any implication of proven correctness and to emphasize the relative improvement over the baseline. revision: partial
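Fleiss' kappa for the evaluator panel could be computed as in the sketch below; the rating matrix is invented, and the paper's actual layout of 18 evaluators over 54 comparisons may not give equal raters per item, which Fleiss' kappa requires.

```python
# Sketch of the Fleiss' kappa computation promised above, on an invented rating
# matrix (rows = comparisons, columns = which configuration a rater preferred).

def fleiss_kappa(counts: list[list[int]]) -> float:
    """counts[i][j] = number of raters assigning item i to category j;
    every item must have the same total number of raters."""
    n = sum(counts[0])   # raters per item
    N = len(counts)      # items
    k = len(counts[0])   # categories
    p_j = [sum(row[j] for row in counts) / (N * n) for j in range(k)]
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]
    P_bar = sum(P_i) / N
    P_e = sum(p * p for p in p_j)
    return (P_bar - P_e) / (1 - P_e)

# 6 comparisons, 3 raters each; categories = preferred configuration:
# [Independent Deliberation, heuristic variant, zero-shot baseline]
ratings = [
    [3, 0, 0],
    [3, 0, 0],
    [0, 3, 0],
    [2, 1, 0],
    [3, 0, 0],
    [0, 0, 3],
]
print(f"Fleiss' kappa: {fleiss_kappa(ratings):.2f}")
```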
Circularity Check
No circularity: empirical results from external textbook data and independent human evaluation
full rationale
The paper presents an agentic architecture and reports direct empirical outcomes from screening 270 excerpts drawn from Romanian upper-secondary history textbooks, plus a separate blind human preference study involving 18 evaluators and 54 comparisons. These measurements (83.3% acceptable rate, severity scores, and 64.8% preference) are obtained by applying the system to external source material and collecting external judgments; they are not obtained by fitting parameters to the target quantities and then re-deriving those same quantities, nor by any self-referential equations or self-citation chains that would force the reported numbers. The Source Attribution Protocol and jury synthesis are architectural components whose performance is evaluated against the external baseline and human data rather than being presupposed by the results themselves.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Multimodal agents can process textbook text and images to screen for bias indicators
- domain assumption Heterogeneous agents produce complementary evaluations that improve overall accuracy when synthesized
Reference graph
Works this paper leans on
- [1] Apple, M.: Official Knowledge: Democratic Education in a Conservative Age, 3rd edn. Routledge (2014). https://doi.org/10.4324/9780203814383
- [2] Bai, J., Bai, S., Yang, S., et al.: Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond (2023). https://doi.org/10.48550/arXiv.2308.12966
- [3] Chan, C.M., Chen, W., Su, Y., et al.: ChatEval: Towards better LLM evaluations via multi-agent debate. In: ICLR (2024). https://doi.org/10.48550/arXiv.2308.07201
- [4] Chen, W., Su, Y., Zuo, et al.: AgentVerse: Facilitating multi-agent collaboration and exploring emergent behaviors. In: ICLR (2024). https://doi.org/10.48550/arXiv.2308.10848
- [5] Deep Cogito: Cogito v2.1 671B model card. https://huggingface.co/deepcogito (2025)
- [6] DeepSeek-AI, et al.: DeepSeek-V3 technical report (2024). https://doi.org/10.48550/arXiv.2412.19437
- [7] Du, Y., Li, S., Torralba, A., Tenenbaum, et al.: Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023). https://arxiv.org/abs/2305.14325
- [8] Hsieh, C.P., Simig, D., et al.: RULER: What's the real context size of your long-context language models? In: Proceedings of EMNLP (2024). https://arxiv.org/abs/2404.06654
- [9] Ji, Z., Lee, N., Frieske, R., et al.: Survey of hallucination in natural language generation. ACM Computing Surveys 55(12), 1–38 (2023). https://doi.org/10.1145/3571730
- [10] Jiang, A.Q., Sablayrolles, A., Roux, A., et al.: Mixtral of Experts (2024). https://doi.org/10.48550/arXiv.2401.04088
- [11] Joshi, A., Kale, S., Chandel, et al.: Likert scale: Explored and explained. British Journal of Applied Science & Technology 7(4), 396–403 (2015). https://doi.org/10.9734/BJAST/2015/14975
- [12] Kasneci, E., Sessler, K., Küchemann, S., et al.: ChatGPT for good? On opportunities and challenges of large language models for education. Learning and Individual Differences 103, 102274 (2023). https://doi.org/10.1016/j.lindif.2023.102274
- [13] Kimi Team, Bai, Y., et al.: Kimi K2: Open agentic intelligence (2025). https://doi.org/10.48550/arXiv.2507.20534
- [14] Lewis, P., Perez, E., Piktus, A., et al.: Retrieval-augmented generation for knowledge-intensive NLP tasks. In: NeurIPS (2020). https://doi.org/10.48550/arXiv.2005.11401
- [15] Liang, P., Bommasani, R., Lee, T., et al.: Holistic evaluation of language models. In: TMLR (2023). https://doi.org/10.48550/arXiv.2211.09110
- [16] Liu, N.F., Lin, K., Chen, J., et al.: Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics 12, 157–173 (2024). https://doi.org/10.1162/tacl_a_00638
- [17] Liu, Y., Iter, D., Xu, Y., et al.: G-Eval: NLG evaluation using GPT-4 with better human alignment. In: Proceedings of EMNLP, pp. 2511–2522 (2023). https://doi.org/10.18653/v1/2023.emnlp-main.153
- [18] Luckin, R., Holmes, W., Griffiths, M., Forcier, L.B.: Intelligence Unleashed: An Argument for AI in Education. Pearson, London (2016)
- [19] Lucy, L., Demszky, D., Bromley, P., Jurafsky, D.: Content analysis of textbooks via natural language processing: Findings on gender, race, and ethnicity in Texas US history textbooks. AERA Open (2020). https://doi.org/10.1177/2332858420940312
- [20] Meta AI, et al.: Llama 4 Maverick. https://ai.meta.com/llama/ (2025)
- [21] Ministerul Educației: Press release no. 109/2025 on the establishment of working groups for the development of school curricula. https://www.edu.ro/press_rel_109_2025_grupuri_lucru_programe_scolare_inv_liceal (2025), accessed 2025-02-24
- [22] OpenAI: GPT-5.2 technical specifications. https://developers.openai.com/api/docs/models/gpt-5.2 (2026)
- [23] OpenAI, Agarwal, S., Ahmad, L., Ai, J., Altman, S., et al.: gpt-oss-120b & gpt-oss-20b model card (2025). https://doi.org/10.48550/arXiv.2508.10925
- [24] Pingel, F.: UNESCO Guide on Textbook Research and Textbook Revision. UNESCO, Paris (2010)
- [25] Preston, C.C., Colman, A.M.: Optimal number of response categories in rating scales: Reliability, validity, discriminating power, and respondent preferences. Acta Psychologica 104(1), 1–15 (2000). https://doi.org/10.1016/S0001-6918(99)00050-5
- [26] Röttger, P., Kirk, H.R., Vidgen, B., et al.: XSTest: A test suite for identifying exaggerated safety behaviours in large language models. In: NAACL (2024). https://doi.org/10.48550/arXiv.2308.01263
- [27] Stradling, R.: Multiperspectivity in History Teaching: A Guide for Teachers. Council of Europe Publishing (2003)
- [28] UNESCO: Guidance for generative AI in education and research. UNESCO Publishing, Paris (2023)
- [29] Wu, Q., Bansal, G., Zhang, et al.: AutoGen: Enabling next-gen LLM applications via multi-agent conversation. arXiv preprint arXiv:2308.08155 (2023). https://doi.org/10.48550/arXiv.2308.08155
- [30]
- [31] Zhai, X., Chu, X., Chai, C.S., et al.: A review of artificial intelligence (AI) in education from 2010 to 2020. Complexity 2021, 1–18 (2021). https://doi.org/10.1155/2021/8812542
- [32] Zhang, Y., Li, Y., Cui, L., et al.: Siren's song in the AI ocean: A survey on hallucination in large language models. Computational Linguistics (2025). https://doi.org/10.1162/COLI.a.16
- [33] Zheng, L., Chiang, W.L., Sheng, Y., Zhuang, et al.: Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In: Advances in Neural Information Processing Systems (NeurIPS), vol. 36 (2023). https://arxiv.org/abs/2306.05685
discussion (0)