Are Agents Ready to Teach? A Multi-Stage Benchmark for Real-World Teaching Workflows

Dayiheng Liu; Haobo Li; Huamin Qu; Jianhong Tu; Kashun Shum; Peng Liu; Rui Sheng; Xiaodong Deng; Zixin Chen

arXiv: 2605.14322 · v2 · pith:KD5A7D35 · submitted 2026-05-14 · 💻 cs.AI

Pith reviewed 2026-05-22 10:24 UTC · model grok-4.3

classification 💻 cs.AI

keywords: tutor agents · AI benchmarking · pedagogical judgment · teaching workflows · multi-turn tutoring · educational AI · language agents · situated tutoring

The pith

Current AI tutor agents can make bounded pedagogical judgments but fall short in situated multi-turn tutoring and in completing full teaching workflows.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces EduAgentBench, a benchmark of 150 tasks that tests language agents on the complete set of teaching responsibilities. Tasks span professional decisions about teaching methods, adapting help across multiple conversation turns with a learner, and finishing workflows inside systems like Canvas. Evaluations of leading models show they manage limited judgment calls but cannot yet sustain effective tutoring or handle independent teaching duties at professional levels. This matters because agents are already being used in education, yet no prior test measured whether they can perform the full job. The work supplies a grounded way to track progress toward agents that could actually assist real teaching.

Core claim

EduAgentBench evaluates tutor agents across three surfaces: professional pedagogical judgment, situated multi-turn tutoring, and Canvas-style teaching workflow completion. Frontier models demonstrate capability in bounded pedagogical judgment but still fall short of professional teaching standards in situated tutoring and autonomous teaching-workflow execution.

What carries the argument

EduAgentBench, a source-grounded benchmark of 150 quality-controlled tasks built through a pedagogical-insight-driven pipeline and checked with multiple verification signals plus human review.

Load-bearing premise

The 150 tasks capture the full scope of real-world teaching work and professional standards.

What would settle it

Follow-up tests in which models reach or exceed professional-level scores on the situated tutoring and workflow tasks would challenge the shortfall finding.

Figures

Figures reproduced from arXiv: 2605.14322 by Dayiheng Liu, Haobo Li, Huamin Qu, Jianhong Tu, Kashun Shum, Peng Liu, Rui Sheng, Xiaodong Deng, Zixin Chen.

**Figure 1.** Source-grounded benchmark design and verifier-matched evaluation. EduAgentBench decomposes teacher-level readiness into three stage-specific measurement contracts. Each stage starts from a target educational insight, grounds that insight in external educational sources, deterministic course data, or Canvas-style environment state, and attaches verifiers matched to the evidence the task leaves behind. Human…

**Figure 2.** Verifier-level trajectory for MM-04. The task requires an evidence-to-action chain: retrieve the correct historical quiz, compute KC weaknesses, inspect the assigned deck, edit the existing teaching artifact, create a targeted quiz, and communicate the intervention. Step-level pass/fail markers show where representative model trajectories satisfy or miss the teaching-work contract.

**Figure 3.** Artifact-level contrast in MM-04. The source state defines the verifiable target: historical KC weaknesses and the existing slide deck. GPT-5.5 turns that evidence into a targeted slide-and-quiz intervention, while GLM-5.1 produces plausible materials but leaves the required deck unchanged and creates a generic practice artifact. The case illustrates the design principle behind EduAgentBench. A realistic t…
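
The step-level pass/fail markers described for MM-04 can be pictured as an ordered checklist evaluated against whatever the agent leaves behind. The following is a minimal Python sketch under stated assumptions, not the paper's verifier code: the step names paraphrase the Figure 2 caption, while the `Evidence` flags, the `Step` class, and the all-or-nothing pass rule are hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical summary of the state an agent leaves behind; keys are illustrative.
Evidence = Dict[str, bool]


@dataclass(frozen=True)
class Step:
    name: str
    check: Callable[[Evidence], bool]


def flag(key: str) -> Callable[[Evidence], bool]:
    """Build a check that passes only when the evidence flag is present and True."""
    return lambda evidence: evidence.get(key, False)


# Step names paraphrase the Figure 2 caption; the flag names are assumptions.
MM04_STEPS: List[Step] = [
    Step("retrieve correct historical quiz", flag("quiz_retrieved")),
    Step("compute KC weaknesses", flag("kc_weaknesses_computed")),
    Step("inspect assigned deck", flag("deck_inspected")),
    Step("edit existing teaching artifact", flag("deck_edited")),
    Step("create targeted quiz", flag("targeted_quiz_created")),
    Step("communicate intervention", flag("intervention_communicated")),
]


def verify(evidence: Evidence, steps: List[Step] = MM04_STEPS) -> List[Tuple[str, bool]]:
    """Return a pass/fail marker for every step, in order."""
    return [(step.name, step.check(evidence)) for step in steps]


def task_passed(results: List[Tuple[str, bool]]) -> bool:
    """Assumed all-or-nothing rule: the task passes only if every step passes."""
    return all(ok for _, ok in results)


# Illustrative trajectory: plausible materials, but the required deck is never edited
# and no targeted quiz is created.
partial = {
    "quiz_retrieved": True,
    "kc_weaknesses_computed": True,
    "deck_inspected": True,
    "deck_edited": False,
    "targeted_quiz_created": False,
    "intervention_communicated": True,
}
results = verify(partial)
failed = [name for name, ok in results if not ok]
print(failed)
print(task_passed(results))
```

Keeping one marker per step, rather than a single task-level score, is what lets a reader see where in the chain a trajectory breaks.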

Original abstract

Language agents are increasingly deployed in complex professional workflows, with tutoring emerging as a particularly high-stakes capability that remains largely unmeasured in existing benchmarks. Effective tutor agents require more than producing correct answers or executing accurate tool calls: a robust tutor must diagnose learner state, adapt support over time, make pedagogically justified decisions grounded in educational evidence, and execute interventions within realistic learning-management systems. We introduce EduAgentBench, a source-grounded benchmark for holistically evaluating tutor agents across the full scope of teaching work. It contains 150 quality-controlled tasks across three capability surfaces: professional pedagogical judgment, situated multi-turn tutoring, and Canvas-style teaching workflow completion. Tasks are constructed through a pedagogical-insight-driven pipeline and evaluated with complementary verification signals and human review. Across a comprehensive evaluation of frontier models, our findings reveal that current models are generally capable of bounded pedagogical judgment, but still fall short of professional teaching standards in situated tutoring and autonomous teaching-workflow execution. To our knowledge, EduAgentBench is the first theory-grounded and realistic benchmark for evaluating the holistic teaching capability of tutor agents, providing a measurement foundation for developing future tutor agents that can support realistic teaching work.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces EduAgentBench, a source-grounded benchmark with 150 quality-controlled tasks spanning professional pedagogical judgment, situated multi-turn tutoring, and Canvas-style teaching workflow completion. Tasks are built via a pedagogical-insight-driven pipeline with complementary verification signals and human review. Evaluation of frontier models leads to the claim that current agents show bounded pedagogical judgment but fall short of professional standards in situated tutoring and autonomous workflow execution. The work positions the benchmark as the first theory-grounded and realistic evaluation for holistic tutor-agent capabilities.

Significance. If the tasks prove representative of real-world teaching, the benchmark fills an important gap by moving beyond answer correctness and tool-call accuracy to measure whether agents diagnose learner state, adapt support, and execute interventions in realistic systems. The multi-stage design and emphasis on pedagogical grounding could provide a useful measurement foundation for future agent development in education. The absence of fitted parameters or circular reductions in the reported shortfalls is a positive feature.

major comments (2)

[Abstract / benchmark construction] Abstract and benchmark-construction section: the claim that the 150 tasks capture the full scope of professional teaching standards rests on a 'pedagogical-insight-driven pipeline' verified by 'complementary signals plus human review,' yet no explicit mapping is provided from classroom observations, teacher logs, or validated competency frameworks (e.g., Danielson or InTASC standards) to the task set. Without such grounding or inter-rater agreement metrics, observed shortfalls in situated tutoring could reflect benchmark-specific constraints rather than general gaps versus professional practice.
[Evaluation / results] Evaluation section: the central finding that models 'fall short of professional teaching standards' in multi-turn tutoring and workflow execution is load-bearing for the paper's conclusions, but the abstract supplies no quantitative details on how pedagogical evidence is operationalized or on human-review reliability, leaving the external-validity claim vulnerable to unexamined construction choices.

minor comments (2)

[Benchmark description] Clarify the exact number of tasks per capability surface and any stratification by subject or learner level to aid reproducibility.
[Methods] Add a table summarizing inter-rater agreement or verification-signal agreement rates if collected during human review.
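
One statistic such a table could report is Cohen's kappa, which corrects raw agreement between two reviewers for agreement expected by chance. The sketch below assumes each of two reviewers assigned one categorical label (for example pass/fail) per task; the function name and the sample labels are illustrative and are not data from the paper.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; kappa is undefined,
        # and returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Made-up pass (1) / fail (0) labels for eight tasks from two hypothetical reviewers.
reviewer_1 = [1, 1, 0, 1, 0, 0, 1, 1]
reviewer_2 = [1, 0, 0, 1, 0, 1, 1, 1]
kappa = cohens_kappa(reviewer_1, reviewer_2)
print(round(kappa, 3))
```

With these made-up labels the reviewers agree on 6 of 8 tasks (0.75), chance agreement is 34/64, and kappa works out to 7/15, or about 0.467.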

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the grounding and transparency of EduAgentBench. We respond to each major comment below.

Point-by-point responses

Referee: [Abstract / benchmark construction] Abstract and benchmark-construction section: the claim that the 150 tasks capture the full scope of professional teaching standards rests on a 'pedagogical-insight-driven pipeline' verified by 'complementary signals plus human review,' yet no explicit mapping is provided from classroom observations, teacher logs, or validated competency frameworks (e.g., Danielson or InTASC standards) to the task set. Without such grounding or inter-rater agreement metrics, observed shortfalls in situated tutoring could reflect benchmark-specific constraints rather than general gaps versus professional practice.

Authors: We agree that an explicit mapping to validated frameworks such as Danielson or InTASC would strengthen claims of alignment with professional standards. Our pipeline was developed from core pedagogical principles in the education literature and validated via complementary signals and human review, but the current manuscript does not include a direct correspondence table or inter-rater agreement metrics. We will revise the benchmark-construction section to add this mapping and report agreement statistics for the human review.
revision: yes

Referee: [Evaluation / results] Evaluation section: the central finding that models 'fall short of professional teaching standards' in multi-turn tutoring and workflow execution is load-bearing for the paper's conclusions, but the abstract supplies no quantitative details on how pedagogical evidence is operationalized or on human-review reliability, leaving the external-validity claim vulnerable to unexamined construction choices.

Authors: The abstract is intentionally concise. Full details on the operationalization of pedagogical evidence and human-review reliability appear in the Evaluation and Benchmark Construction sections. To improve transparency, we will revise the abstract to include key quantitative reliability indicators and a brief statement on operationalization.
revision: partial

Circularity Check

0 steps flagged

New benchmark evaluated on external models; no internal reductions or self-referential derivations.

full rationale

The paper introduces EduAgentBench as a new source-grounded benchmark with 150 tasks built via a pedagogical-insight-driven pipeline and verified by complementary signals plus human review. Central claims about model capabilities and shortfalls are derived from direct evaluations of frontier models on these tasks. No equations, fitted parameters, or self-citation chains reduce the reported results to quantities defined inside the paper. The construction pipeline and verification are presented as independent of the evaluation outcomes, making the work self-contained against external benchmarks. This yields only a minor score for the inherent assumption that the 150 tasks represent professional standards, which does not constitute circularity per the defined patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that selected tasks and verification signals validly represent professional teaching standards; no free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption: Pedagogical judgment, multi-turn tutoring, and workflow execution can be decomposed into discrete, evaluable tasks that reflect real professional standards.
Invoked when constructing the 150 tasks through a pedagogical-insight-driven pipeline.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: We introduce EduAgentBench, a source-grounded benchmark ... 150 quality-controlled tasks across three capability surfaces: professional pedagogical judgment, situated multi-turn tutoring, and Canvas-style teaching workflow completion.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: Tasks are constructed through a pedagogical-insight-driven pipeline and evaluated with complementary verification signals and human review.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

