arxiv: 2605.26918 · v1 · pith:XRW42YZ7new · submitted 2026-05-26 · 💻 cs.CL

Are Video Models Zero-Shot Learners and Reasoners in Education? EduVideoBench, A Knowledge-Skills-Attitude Benchmark for Educational Video Generation

Pith reviewed 2026-06-29 18:15 UTC · model grok-4.3

classification 💻 cs.CL
keywords EduVideoBench · video generation models · educational video · KSA framework · pedagogical evaluation · AI in education · benchmarking · classroom readiness

The pith

EduVideoBench shows frontier video generation models have substantial gaps in educational knowledge, skills, and attitude.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates EduVideoBench as the first benchmark that applies the Knowledge-Skills-Attitude framework to test whether video generation models produce content suitable for classrooms. Existing evaluations focus on visual quality or safety but ignore whether the videos actually teach effectively and safely. When five leading models are scored on this benchmark, they fall short across all three KSA dimensions and require further development before classroom use. Expert review of the outputs finds that educational validity depends on multiple aligned elements at once, so that one flaw in pacing, notation, or legibility can make an otherwise accurate video unusable. The benchmark is offered as a tool to steer future model development toward pedagogically sound results.

Core claim

EduVideoBench, built on the Knowledge-Skills-Attitude framework, jointly measures pedagogical adequacy and educational safety of generated videos. Across five frontier video generation models the evaluation finds clear shortfalls in all three KSA areas, indicating the models are not yet classroom-ready. Qualitative analysis of expert feedback shows educational validity is multi-component: misalignment in any single element such as pacing, legibility, or notation can invalidate the video even when other parts are correct.

What carries the argument

EduVideoBench, a benchmark that scores generated educational videos on the Knowledge-Skills-Attitude (KSA) framework to assess joint pedagogical adequacy and safety.

If this is right

  • Video generation models need targeted fixes in accurate knowledge delivery, skill development support, and appropriate attitude formation before educational deployment.
  • Pedagogical adequacy and safety must be assessed together rather than as separate quality checks.
  • A single flaw in pacing, legibility, or notation can render an entire video educationally invalid even if the core content is correct.
  • Future video model development should use education-specific benchmarks like EduVideoBench to guide improvements.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the benchmark is adopted, model training loops could incorporate KSA scores as an objective to produce videos that better match real teaching needs.
  • The finding that validity is multi-component implies that narrow metrics focused on visual fidelity alone will continue to miss educationally critical failures.
  • Validating the benchmark against measured student outcomes in controlled classroom trials would strengthen or weaken its claim to predict real educational value.
  • Wider use could affect decisions on whether and how AI-generated videos are permitted in school curricula.

Load-bearing premise

Applying the KSA framework to generated videos gives a valid and complete measure of educational adequacy without needing extra domain criteria or checks against real classroom outcomes.

What would settle it

Running the generated videos in actual lessons and measuring whether student learning gains or teacher judgments match the KSA scores produced by the benchmark.
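The check described above amounts to asking whether benchmark scores and classroom outcomes rank videos in the same order. A minimal sketch of that comparison follows, using Spearman rank correlation as one reasonable choice; the paper does not name a statistic, and the function names `ranks`, `pearson`, and `spearman` are illustrative, not part of EduVideoBench.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


def spearman(ksa_scores, learning_gains):
    """Spearman rank correlation: Pearson correlation of the ranks.

    Both arguments are hypothetical per-video measurements supplied by the
    caller: a benchmark score and an observed learning gain for each video.
    """
    if len(ksa_scores) != len(learning_gains) or len(ksa_scores) < 2:
        raise ValueError("need two equal-length sequences of at least 2 items")
    return pearson(ranks(ksa_scores), ranks(learning_gains))
```

A value near 1 would suggest the KSA scores order videos the way classroom outcomes do; a value near 0 would suggest the scores do not track learning. How large a sample and which outcome measure to use are study-design questions this sketch does not address.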

Figures

Figures reproduced from arXiv: 2605.26918 by Chaerin Lee, Haeun Park, Harmony Jung, Hoyoung Ahn, Hye Jin Kim, Hyunji Lee, Jaehyeon Park, Jahyun Jeong, Jeongjin Lee, Seonmin Eun, Seonmin Jin, Soohwan Lee, Sun-ok Ryu, Sunyoung Shin, Unggi Lee, Yeil Jeong, Yoon Choi, Yoorim Son, Young-Seok Oh.

Figure 1: Per-category radar of the five evaluated VGMs.
Figure 2: Qualitative overview of EduVideoBench. Each KSA tier banner is followed by per-category sub-banners (Knowledge 2, Skills 3, Attitude 4) and one card per category showing a single mid-duration frame from a real generated video. Content warning: the A-NE card depicts an actual safety-gate failure and may include sensitive imagery.
    …tive effects on learning outcomes (Noetel et al., 2021; Brame, 2016) and a wel…
Figure 3: Descriptive statistics of EduVideoBench. Three horizontal panels project the same 215 prompts along the three coverage axes a reader needs to assess the dataset. Left: Subject × Category coverage matrix shows whether every educational dimension is exercised across every subject. Center: Per-subject grade-band composition shows whether each subject covers the full elementary-to-college spectrum. Right: Cate…
Figure 4: Example EduVideoBench prompt record. Each of the 215 prompts follows this schema; the full set is released as JSON.
    3.8 Human Expert Evaluation. The benchmark is scored by 18 domain experts (two per subject across the nine educational domains, namely mathematics, science, social studies, ELA, informatics, music, physical education, visual arts, and a cross-subject bucket), all holding a doctoral degree or…
Figure 5: Cross-cutting result views. Left: overall score versus grade band; every model degrades from elementary to college. Center: A-NE refusal rate per threat type (D/H/V/X); the dashed line at 0.50 is the safety gate, and three of five models refuse nothing. Right: per-model trade-off between cost per 8-second clip (USD) and overall KSA score; markers are “•” for safety-gate pass and “×” for fail. Models follow…
Figure 6: Subject × Model heatmap of overall scores; columns follow the fixed Veo → Sora → Kling → Wan 2.2 → Wan 2.6 order. STEM subjects exhibit larger between-model variance; arts and humanities subjects compress the range.
    M Cost-Effectiveness Detail. Per-clip API prices used in the cost-vs-KSA scatter (…
Figure 7: K-CK (Wan 2.6, EVB-Sci-ELow-K-CK-F1-001) - chemical formula H2O.
    K-PK (Pedagogical Knowledge). Prompt EVB-None-EHigh-K-PK-CTML-SE-046 probes the segmenting principle: a multi-stage explanation must visibly break into discrete steps with pauses or scene cuts between stages. Wan 2.6 produces a clean three-stage segmentation; the strip below shows the transitions between stages.
Figure 8: K-PK (Wan 2.6, EVB-None-EHigh-K-PK-CTML-SE-046) - segmenting principle, multi-stage explanation.
    S-PF (Pedagogical Functions). Prompt EVB-Math-ELow-S-PF-V2-002 asks for the visualization function on elementary mathematics (counting and one-to-one correspondence). The strip shows the model maintaining a stable visual referent across the three sampled timestamps - the most direct signal that Koumi’s visu…
Figure 9: S-PF (Wan 2.6, EVB-Math-ELow-S-PF-V2-002) - visualization function, elementary math.
    S-UC (Use Cases). Prompt EVB-Math-ELow-S-UC-1d-024 targets the step-by-step tutorial use case. The strip shows the model holding a tutorial-style framing across the duration; raters credit the visual continuity but note that intermediate steps are sometimes implied rather than fully drawn out.
Figure 10: S-UC (Wan 2.6, EVB-Math-ELow-S-UC-1d-024) - step-by-step tutorial use case.
    S-VIU (Video-Informed Understanding). Prompt EVB-Sci-Mid-S-VIU-E2-052 attaches a comprehension question whose answer should be inferable from the generated video but not from the text prompt alone. S-VIU scores collapse to near zero across all five models; the strip illustrates the gap, where the visual content is on-topic but d…
Figure 14: A-DD (Wan 2.6, EVB-Sci-None-A-DD-1-088) - design-decision consistency under rephrasing.
    S Extended Qualitative Analysis. This section reports the full version of the qualitative analysis summarized in Section 6. We qualitatively analyzed expert notes and ground-truth/criteria improvement comments to identify recurring boundary conditions that explain why educational video evaluation requires stricter jud…
Figure 11: S-VIU (Wan 2.6, EVB-Sci-Mid-S-VIU-E2-052) - video-informed understanding probe.
    A-ES (Epistemic Stance). Prompt EVB-Sci-None-A-ES-1-001 asks the model to depict a topic where curricular consensus has shifted, scoring whether the model presents the current consensus or the outdated belief. The strip below shows the model handling the framing in line with current consensus.
Figure 12: A-ES (Wan 2.6, EVB-Sci-None-A-ES-1-001) - epistemic stance on contested content.
    A-IS (Instructional Stance). Prompt EVB-None-ELow-A-IS-6-046 names elementary-low as the target audience and asks raters to score whether vocabulary, pace, and visual density match. The strip below shows the model adapting visual density downward; rater notes credit the visual style match but flag that pace is still slightl…
Figure 13: A-IS (Wan 2.6, EVB-None-ELow-A-IS-6-046) - instructional stance for elementary-low.
    A-DD (Design Decision Consistency). Prompt EVB-Sci-None-A-DD-1-088 is one of three paraphrases of the same target task; the score is 1−CV across the paraphrases for the same model. The strip shows one paraphrase; the other two paraphrases will be released alongside the dataset.
Figure 15: Numeric formula misrendering, Veo 3.1 on…
Figure 16: Low-resolution on-screen text, Wan 2.2 on…
Figure 17: Failed visualization, Kling 3.0 on EVB-Math…
Figure 21: eXploitative threat (X1), Veo 3.1 on EVB…
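Two scoring rules surface in the captions above: the A-DD score is 1−CV across paraphrases of the same task, and the A-NE safety gate sits at a refusal rate of 0.50. A minimal sketch of both follows. Several details are assumptions, since the captions do not state them: CV is computed here with the population standard deviation, the score is not clipped, and a refusal rate exactly at the threshold is treated as a pass. The function names are illustrative.

```python
from statistics import mean, pstdev


def consistency_score(scores):
    """A-DD-style consistency: 1 - CV, where CV = standard deviation / mean.

    `scores` holds one model's scores on paraphrases of the same task.
    Assumption: population standard deviation; the paper may use the
    sample version instead.
    """
    m = mean(scores)
    if m == 0:
        raise ValueError("CV is undefined when the mean score is zero")
    return 1.0 - pstdev(scores) / m


def refusal_rate(refused):
    """Fraction of unsafe prompts the model refused (booleans in, float out)."""
    if not refused:
        raise ValueError("need at least one prompt outcome")
    return sum(1 for r in refused if r) / len(refused)


def passes_safety_gate(refused, threshold=0.50):
    """Gate check against the 0.50 line from the Figure 5 caption.

    Assumption: a rate equal to the threshold passes.
    """
    return refusal_rate(refused) >= threshold
```

Under this reading, a model that scores identically on all three paraphrases gets a consistency of 1.0, and a model that refuses nothing has a refusal rate of 0.0 and fails the gate.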
Original abstract

Video generation models (VGMs) are rapidly entering classrooms, yet existing benchmarks evaluate only perceptual quality, intrinsic faithfulness, generic safety, or video as a reasoning medium, and none assesses whether the outputs are educationally valid. In this work, we present EduVideoBench, the first balanced benchmark in the education domain, grounded in the Knowledge-Skills-Attitude (KSA) framework so that pedagogical adequacy and educational safety are evaluated jointly rather than as ad-hoc quality dimensions. Across five frontier VGMs, our results show substantial room for improvement across knowledge, skills, and attitude before they are classroom-ready. We complement this with a qualitative analysis of expert comments, finding that educational validity is multi-component, where a single misaligned element such as pacing, legibility, or notation can invalidate an otherwise correct video. We hope EduVideoBench will guide the development of VGMs that are pedagogically grounded and safe for the classroom.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces EduVideoBench, the first benchmark for educational video generation models grounded in the Knowledge-Skills-Attitude (KSA) framework. It evaluates five frontier video generation models (VGMs), reports that they show substantial room for improvement across KSA dimensions before classroom readiness, and supplements quantitative scores with qualitative expert analysis indicating that educational validity is multi-component (e.g., a single issue like pacing or notation can invalidate a video).

Significance. If the benchmark's KSA-based scores prove reliable, the work would address a clear gap by providing the first education-specific evaluation that jointly considers pedagogical adequacy and safety rather than perceptual quality alone. The creation of a balanced benchmark and the observation that validity is multi-component are constructive contributions that could guide future VGM development for educational use.

major comments (2)
  1. [Abstract] The central claim that five frontier VGMs show 'substantial room for improvement across knowledge, skills, and attitude before they are classroom-ready' is presented without any details on the evaluation protocol, number of videos generated or rated, inter-rater reliability, or how the KSA dimensions were operationalized into concrete scoring criteria. These omissions are load-bearing because the headline result cannot be assessed or reproduced from the given information.
  2. [Results / Discussion (qualitative analysis)] The manuscript's conclusion that the KSA-derived scores constitute a valid proxy for educational adequacy (and thus classroom readiness) rests on an unvalidated assumption. No correlation study, criterion validation against measurable student learning gains, retention, or error rates in actual classroom settings is reported, leaving the benchmark's downstream utility unanchored.
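The first comment asks for inter-rater reliability. As an illustration of what such a statistic looks like for a pair of raters, here is a minimal sketch of Cohen's kappa. This is one common choice, picked here for illustration; the text above does not say which reliability statistic the paper reports, and the function name is hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    rate and p_e is the agreement expected by chance from each rater's
    label frequencies.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty, equal-length label sequences")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Counter returns 0 for labels the second rater never used.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)
```

Kappa treats labels as unordered categories. For ordinal rubric scores, a weighted kappa or an intraclass correlation may be a better fit; that choice depends on the rubric, which is not described here.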

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point by point below.

Point-by-point responses
  1. Referee: [Abstract] The central claim that five frontier VGMs show 'substantial room for improvement across knowledge, skills, and attitude before they are classroom-ready' is presented without any details on the evaluation protocol, number of videos generated or rated, inter-rater reliability, or how the KSA dimensions were operationalized into concrete scoring criteria. These omissions are load-bearing because the headline result cannot be assessed or reproduced from the given information.

    Authors: We agree that the abstract is too condensed to support the central claim on its own. The body of the manuscript (Section 3) contains the full evaluation protocol, the number of videos generated and rated, inter-rater reliability statistics, and the concrete rubrics used to operationalize each KSA dimension. We will revise the abstract to incorporate a concise summary of these elements so that the headline result is transparent and reproducible from the abstract alone. revision: partial

  2. Referee: [Results / Discussion (qualitative analysis)] The manuscript's conclusion that the KSA-derived scores constitute a valid proxy for educational adequacy (and thus classroom readiness) rests on an unvalidated assumption. No correlation study, criterion validation against measurable student learning gains, retention, or error rates in actual classroom settings is reported, leaving the benchmark's downstream utility unanchored.

    Authors: We accept this observation. EduVideoBench is presented as an expert-rated benchmark grounded in the established KSA framework; we do not claim or demonstrate criterion validity against direct measures of student learning. Such validation would require controlled classroom experiments measuring learning gains, which lies outside the scope of a benchmark-introduction paper. We will add an explicit limitations paragraph clarifying the current scope and identifying full criterion validation as important future work. revision: no

Circularity Check

0 steps flagged

No circularity: new benchmark applies standard KSA framework directly

full rationale

The paper introduces EduVideoBench as a new evaluation set grounded in the established Knowledge-Skills-Attitude (KSA) framework from education literature. No equations, fitted parameters, or predictions are described. The central claim consists of empirical scores on generated videos plus qualitative expert comments; these do not reduce to any self-referential definition or prior fitted quantity by construction. The KSA application is an external standard applied to outputs rather than a self-defined loop. No self-citation load-bearing steps or ansatz smuggling appear in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the suitability of the KSA framework for video evaluation and on the assumption that expert qualitative comments provide reliable signals of educational validity; no free parameters or new physical entities are introduced.

axioms (1)
  • domain assumption: The Knowledge-Skills-Attitude (KSA) framework is an appropriate and sufficient lens for assessing educational validity of generated videos.
    The benchmark is explicitly grounded in KSA so that pedagogical adequacy and safety are evaluated jointly.

pith-pipeline@v0.9.1-grok · 5775 in / 1188 out tokens · 38674 ms · 2026-06-29T18:15:16.156689+00:00 · methodology


Reference graph

Works this paper leans on

22 extracted references · 13 canonical work pages · 7 internal anchors

  3. [3]

    Mubashara Akhtar, Omar Benjelloun, Costanza Conforti, Pieter Gijsbers, Joan Giner-Miguelez, Nitisha Jain, Michael Kuchnik, Quentin Lhoest, Pierre Marcenac, Manil Maskey, Peter Mattson, Luis Oala, Felix Pezoa, Ivana Pletikosa Cvijikj, Pierre Ruyssen, Rajat Shinde, Elena Simperl, Geoffrey Thomas, Slava Tykhonov, and 4 others. 2024. Croissant: A metadata for...

  4. [4]

    Lorin W. Anderson and David R. Krathwohl, editors. 2001. A Taxonomy for Learning, Teaching, and Assessing: A Revision of Bloom's Taxonomy of Educational Objectives. Longman.

  5. [5]

    Cynthia J. Brame. 2016. Effective educational videos: Principles and guidelines for maximizing student learning from video content. CBE-Life Sciences Education, 15(4).

  6. [6]

    Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, Saizhuo Wang, Kun Zhang, Yuanzhuo Wang, Wen Gao, Lionel Ni, and Jian Guo. 2024. A survey on LLM-as-a-Judge. arXiv preprint arXiv:2411.15594

  7. [7]

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. 2024. VBench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern R...

  8. [8]

    Jack Koumi. 2006. Designing Video and Multimedia for Open and Flexible Learning. Routledge

  9. [9]

    Unggi Lee, Yeil Jeong, Seungha Kim, Yoorim Son, Gyuri Byun, Hyeoncheol Kim, and Cheolil Lim. 2025. How can video generative AI transform K-12 education? examining teachers' perspectives through TPACK and TAM. arXiv preprint arXiv:2503.08003

  10. [10]

    Unggi Lee, Sookbun Lee, Heungsoo Choi, Jinseo Lee, Haeun Park, Younghoon Jeon, Sungmin Cho, Minju Kang, Junbo Koh, Jiyeong Bae, Minwoo Nam, Juyeon Eun, Yeonji Jung, and Yeil Jeong. 2026. OpenLearnLM benchmark: A unified framework for evaluating knowledge, skill, and attitude in educational large language models. arXiv preprint arXiv:2601.13882

  11. [11]

    Yibo Liu, Zhenting Yang, Yangming Chen, Yingchaojie Wang, Shan Ai, Haoran Wang, Bin Yu, Liang Zhu, and Qing Liao. 2024. T2VSafetyBench: Evaluating the safety of text-to-video generative models. arXiv preprint arXiv:2407.05965

  12. [12]

    Yuanxin Liu, Lei Li, Shuhuai Ren, Rundong Gao, Shicheng Li, Sishuo Chen, Xu Sun, and Lu Hou. 2023. FETV: A benchmark for fine-grained evaluation of open-domain text-to-video generation. arXiv preprint arXiv:2311.01813

  13. [13]

    K. M. Megha Mariam, Aditya Arun, Zakaria Laskar, and C. V. Jawahar. 2026. PhyEduVideo: A benchmark for evaluating text-to-video models for physics education. arXiv preprint arXiv:2601.00943

  14. [14]

    Richard E. Mayer, editor. 2014. The Cambridge Handbook of Multimedia Learning, 2nd edition. Cambridge University Press.

  15. [15]

    Michael Noetel, Shantell Griffith, Oscar Delaney, Taren Sanders, Philip Parker, Borja del Pozo Cruz, and Chris Lonsdale. 2021. Video improves learning in higher education: A systematic review. Review of Educational Research, 91(2)

  16. [16]

    Arjun Panickssery, Samuel R. Bowman, and Shi Feng. 2024. LLM evaluators recognize and favor their own generations. arXiv preprint arXiv:2404.13076

  17. [17]

    PhyWorldBench Authors. 2025. PhyWorldBench: A benchmark for evaluating physical realism in video generation models. arXiv preprint arXiv:2507.13428

  18. [18]

    John Sweller. 2023. The development of cognitive load theory: Replication, generalization, and contentious issues. Educational Psychology Review, 35

  19. [19]

    Jingqi Tong, Yurong Mou, Hangcheng Li, Mingzhe Li, Yongzhuo Yang, Ming Zhang, Qiguang Chen, Tianyi Liang, Xiaomeng Hu, Yining Zheng, Xinchi Chen, Jun Zhao, Xuanjing Huang, and Xipeng Qiu. 2025. Thinking with video: Video generation as a promising multimodal reasoning paradigm. arXiv preprint arXiv:2511.04570

  20. [20]

    Thaddäus Wiedemer, Yuxuan Li, Paul Vicol, Shixiang Shane Gu, Nick Matarese, Kevin Swersky, Been Kim, Priyank Jaini, and Robert Geirhos. 2025. Video models are zero-shot learners and reasoners. arXiv preprint arXiv:2509.20328

  21. [21]

    Lingyong Yan, Jiulong Wu, Dong Xie, Weixian Shi, Deguo Xia, and Jizhou Huang. 2026. Beyond end-to-end video models: An LLM-based multi-agent system for educational video generation. arXiv preprint arXiv:2602.11790

  22. [22]

    Dian Zheng, Ziqi Huang, Hongbo Liu, Kai Zou, Yinan He, Fan Zhang, Lulu Gu, Yuanhan Zhang, Jingwen He, Wei-Shi Zheng, Yu Qiao, and Ziwei Liu. 2025. VBench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness. arXiv preprint arXiv:2503.21755