
arxiv: 2605.16295 · v1 · pith:4XMGBEOVnew · submitted 2026-04-15 · 💻 cs.CY · cs.AI · cs.CL · cs.GR · cs.HC · cs.MM

ANVIL: Analogies and Videos for Lecturers

Pith reviewed 2026-05-21 00:50 UTC · model grok-4.3

keywords: analogies, instructional animations, generative AI, computer science education, educational technology, user study, Manim

The pith

ANVIL automates the generation of analogy-based animations for teaching computer science concepts, yielding materials educators rate as adequate and usable.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

ANVIL takes a concept definition and produces a textual analogy, structures it as a visual screenplay, and generates Manim code to create an animation, including automatic repairs for robustness. The authors first gather feedback from teachers to understand quality, then build an LLM-based evaluator for the analogies and an automated fidelity check for the videos to allow larger-scale assessment. They also run a user study with educators to explore how such a tool might fit into teaching practice and what concerns it raises. The results indicate that the system often produces adequate materials and that educators see potential value in it for creating instructional content. This approach addresses the challenge of making engaging visual explanations without extensive manual effort from lecturers.

Core claim

The central discovery is that a multimodal generative pipeline can create analogy-driven instructional animations for computer science topics that meet basic standards of quality as judged by teachers, supported by scalable automated evaluation methods and positive feedback on usability from educators.

What carries the argument

The ANVIL pipeline, which converts concept definitions into textual analogies, then into structured visual screenplays, and finally into executable Manim code with an automated repair mechanism to enhance output reliability.
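
The generate, check, and repair step described above can be sketched as a bounded loop. This is a minimal illustration and not ANVIL's implementation: `check_code`, `repair_loop`, the `repair` callable (standing in for an LLM call), and the default limit of three iterations are all assumptions. A real system would render the Manim scene in an isolated process rather than call `exec` directly.

```python
from typing import Callable, Optional, Tuple


def check_code(source: str) -> Optional[str]:
    """Return None if the source compiles and runs, else an error message.

    Assumption: running the code is the check. exec() on untrusted,
    generated code is unsafe outside a sandbox; it is used here only
    to keep the sketch self-contained.
    """
    try:
        code = compile(source, "<generated>", "exec")
        exec(code, {"__name__": "__generated__"})
    except Exception as exc:  # SyntaxError, NameError, and so on
        return f"{type(exc).__name__}: {exc}"
    return None


def repair_loop(
    source: str,
    repair: Callable[[str, str], str],
    max_iterations: int = 3,
) -> Tuple[str, int, bool]:
    """Check the code, and on failure ask `repair` for a new version.

    Returns (final_source, repairs_applied, succeeded). The loop is
    bounded: at most `max_iterations` repairs are attempted.
    """
    for iteration in range(max_iterations + 1):
        error = check_code(source)
        if error is None:
            return source, iteration, True
        if iteration == max_iterations:
            break
        # Hypothetical repair step: a real pipeline would send the
        # source and the error message to a model here.
        source = repair(source, error)
    return source, max_iterations, False
```

With a stub such as `lambda src, err: "undefined_name = 1\n" + src`, the loop repairs a `NameError` in one pass. Bounding the number of repairs keeps cost predictable and makes failures visible instead of retrying indefinitely.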

If this is right

  • Generated animations can supplement or replace manual creation of teaching visuals for computer science topics.
  • Educators may integrate such tools into their workflow to produce materials more efficiently.
  • Automated evaluators and fidelity proxies enable quality assessment at scales larger than manual reviews alone.
  • Repair mechanisms in code generation improve the success rate of producing functional animations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Extending ANVIL to non-computer science subjects could broaden its utility in education.
  • Testing whether these animations improve student learning outcomes would provide stronger evidence of effectiveness.
  • Addressing potential biases in the automated screening process could strengthen the reliability of the quality claims.

Load-bearing premise

The assumption that the LLM-based evaluator and the automated video-fidelity proxy accurately reflect pedagogical quality as judged by human teachers, without systematic bias in the screening process.
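
One way to probe this premise is to compare the evaluator's pass/fail decisions with teacher judgments on the same items. The sketch below computes a confusion table and Cohen's kappa for binary labels. The choice of kappa, the binary "adequate" framing, and the function names are assumptions made for illustration; the paper does not state which agreement statistic it uses.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple


def confusion_counts(
    teacher: Sequence[bool], proxy: Sequence[bool]
) -> Dict[Tuple[bool, bool], int]:
    """Count (teacher_label, proxy_label) pairs; True means 'adequate'."""
    if len(teacher) != len(proxy) or len(teacher) == 0:
        raise ValueError("label sequences must be non-empty and equal length")
    return Counter(zip(teacher, proxy))


def cohen_kappa(teacher: Sequence[bool], proxy: Sequence[bool]) -> float:
    """Chance-corrected agreement between two binary raters."""
    counts = confusion_counts(teacher, proxy)
    n = len(teacher)
    # Observed agreement: both say adequate, or both say inadequate.
    observed = (counts[(True, True)] + counts[(False, False)]) / n
    # Agreement expected by chance from each rater's marginal rates.
    p_teacher = sum(teacher) / n
    p_proxy = sum(proxy) / n
    expected = p_teacher * p_proxy + (1 - p_teacher) * (1 - p_proxy)
    if expected == 1.0:
        # Both raters gave every item the same single label;
        # kappa is undefined in that case.
        return float("nan")
    return (observed - expected) / (1 - expected)
```

A kappa near zero would suggest the evaluator agrees with teachers no more often than chance would predict, even if raw agreement looks high because most items are adequate.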

What would settle it

A larger-scale evaluation where independent human teachers review many ANVIL-generated animations and rate a substantial portion as inadequate for teaching purposes.
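
How large such an evaluation needs to be depends on how precisely the adequacy rate must be estimated. As a rough sketch, a Wilson score interval gives a range for the true proportion of adequate animations from a sample of ratings. The function name and the 95% default (`z = 1.96`) are assumptions for illustration; no counts from the paper are used here.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    successes: number of items rated adequate
    n: number of items rated
    z: normal quantile (1.96 corresponds to roughly 95% coverage)
    """
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half), min(1.0, center + half)
```

For the same observed rate, the interval narrows as `n` grows, which is why a claim such as "a substantial portion is inadequate" is easier to support or refute with many independent ratings than with a handful.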

Figures

Figures reproduced from arXiv: 2605.16295 by Anastasiia Birillo, Gosia Migut, Yuri Noviello.

Figure 1: Overview of ANVIL’s generation pipeline. Given a target CS concept, ANVIL (1) synthesizes an analogy, (2) compiles it into visual elements and a structured screenplay, (3) generates an executable manim program and applies a bounded agentic repair loop to fix compilation/runtime errors, and (4) renders the animation.
3 ANVIL Architecture: ANVIL integrates analogy generation and automatic creation of manim [3…
Figure 2: Model–View–Controller Pattern as a Restaurant. The View captures the order. The Controller (waiter) forwards it to the Model (kitchen). The Model processes the order (dish prepared).
… a single iteration, 2 runs (4%) required two iterations, and 1 run (2%) required three iterations. All runs completed within the preset repair-iteration limit, indicating that the agent reliably resolves common code-generati…
Figure 3: Heatmaps visualizing the expert evaluation results on the artifact set.
… include: (i) Target Concept Coverage (TCC), assessing whether the analogy covers properties of the target concept definition; and (ii) Mapping Strength (MS), evaluating the logical consistency of source–target mappings. Together, these criteria capture both the comprehensiveness of the analogy and the appropriateness of the source–ta…
Original abstract

We present ANVIL, a multimodal generative system that automates the production of analogy-based instructional animations for computer science topics. Given a concept definition, ANVIL generates a textual analogy, compiles it into a structured visual screenplay, and produces executable manim code to render an animation, with an automated repair mechanism to improve robustness. Evaluating such systems at scale requires balancing pedagogical validity with scalability. We begin with a teacher evaluation to ground the quality assessment and use its findings to guide automated screening. For textual analogies, we introduce an LLM-based evaluator for scalable quality screening; for videos, where subjective judgments are difficult to automate, we instead assess fidelity to the intended screenplay using an automated proxy for auditing and error analysis. We further conduct a user study with educators to examine adoption requirements and risks. Our findings suggest that ANVIL can produce materials that are frequently rated as adequate, and that educators respond positively to its perceived value and usability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents ANVIL, a multimodal generative system that automates production of analogy-based instructional animations for computer science topics. Given a concept definition, it generates a textual analogy, compiles it into a structured visual screenplay, produces executable Manim code with an automated repair mechanism, and evaluates the output via an initial teacher evaluation to guide scalable automated screening (LLM-based evaluator for textual analogies; fidelity-to-screenplay proxy for videos) followed by a user study examining educator adoption, value, and usability. The central claim is that ANVIL materials are frequently rated as adequate and that educators respond positively to its perceived value and usability.

Significance. If the evaluation claims hold under rigorous quantitative validation, ANVIL would represent a meaningful step toward scalable, analogy-driven CS education tools that combine LLM generation with human-grounded screening. The workflow of grounding automated proxies in an initial teacher study and the focus on both textual and visual fidelity address real bottlenecks in educational content creation; reproducible pipelines or falsifiable predictions about analogy quality would further strengthen its contribution.

major comments (2)
  1. [Evaluation section] The abstract and evaluation description state that materials are 'frequently rated as adequate' after automated screening guided by teacher evaluation, yet no quantitative metrics (e.g., rating distributions, sample sizes, inter-rater reliability, or error analysis) are provided to support this claim. This absence leaves the central empirical finding without visible supporting data or controls.
  2. [Automated screening and proxy description] The workflow generates candidates, screens them with the LLM evaluator and video-fidelity proxy, then reports human ratings only on survivors. No correlation coefficient, confusion matrix, or held-out validation between automated scores and the original teacher ratings is described, so it is unclear whether the proxies preserve pedagogical signal or merely filter for format compliance and surface fluency.
minor comments (2)
  1. [LLM evaluator subsection] Clarify the exact criteria and prompt used by the LLM evaluator for textual analogies, including any rubrics for pedagogical alignment versus surface features.
  2. [Video fidelity proxy] The video proxy is described only at a high level; specify the concrete checks performed (object presence, timing, etc.) and any failure modes observed in the error analysis.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We agree that strengthening the quantitative support for our evaluation claims and validating the automated proxies will improve the manuscript. We address each major comment below and will incorporate the suggested enhancements in the revision.

point-by-point responses
  1. Referee: [Evaluation section] The abstract and evaluation description state that materials are 'frequently rated as adequate' after automated screening guided by teacher evaluation, yet no quantitative metrics (e.g., rating distributions, sample sizes, inter-rater reliability, or error analysis) are provided to support this claim. This absence leaves the central empirical finding without visible supporting data or controls.

    Authors: We agree that explicit quantitative metrics are needed to substantiate the claim. The teacher evaluation was performed to ground the quality assessment and guide the design of the automated screening, but the initial submission described the process at a high level without including rating distributions, sample sizes, inter-rater reliability, or error analysis. In the revised manuscript we will add these details, along with the relevant controls, to make the supporting data visible. revision: yes

  2. Referee: [Automated screening and proxy description] The workflow generates candidates, screens them with the LLM evaluator and video-fidelity proxy, then reports human ratings only on survivors. No correlation coefficient, confusion matrix, or held-out validation between automated scores and the original teacher ratings is described, so it is unclear whether the proxies preserve pedagogical signal or merely filter for format compliance and surface fluency.

    Authors: We acknowledge the importance of demonstrating that the proxies retain pedagogical signal. The manuscript explains that the teacher evaluation informed the LLM evaluator and fidelity proxy, yet direct quantitative validation (correlation coefficients, confusion matrices, or held-out analysis) was not reported. We will add this analysis in the revision to show the relationship between automated scores and the original teacher ratings. revision: yes

Circularity Check

0 steps flagged

No circularity: evaluation steps remain independent of generation and screening pipeline

full rationale

The paper grounds its claims in an initial teacher evaluation whose findings are used only to design prompts and criteria for subsequent LLM screening and video-fidelity proxy; final adequacy rates and usability judgments come from separate human ratings and an educator user study. No equations, fitted parameters, self-citations, or ansatzes appear in the provided text. The workflow explicitly separates generation, automated filtering, and human assessment, so the reported adequacy rate is not forced by construction from the screening rules themselves.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim depends on the reliability of LLM-generated analogies and the validity of the automated quality proxies; no free parameters or new physical entities are introduced, but the work rests on domain assumptions about current generative models.

axioms (2)
  • domain assumption: Large language models can produce pedagogically useful analogies for computer science concepts when given a definition.
    Invoked as the starting point for the textual analogy generation step.
  • domain assumption: An automated proxy measuring fidelity to a screenplay is a sufficient stand-in for human judgment of video quality.
    Used to enable scalable auditing when subjective video assessment is hard to automate.
invented entities (1)
  • ANVIL pipeline (no independent evidence)
    purpose: End-to-end automation of analogy-based instructional animations including repair mechanism
    New system introduced by the paper; no independent falsifiable evidence provided beyond the described evaluations.

pith-pipeline@v0.9.0 · 5706 in / 1429 out tokens · 40449 ms · 2026-05-21T00:50:47.261978+00:00 · methodology


