ANVIL: Analogies and Videos for Lecturers
Pith reviewed 2026-05-21 00:50 UTC · model grok-4.3
The pith
ANVIL automates the generation of analogy-based animations for teaching computer science concepts, yielding materials that educators frequently rate as adequate and perceive as usable.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a multimodal generative pipeline can produce analogy-driven instructional animations for computer science topics that teachers judge to meet basic quality standards. The claim rests on scalable automated evaluation methods and on positive educator feedback about usability.
What carries the argument
The ANVIL pipeline, which converts concept definitions into textual analogies, then into structured visual screenplays, and finally into executable Manim code with an automated repair mechanism to enhance output reliability.
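To make the control flow concrete, here is a minimal Python sketch of a staged pipeline with a bounded repair loop. It is an illustration only, not the authors' implementation: the names `run_with_repair`, `staged_pipeline`, and `RenderResult`, the three-attempt limit, and the stub stages standing in for the LLM calls and the Manim renderer are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class RenderResult:
    """Outcome of one render attempt (hypothetical type for this sketch)."""
    ok: bool
    error: Optional[str] = None


def run_with_repair(
    code: str,
    render: Callable[[str], RenderResult],
    repair: Callable[[str, str], str],
    max_attempts: int = 3,
) -> Tuple[str, bool, int]:
    """Render `code`; on failure, request a patched script and retry.

    Returns (final_code, succeeded, attempts_used).
    """
    attempts = 0
    while attempts < max_attempts:
        attempts += 1
        result = render(code)
        if result.ok:
            return code, True, attempts
        if attempts < max_attempts:
            # Feed the error message back so the repair step can target it.
            code = repair(code, result.error or "")
    return code, False, attempts


def staged_pipeline(definition, make_analogy, make_screenplay, make_code,
                    render, repair, max_attempts: int = 3):
    """Concept definition -> analogy -> screenplay -> code -> render/repair."""
    analogy = make_analogy(definition)
    screenplay = make_screenplay(analogy)
    code = make_code(screenplay)
    return run_with_repair(code, render, repair, max_attempts)


# Stub stages (assumptions): real stages would call an LLM and a renderer.
def stub_analogy(definition: str) -> str:
    return f"A stack is like a pile of plates ({definition})"


def stub_screenplay(analogy: str) -> List[str]:
    return ["show plates", "push plate", "pop plate"]


def stub_code(screenplay: List[str]) -> str:
    # Deliberately broken first draft so the repair loop has work to do.
    return "scene = UNDEFINED_NAME  # " + "; ".join(screenplay)


def stub_render(code: str) -> RenderResult:
    if "UNDEFINED_NAME" in code:
        return RenderResult(False, "NameError: UNDEFINED_NAME")
    return RenderResult(True)


def stub_repair(code: str, error: str) -> str:
    return code.replace("UNDEFINED_NAME", "'plates'")


final_code, ok, attempts = staged_pipeline(
    "last-in, first-out collection",
    stub_analogy, stub_screenplay, stub_code, stub_render, stub_repair,
)
```

Capping the number of attempts keeps a script that cannot be repaired from looping forever, and returning the attempt count makes the repair success rate measurable.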
If this is right
- Generated animations can supplement or replace manual creation of teaching visuals for computer science topics.
- Educators may integrate such tools into their workflow to produce materials more efficiently.
- Automated evaluators and fidelity proxies enable quality assessment at scales larger than manual reviews alone.
- Repair mechanisms in code generation improve the success rate of producing functional animations.
Where Pith is reading between the lines
- Extending ANVIL to non-computer science subjects could broaden its utility in education.
- Testing whether these animations improve student learning outcomes would provide stronger evidence of effectiveness.
- Addressing potential biases in the automated screening process could strengthen the reliability of the quality claims.
Load-bearing premise
The assumption that the LLM-based evaluator and the automated video-fidelity proxy accurately reflect pedagogical quality as judged by human teachers, without systematic bias in the screening process.
What would settle it
A larger-scale evaluation where independent human teachers review many ANVIL-generated animations and rate a substantial portion as inadequate for teaching purposes.
Original abstract
We present ANVIL, a multimodal generative system that automates the production of analogy-based instructional animations for computer science topics. Given a concept definition, ANVIL generates a textual analogy, compiles it into a structured visual screenplay, and produces executable manim code to render an animation, with an automated repair mechanism to improve robustness. Evaluating such systems at scale requires balancing pedagogical validity with scalability. We begin with a teacher evaluation to ground the quality assessment and use its findings to guide automated screening. For textual analogies, we introduce an LLM-based evaluator for scalable quality screening; for videos, where subjective judgments are difficult to automate, we instead assess fidelity to the intended screenplay using an automated proxy for auditing and error analysis. We further conduct a user study with educators to examine adoption requirements and risks. Our findings suggest that ANVIL can produce materials that are frequently rated as adequate, and that educators respond positively to its perceived value and usability.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents ANVIL, a multimodal generative system that automates production of analogy-based instructional animations for computer science topics. Given a concept definition, it generates a textual analogy, compiles it into a structured visual screenplay, produces executable Manim code with an automated repair mechanism, and evaluates the output via an initial teacher evaluation to guide scalable automated screening (LLM-based evaluator for textual analogies; fidelity-to-screenplay proxy for videos) followed by a user study examining educator adoption, value, and usability. The central claim is that ANVIL materials are frequently rated as adequate and that educators respond positively to its perceived value and usability.
Significance. If the evaluation claims hold under rigorous quantitative validation, ANVIL would represent a meaningful step toward scalable, analogy-driven CS education tools that combine LLM generation with human-grounded screening. The workflow of grounding automated proxies in an initial teacher study and the focus on both textual and visual fidelity address real bottlenecks in educational content creation; reproducible pipelines or falsifiable predictions about analogy quality would further strengthen its contribution.
major comments (2)
- [Evaluation section] The abstract and evaluation description state that materials are 'frequently rated as adequate' after automated screening guided by teacher evaluation, yet no quantitative metrics (e.g., rating distributions, sample sizes, inter-rater reliability, or error analysis) are provided to support this claim. This absence leaves the central empirical finding without visible supporting data or controls.
- [Automated screening and proxy description] The workflow generates candidates, screens them with the LLM evaluator and video-fidelity proxy, then reports human ratings only on survivors. No correlation coefficient, confusion matrix, or held-out validation between automated scores and the original teacher ratings is described, so it is unclear whether the proxies preserve pedagogical signal or merely filter for format compliance and surface fluency.
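The validation requested in the second major comment can be sketched in a few lines of Python. The sketch assumes teacher labels exist for candidates the screen rejected as well as for those it accepted, which is exactly what the referee says is missing. The function names and the eight-item toy dataset are invented for illustration and are not results from the paper.

```python
from collections import Counter
from typing import Sequence


def confusion_counts(auto_pass: Sequence[bool],
                     teacher_adequate: Sequence[bool]) -> Counter:
    """Count (automated decision, teacher judgment) pairs."""
    if len(auto_pass) != len(teacher_adequate):
        raise ValueError("both label lists must have the same length")
    return Counter(zip(auto_pass, teacher_adequate))


def cohens_kappa(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    labels = set(a) | set(b)
    expected = sum((list(a).count(l) / n) * (list(b).count(l) / n)
                   for l in labels)
    if expected == 1.0:
        return float("nan")  # undefined when both raters use a single label
    return (observed - expected) / (1.0 - expected)


# Toy data (invented): True means "passes screen" / "rated adequate".
auto = [True, True, True, False, False, True, False, True]
teacher = [True, True, False, False, False, True, True, True]

counts = confusion_counts(auto, teacher)
kappa = cohens_kappa(auto, teacher)
false_accepts = counts[(True, False)]   # screen passed, teacher said inadequate
false_rejects = counts[(False, True)]   # screen rejected, teacher said adequate
```

A high false-reject count would mean the screen discards material teachers find adequate. A high false-accept count would suggest the proxy rewards format compliance rather than pedagogical quality.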
minor comments (2)
- [LLM evaluator subsection] Clarify the exact criteria and prompt used by the LLM evaluator for textual analogies, including any rubrics for pedagogical alignment versus surface features.
- [Video fidelity proxy] The video proxy is described only at a high level; specify the concrete checks performed (object presence, timing, etc.) and any failure modes observed in the error analysis.
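As a hedged illustration of what "concrete checks" might look like, the Python sketch below compares a planned screenplay with what was observed in a rendered animation, scene by scene, checking object presence and timing. The paper's actual proxy is not specified in this review, so the `Scene` type, the two checks, the 0.5-second tolerance, and the scoring rule are all assumptions.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Sequence, Tuple


@dataclass(frozen=True)
class Scene:
    """One screenplay step: objects on screen and duration (hypothetical)."""
    objects: FrozenSet[str]
    duration_s: float


def fidelity_report(planned: Sequence[Scene],
                    observed: Sequence[Scene],
                    tolerance_s: float = 0.5) -> Tuple[float, List[Tuple[int, str]]]:
    """Return (score in [0, 1], list of (scene index, failure message)).

    Each planned scene contributes two checks: object presence and timing.
    """
    failures: List[Tuple[int, str]] = []
    checks = 0
    for i, plan in enumerate(planned):
        checks += 2
        if i >= len(observed):
            # A missing scene fails both of its checks.
            failures.append((i, "scene missing: objects not shown"))
            failures.append((i, "scene missing: no timing"))
            continue
        missing = plan.objects - observed[i].objects
        if missing:
            failures.append((i, f"missing objects: {sorted(missing)}"))
        if abs(plan.duration_s - observed[i].duration_s) > tolerance_s:
            failures.append((i, "timing outside tolerance"))
    score = 1.0 - len(failures) / checks if checks else 1.0
    return score, failures


# Toy example (invented): the second scene drops the "arrow" object.
planned = [Scene(frozenset({"stack", "plate"}), 4.0),
           Scene(frozenset({"stack", "arrow"}), 3.0)]
observed = [Scene(frozenset({"stack", "plate"}), 4.2),
            Scene(frozenset({"stack"}), 3.1)]

score, failures = fidelity_report(planned, observed)
```

Returning the individual failures alongside the score supports the kind of error analysis the referee asks for, since recurring failure messages point to systematic failure modes.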
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We agree that strengthening the quantitative support for our evaluation claims and validating the automated proxies will improve the manuscript. We address each major comment below and will incorporate the suggested enhancements in the revision.
- Referee: [Evaluation section] The abstract and evaluation description state that materials are 'frequently rated as adequate' after automated screening guided by teacher evaluation, yet no quantitative metrics (e.g., rating distributions, sample sizes, inter-rater reliability, or error analysis) are provided to support this claim. This absence leaves the central empirical finding without visible supporting data or controls.
  Authors: We agree that explicit quantitative metrics are needed to substantiate the claim. The teacher evaluation was performed to ground the quality assessment and guide the design of the automated screening, but the initial submission described the process at a high level without including rating distributions, sample sizes, inter-rater reliability, or error analysis. In the revised manuscript we will add these details, along with the relevant controls, to make the supporting data visible.
  Revision: yes
- Referee: [Automated screening and proxy description] The workflow generates candidates, screens them with the LLM evaluator and video-fidelity proxy, then reports human ratings only on survivors. No correlation coefficient, confusion matrix, or held-out validation between automated scores and the original teacher ratings is described, so it is unclear whether the proxies preserve pedagogical signal or merely filter for format compliance and surface fluency.
  Authors: We acknowledge the importance of demonstrating that the proxies retain pedagogical signal. The manuscript explains that the teacher evaluation informed the LLM evaluator and fidelity proxy, yet direct quantitative validation (correlation coefficients, confusion matrices, or held-out analysis) was not reported. We will add this analysis in the revision to show the relationship between automated scores and the original teacher ratings.
  Revision: yes
Circularity Check
No circularity: evaluation steps remain independent of generation and screening pipeline
The paper grounds its claims in an initial teacher evaluation whose findings are used only to design prompts and criteria for subsequent LLM screening and video-fidelity proxy; final adequacy rates and usability judgments come from separate human ratings and an educator user study. No equations, fitted parameters, self-citations, or ansatzes appear in the provided text. The workflow explicitly separates generation, automated filtering, and human assessment, so the reported adequacy rate is not forced by construction from the screening rules themselves.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Large language models can produce pedagogically useful analogies for computer science concepts when given a definition.
- domain assumption: An automated proxy measuring fidelity to a screenplay is a sufficient stand-in for human judgment of video quality.
invented entities (1)
- ANVIL pipeline: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: ANVIL employs a staged pipeline that generates an analogy (Textual Layer), compiles it into a structured visual screenplay (Screenplay Layer), and generates executable Python manim code (Code Layer).
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.