pith. sign in

arxiv: 2604.18364 · v1 · submitted 2026-04-20 · 💻 cs.AI · cs.GR· cs.MA

Training and Agentic Inference Strategies for LLM-based Manim Animation Generation

Pith reviewed 2026-05-10 05:01 UTC · model grok-4.3

classification 💻 cs.AI cs.GRcs.MA
keywords LLMManimanimation generationGRPOreinforcement learningfine-tuningagentic inferencetext-to-code
0
0 comments X

The pith

Training open-source LLMs with GRPO and renderer-in-the-loop inference with documentation produces Manim animations that match or exceed GPT-4.1 in render success and visual similarity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Manim animation generation demands LLMs master spatial reasoning, timing, and library-specific code, areas where general training falls short. The paper introduces ManimTrainer for combining supervised fine-tuning with group relative policy optimization using rewards that check both code and rendered visuals, plus ManimAgent for inference loops that render, observe errors, and fix using API docs. Testing seventeen models under nine strategy mixes on ManimBench reveals that supervised fine-tuning lifts code quality while the reinforcement step improves visuals and makes models better at self-correction. The top result comes from Qwen 3 Coder 30B after GRPO training and with the documentation-augmented renderer loop, scoring 94 percent successful renders and 85.7 percent visual match to reference videos, three points above the GPT-4.1 baseline. The study also finds that training makes code metrics predict visual quality better, but adding agentic inference at test time weakens that link, showing the two approaches complement each other.

Core claim

The paper establishes that a unified training and inference approach for LLM-based Manim animation generation yields superior results when models undergo supervised fine-tuning followed by group relative policy optimization with a fused code-visual reward, and then employ renderer-in-the-loop inference augmented with API documentation. Under this regime the Qwen 3 Coder 30B model achieves a 94% render success rate and 85.7% visual similarity to reference videos, outperforming the GPT-4.1 baseline by three percentage points in visual similarity. Analysis of metric correlations demonstrates that training phases strengthen the relationship between code quality and visual outcomes, whereas the R

What carries the argument

ManimTrainer, a pipeline for supervised fine-tuning combined with group relative policy optimization using a unified reward signal fusing code and visual assessments, and ManimAgent, an inference pipeline with renderer-in-the-loop and API documentation-augmented renderer-in-the-loop strategies.

If this is right

  • Supervised fine-tuning generally improves code quality in the generated Manim scripts.
  • Group relative policy optimization enhances the visual quality of the resulting animations and increases model responsiveness to correction signals.
  • The correlation between code metrics and visual metrics strengthens after training but weakens when inference enhancements are applied.
  • Certain open-source models under 30 billion parameters can achieve higher visual similarity than GPT-4.1 when using these combined strategies.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The observed complementarity between training and agentic inference suggests similar hybrid approaches could improve performance in other domains where code must produce verifiable visual or execution outputs.
  • Reliance on custom visual similarity metrics points to the need for developing standardized perceptual benchmarks for animation quality assessment.
  • Extending the ManimBench dataset or applying the pipelines to related libraries could test broader applicability of the training and inference strategies.

Load-bearing premise

The custom render success rate and visual similarity metric serve as reliable proxies for true animation quality and correctness in the absence of human validation studies or more rigorous perceptual evaluation methods.

What would settle it

If a human preference study comparing animations from the top-performing model against those from GPT-4.1 shows that viewers rate the GPT-4.1 outputs as superior or equal in a majority of cases, this would falsify the claim of outperformance based on the reported metrics.

Figures

Figures reproduced from arXiv: 2604.18364 by Ahmad Lotfi, Golnaz Shahtahmassebi, Isibor Kennedy Ihianle, Jordan J. Bird, Ravidu Suien Rammuni Silva.

Figure 1
Figure 1. Figure 1: Overview of the ManimTrainer pipeline. The pipeline uses a quantised base LLM and visually grounds it for [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Inference Pipeline of the ManimAgent. Algorithm 1 Feedback + RAG-Augmented Iterative Code Generation (RITL/RITL-DOC) Require: Textual description d, maximum RITL rounds K, model M, Manim API knowledge base D Ensure: Generated Manim code c, render status s 1: c1 ← M(d) ▷ Initial code generation 2: s1, e1 ← ManimRender(c1) ▷ Render and capture status/errors 3: for k = 1 to K do 4: if sk = success then 5: bre… view at source ↗
Figure 3
Figure 3. Figure 3: Behaviour of CodeBERTBLEU score against Visual Similarity during the training cycles. [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Scaling Behaviour of the Visual Similarity in the fine-tuned LLMs. [PITH_FULL_IMAGE:figures/full_fig_p010_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Radar plot of CodeBERTBLEU, Visual Similarity, and Render Success Rate of the models grouped by model [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Ablation experiment results conducted to investigate the effect of training strategy on the visual similarity [PITH_FULL_IMAGE:figures/full_fig_p012_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Ablation experiment results conducted to investigate the effect of inference strategy on visual similarity [PITH_FULL_IMAGE:figures/full_fig_p013_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Variation of mean Visual Similarity of the fine-tuned models across the inference strategies. [PITH_FULL_IMAGE:figures/full_fig_p013_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Comparison of performance between RITL-DOC@1 and RITL-DOC@3 [PITH_FULL_IMAGE:figures/full_fig_p014_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Spearman Rank Correlation Between Evaluation Metrics calculated for each model under all combinations [PITH_FULL_IMAGE:figures/full_fig_p015_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Best Case: Fine-tuned model surpasses the base model. [PITH_FULL_IMAGE:figures/full_fig_p016_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Worst Case: Base model surpasses fine-tuned model [PITH_FULL_IMAGE:figures/full_fig_p016_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Rescue Case: Inference strategy outperforming vanilla inference. [PITH_FULL_IMAGE:figures/full_fig_p017_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Outlier Case: An example of a smaller Ministral 8B SFT model outperforming Mistral Small 3.2 24B SFT [PITH_FULL_IMAGE:figures/full_fig_p018_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Examples from the top performing models. The top example shows that the GRPO has fixed the issue of text [PITH_FULL_IMAGE:figures/full_fig_p019_15.png] view at source ↗
read the original abstract

Generating programmatic animation using libraries such as Manim presents unique challenges for Large Language Models (LLMs), requiring spatial reasoning, temporal sequencing, and familiarity with domain-specific APIs that are underrepresented in general pre-training data. A systematic study of how training and inference strategies interact in this setting is lacking in current research. This study introduces ManimTrainer, a training pipeline that combines Supervised Fine-tuning (SFT) with Reinforcement Learning (RL) based Group Relative Policy Optimisation (GRPO) using a unified reward signal that fuses code and visual assessment signals, and ManimAgent, an inference pipeline featuring Renderer-in-the-loop (RITL) and API documentation-augmented RITL (RITL-DOC) strategies. Using these techniques, this study presents the first unified training and inference study for text-to-code-to-video transformation with Manim. It evaluates 17 open-source sub-30B LLMs across nine combinations of training and inference strategies using ManimBench. Results show that SFT generally improves code quality, while GRPO enhances visual outputs and increases the models' responsiveness to extrinsic signals during self-correction at inference time. The Qwen 3 Coder 30B model with GRPO and RITL-DOC achieved the highest overall performance, with a 94% Render Success Rate (RSR) and 85.7% Visual Similarity (VS) to reference videos, surpassing the baseline GPT-4.1 model by +3 percentage points in VS. Additionally, the analysis shows that the correlation between code and visual metrics strengthens with SFT and GRPO but weakens with inference-time enhancements, highlighting the complementary roles of training and agentic inference strategies in Manim animation generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces ManimTrainer, a pipeline that combines supervised fine-tuning (SFT) with Group Relative Policy Optimization (GRPO) using a unified reward fusing code and visual signals, and ManimAgent, an inference pipeline with Renderer-in-the-Loop (RITL) and API-documentation-augmented RITL (RITL-DOC). It evaluates 17 open-source sub-30B LLMs across nine training-inference combinations on ManimBench, reporting that SFT improves code quality, GRPO enhances visual outputs and self-correction responsiveness, and that the Qwen 3 Coder 30B model with GRPO plus RITL-DOC attains the best results: 94% Render Success Rate and 85.7% Visual Similarity, exceeding the GPT-4.1 baseline by 3 percentage points in VS. The work also analyzes how correlations between code and visual metrics strengthen under training but weaken under inference enhancements.

Significance. If the metrics hold, this is the first systematic examination of how SFT, GRPO, and renderer-in-the-loop inference interact for LLM-based Manim animation generation, a task requiring spatial reasoning, temporal sequencing, and domain-specific API knowledge. The scale (17 models, multiple strategy combinations) and concrete trends (SFT for code, GRPO for visuals) supply practical guidance for fine-tuning LLMs on programmatic visual tasks. The complementary roles of training versus agentic inference are a useful empirical contribution to LLM-agent and code-generation research.

major comments (1)
  1. [§4 and §5.2] §4 (Evaluation Metrics) and §5.2 (Results): The headline claim that Qwen 3 Coder 30B + GRPO + RITL-DOC reaches 94% RSR and 85.7% VS (surpassing GPT-4.1 by +3 pp) and that GRPO 'enhances visual outputs' rests entirely on the custom Visual Similarity (VS) metric. The manuscript defines VS via a fused code+visual reward but reports no correlation analysis with human ratings, no comparison to established video metrics (FVD, video-LPIPS, or temporally weighted SSIM), and no ablation showing that VS improvements reflect motion coherence or timing correctness rather than static frame overlap. Because VS is the primary basis for model ranking and for attributing gains to GRPO, this absence is load-bearing for the central empirical conclusions.
minor comments (2)
  1. [Abstract and §3] The abstract and methods would benefit from an explicit statement of ManimBench size, test-case selection criteria, and how reference videos were generated, to allow readers to assess generalizability.
  2. [§3] Notation for the fused reward components and the exact implementation of RITL-DOC could be introduced with a short equation or pseudocode block for clarity.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback on our evaluation methodology. We address the concern regarding the Visual Similarity metric point by point below and outline targeted revisions.

read point-by-point responses
  1. Referee: [§4 and §5.2] §4 (Evaluation Metrics) and §5.2 (Results): The headline claim that Qwen 3 Coder 30B + GRPO + RITL-DOC reaches 94% RSR and 85.7% VS (surpassing GPT-4.1 by +3 pp) and that GRPO 'enhances visual outputs' rests entirely on the custom Visual Similarity (VS) metric. The manuscript defines VS via a fused code+visual reward but reports no correlation analysis with human ratings, no comparison to established video metrics (FVD, video-LPIPS, or temporally weighted SSIM), and no ablation showing that VS improvements reflect motion coherence or timing correctness rather than static frame overlap. Because VS is the primary basis for model ranking and for attributing gains to GRPO, this absence is load-bearing for the central empirical conclusions.

    Authors: We agree that stronger external validation of VS would increase confidence in the ranking and in the attribution of gains to GRPO. VS is deliberately constructed as a fused signal: after rendering succeeds, it combines (i) code-execution rewards that penalize API misuse or runtime errors with (ii) frame-wise visual similarity computed on the rendered video. This fusion is motivated by the programmatic nature of the task, where pure perceptual metrics can reward visually plausible but semantically incorrect animations. We already report in §5.3 that code-visual metric correlations increase under GRPO, providing indirect support that VS tracks meaningful improvements. Nevertheless, we did not conduct human ratings, direct comparisons against FVD/video-LPIPS, or a dedicated static-vs-temporal ablation. In the revised manuscript we will (a) expand §4 with the precise formulation of the fused reward and its weighting, (b) add a limitations paragraph in §6 explicitly noting the absence of human correlation and established video metrics, and (c) include a limited post-hoc breakdown (on the top-performing models) that separates static-frame overlap from temporal coherence components of VS. These changes clarify the metric’s scope without requiring a full re-evaluation of the 17-model study. revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical evaluation is self-contained

full rationale

The paper reports an empirical study of LLM training (SFT + GRPO) and inference strategies (RITL, RITL-DOC) for Manim code generation, evaluated on ManimBench via measured Render Success Rate and a custom Visual Similarity metric. No equations, derivations, or fitted-parameter predictions appear in the provided text or abstract; performance claims rest on direct test-set outcomes rather than any self-referential reduction. Self-citations, if present, are not load-bearing for any central result, and the work does not invoke uniqueness theorems or rename known results as novel derivations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no specific free parameters, axioms, or invented entities can be identified from the provided text.

pith-pipeline@v0.9.0 · 5643 in / 1152 out tokens · 58022 ms · 2026-05-10T05:01:22.547797+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. PRISM: A Benchmark for Programmatic Spatial-Temporal Reasoning

    cs.AI 2026-05 unverdicted novelty 7.0

    PRISM benchmark of over 10k pairs shows LLMs have a 41% average drop from code execution success to spatial correctness in programmatic video generation.

Reference graph

Works this paper leans on

42 extracted references · cited by 1 Pith paper

  1. [1]

    Manim – Mathematical Animation Framework, 2026

    The Manim Community Developers. Manim – Mathematical Animation Framework, 2026

  2. [2]

    Evaluating Large Language Models Trained on Code, 2021

    Mark Chen, Tworek, et al. Evaluating Large Language Models Trained on Code, 2021. Version Number: 2

  3. [3]

    Qwen2.5- Coder Technical Report, 2024

    Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, Kai Dang, Yang Fan, Yichang Zhang, An Yang, Rui Men, Fei Huang, Bo Zheng, Yibo Miao, Shanghaoran Quan, Yunlong Feng, Xingzhang Ren, Xuancheng Ren, Jingren Zhou, and Junyang Lin. Qwen2.5- Coder Technical Report, 2024. Version Number: 3

  4. [4]

    Seed-Coder: Let the Code Model Curate Data for Itself, 2025

    ByteDance Seed, Yuyu Zhang, Jing Su, Yifan Sun, Chenguang Xi, Xia Xiao, Shen Zheng, Anxiang Zhang, Kaibo Liu, Daoguang Zan, Tao Sun, Jinhua Zhu, Shulin Xin, Dong Huang, Yetao Bai, Lixin Dong, Chao Li, Jianchong Chen, Hanzhi Zhou, Yifan Huang, Guanghan Ning, Xierui Song, Jiaze Chen, Siyao Liu, Kai Shen, Liang Xiang, and Yonghui Wu. Seed-Coder: Let the Code...

  5. [5]

    Ravidu Suien Rammuni Silva, Ahmad Lotfi, Isibor Kennedy Ihianle, Golnaz Shahtahmassebi, and Jordan J. Bird. Large Language Model Approaches to Educational Video Generation Using Manim. In Emma Hart, Tomas Horvath, Zhiyuan Tan, and Sarah Thomson, editors,Advances in Computational Intelligence Systems, volume 1468, pages 306–317. Springer Nature Switzerland...

  6. [6]

    Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-Rank Adaptation of Large Language Models, 2021. Version Number: 2

  7. [7]

    Qlora: efficient finetuning of quantized llms

    Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: efficient finetuning of quantized llms. InProceedings of the 37th International Conference on Neural Information Processing Systems, NIPS ’23, Red Hook, NY , USA, 2023. Curran Associates Inc

  8. [8]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y . K. Li, Y . Wu, and Daya Guo. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models, 2024. Version Number: 3

  9. [9]

    Teaching Large Language Models to Self- Debug

    Xinyun Chen, Maxwell Lin, Nathanael Schaerli, and Denny Zhou. Teaching Large Language Models to Self- Debug. In B. Kim, Y . Yue, S. Chaudhuri, K. Fragkiadaki, M. Khan, and Y . Sun, editors,International Conference on Learning Representations, volume 2024, pages 8746–8825, 2024

  10. [10]

    Retrieval-augmented generation for knowledge-intensive NLP tasks

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. InProceedings of the 34th International Conference on Neural Information Processing Systems, NIPS ’20...

  11. [11]

    Xu, Yiqing Xie, Graham Neubig, and Daniel Fried

    Zora Zhiruo Wang, Akari Asai, Xinyan Velocity Yu, Frank F. Xu, Yiqing Xie, Graham Neubig, and Daniel Fried. CodeRAG-Bench: Can Retrieval Augment Code Generation?, 2024. Version Number: 2

  12. [12]

    CodeBLEU: a Method for Automatic Evaluation of Code Synthesis, 2020

    Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundaresan, Ming Zhou, Ambrosio Blanco, and Shuai Ma. CodeBLEU: a Method for Automatic Evaluation of Code Synthesis, 2020. Version Number: 2

  13. [13]

    CodeBERT: A Pre-Trained Model for Programming and Natural Languages,

    Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. CodeBERT: A Pre-Trained Model for Programming and Natural Languages,

  14. [14]

    SSIM image quality metric for denoised images

    Peter Ndajah, Hisakazu Kikuchi, Masahiro Yukawa, Hidenori Watanabe, and Shogo Muramatsu. SSIM image quality metric for denoised images. InProceedings of the 3rd WSEAS international conference on Visualization, imaging and simulation, pages 53–57, 2010

  15. [15]

    Learning Transferable Visual Models From Natural Language Supervision, 2021

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning Transferable Visual Models From Natural Language Supervision, 2021. Version Number: 1

  16. [16]

    BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions, 2024

    Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, Simon Brunner, Chen Gong, Thong Hoang, Armel Randy Zebaze, Xiaoheng Hong, Wen-Ding Li, Jean Kaddour, Ming Xu, Zhihan Zhang, Prateek Yadav, Naman Jain, Alex Gu, Zhoujun Cheng, Jiawei Liu, Qian Liu, Zijian Wang, Biny...

  17. [17]

    JanusCoder: Towards a Foundational Visual-Programmatic Interface for Code Intelligence, 2025

    Qiushi Sun, Jingyang Gong, Yang Liu, Qiaosheng Chen, Lei Li, Kai Chen, Qipeng Guo, Ben Kao, and Fei Yuan. JanusCoder: Towards a Foundational Visual-Programmatic Interface for Code Intelligence, 2025. Version Number: 1

  18. [18]

    DeepSeek-R1 incentivizes reasoning in LLMs through reinforce- ment learning.Nature, 645(8081):633–638, September 2025

    Daya Guo, Dejian Yang, and Zhang and others. DeepSeek-R1 incentivizes reasoning in LLMs through reinforce- ment learning.Nature, 645(8081):633–638, September 2025

  19. [19]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedbac...

  20. [20]

    Direct Preference Optimization: Your Language Model is Secretly a Reward Model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neural Information Processing Systems, volume 36, pages 53728–53741, 2023

  21. [21]

    Reflexion: language agents with verbal reinforcement learning

    Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: language agents with verbal reinforcement learning. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neural Information Processing Systems, volume 36, pages 8634–8652, 2023

  22. [22]

    Self-Refine: Iterative Refinement with Self-Feedback

    Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-Refine: Iterative Refinement with Self-Feedback. In A. Oh, T. Naumann, A. Globerson, K. Saenko, ...

  23. [23]

    Ryo Kamoi, Yusen Zhang, Nan Zhang, Jiawei Han, and Rui Zhang. When Can LLMsActuallyCorrect Their Own Mistakes? A Critical Survey of Self-Correction of LLMs.Transactions of the Association for Computational Linguistics, 12:1417–1440, November 2024

  24. [24]

    Can LLMs Correct Themselves? A Benchmark of Self-Correction in LLMs, 2025

    Guiyao Tie, Zenghui Yuan, Zeli Zhao, Chaoran Hu, Tianhe Gu, Ruihang Zhang, Sizhe Zhang, Junran Wu, Xiaoyue Tu, Ming Jin, Qingsong Wen, Lixing Chen, Pan Zhou, and Lichao Sun. Can LLMs Correct Themselves? A Benchmark of Self-Correction in LLMs, 2025. Version Number: 2

  25. [25]

    Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, and Steven C.H. Hoi. CodeRL: mastering code generation through pretrained models and deep reinforcement learning. InProceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ’22, Red Hook, NY , USA, 2022. event-place: New Orleans, LA, USA

  26. [26]

    InterCode: standardizing and benchmarking interactive coding with execution feedback

    John Yang, Akshara Prabhakar, Karthik Narasimhan, and Shunyu Yao. InterCode: standardizing and benchmarking interactive coding with execution feedback. InProceedings of the 37th International Conference on Neural Information Processing Systems, NIPS ’23, Red Hook, NY , USA, 2023. Curran Associates Inc. event-place: New Orleans, LA, USA

  27. [27]

    Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press

    John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE-agent: agent-computer interfaces enable automated software engineering. InProceedings of the 38th International Conference on Neural Information Processing Systems, NIPS ’24, Red Hook, NY , USA, 2024. Curran Associates Inc. event-place: Vanco...

  28. [28]

    ReLook: Vision-Grounded RL with a Multimodal LLM Critic for Agentic Web Coding, 2025

    Yuhang Li, Chenchen Zhang, Ruilin Lv, Ao Liu, Ken Deng, Yuanxing Zhang, Jiaheng Liu, Wiggin Zhou, and Bo Zhou. ReLook: Vision-Grounded RL with a Multimodal LLM Critic for Agentic Web Coding, 2025. Version Number: 1

  29. [29]

    Keyframer: Empowering Animation Design using Large Language Models, 2024

    Tiffany Tseng, Ruijia Cheng, and Jeffrey Nichols. Keyframer: Empowering Animation Design using Large Language Models, 2024

  30. [30]

    VinciCoder: Unifying Multimodal Code Generation via Coarse-to-fine Visual Reinforce- ment Learning, 2025

    Xuanle Zhao, Deyang Jiang, Zhixiong Zeng, Lei Chen, Haibo Qiu, Jing Huang, Yufeng Zhong, Liming Zheng, Yilin Cao, and Lin Ma. VinciCoder: Unifying Multimodal Code Generation via Coarse-to-fine Visual Reinforce- ment Learning, 2025. Version Number: 2

  31. [31]

    LogoMotion: Visually-Grounded Code Synthesis for Creating and Editing Animation

    Vivian Liu, Rubaiat Habib Kazi, Li-Yi Wei, Matthew Fisher, Timothy Langlois, Seth Walker, and Lydia Chilton. LogoMotion: Visually-Grounded Code Synthesis for Creating and Editing Animation. InProceedings of the 2025 CHI Conference on Human Factors in Computing Systems, CHI ’25, New York, NY , USA, 2025. Association for Computing Machinery

  32. [32]

    Manimator: Transforming Research Papers into Visual Explanations, 2025

    Samarth P, Vyoman Jain, Shiva Golugula, and Motamarri Sai Sathvik. Manimator: Transforming Research Papers into Visual Explanations, 2025. Version Number: 1. 21 Training and Agentic Inference Strategies for LLM-based Manim Animation GenerationRammuni Silva et al

  33. [33]

    TheoremExplainAgent: Towards video-based multimodal explanations for llm theorem understanding

    Max Ku, Cheuk Hei Chong, Jonathan Leung, Krish Shah, Alvin Yu, and Wenhu Chen. TheoremExplainAgent: Towards video-based multimodal explanations for llm theorem understanding. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers), pages 6663–6684, 2025

  34. [34]

    Code2Video: A Code-centric Paradigm for Educational Video Generation, 2025

    Yanzhe Chen, Kevin Qinghong Lin, and Mike Zheng Shou. Code2Video: A Code-centric Paradigm for Educational Video Generation, 2025. Version Number: 1

  35. [35]

    Bovik, H.R

    Zhou Wang, A.C. Bovik, H.R. Sheikh, and E.P. Simoncelli. Image quality assessment: from error visibility to structural similarity.IEEE Transactions on Image Processing, 13(4):600–612, 2004

  36. [36]

    Toward accurate dynamic time warping in linear time and space.Intell

    Stan Salvador and Philip Chan. Toward accurate dynamic time warping in linear time and space.Intell. Data Anal., 11(5):561–580, October 2007

  37. [37]

    Understanding R1-Zero-Like Training: A Critical Perspective, 2025

    Zichen Liu, Changyu Chen, and Li and others. Understanding R1-Zero-Like Training: A Critical Perspective, 2025

  38. [38]

    Unsloth, 2023

    Michael Han Daniel Han and Unsloth team. Unsloth, 2023

  39. [39]

    Qwen3 Technical Report, 2025

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

  40. [40]

    The Llama 3 Herd of Models, 2024

    Aaron Grattafiori and Dubey and others. The Llama 3 Herd of Models, 2024. Version Number: 3

  41. [41]

    Liu and Khandelwal and others

    Alexander H. Liu and Khandelwal and others. Ministral 3, 2026. Version Number: 1

  42. [42]

    Rae, and Laurent Sifre

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Oriol Vinyals, Jack W. Rae, and Laurent Sifre...