pith. machine review for the scientific record.

arxiv: 2604.11792 · v1 · submitted 2026-04-13 · 💻 cs.CV

Recognition: unknown

LottieGPT: Tokenizing Vector Animation for Autoregressive Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:21 UTC · model grok-4.3

classification 💻 cs.CV
keywords vector animation · Lottie · tokenization · autoregressive model · multimodal generation · SVG · animation dataset

The pith

LottieGPT tokenizes Lottie vector animations to support autoregressive generation from language and visual prompts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops the first system for turning vector animations into token sequences that an autoregressive model can learn to generate. It does this by creating a Lottie Tokenizer that encodes layered geometry, transforms, and motion keyframes, along with a dataset of 660,000 real animations. This matters because it lets generative AI produce resolution-independent, editable animations instead of pixel-based videos. Experiments show the tokenizer shortens sequences substantially, and the resulting model generalizes across many styles while outperforming prior work on SVG generation, the single-frame special case of vector animation.

Core claim

We introduce LottieGPT, the first framework to tokenize and autoregressively generate vector animations. By designing a Lottie Tokenizer that encodes layered geometric primitives, transforms, and keyframe-based motion into compact token sequences, and curating the LottieAnimation-660K dataset, we finetune Qwen-VL to produce coherent, editable vector animations directly from natural language or visual prompts, with strong generalization across styles and superior performance on SVG generation.

What carries the argument

The Lottie Tokenizer that encodes layered geometric primitives, transforms, and keyframe-based motion into a compact and semantically aligned token sequence.
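
To make that concrete, here is a minimal, editorial sketch of how a Lottie JSON document could be flattened into a discrete token sequence: structural tokens for layers, shapes, and properties, plus coarsely quantized numeric tokens for keyframe times and values. The dictionary keys ("layers", "ks", "k", "t", "s", "fr", "op", "ty") follow the public Lottie format as commonly documented; the token vocabulary, the quantization step, and every function name below are illustrative assumptions, not the paper's actual tokenizer.

```python
# Editorial sketch of Lottie-to-token flattening; vocabulary and quantization are assumptions.
QUANT_STEP = 4  # quantize numeric values to multiples of 4 (illustrative)

def quantize(x):
    """Map a float to a coarse integer token (illustrative)."""
    return f"<num_{int(round(float(x) / QUANT_STEP))}>"

def tokenize_property(name, prop, out):
    """Emit tokens for one property, handling both animated (keyframed) and static values."""
    out.append(f"<prop_{name}>")
    value = prop.get("k", prop) if isinstance(prop, dict) else prop
    if isinstance(value, list) and value and isinstance(value[0], dict):
        for kf in value:                      # animated: keyframes with time "t" and value "s"
            out.append("<keyframe>")
            out.append(quantize(kf.get("t", 0)))
            for v in kf.get("s", []):
                out.append(quantize(v))
    else:                                     # static: a scalar or a short vector
        for v in (value if isinstance(value, list) else [value]):
            out.append(quantize(v))

def tokenize_lottie(doc):
    """Flatten a parsed Lottie JSON dict into a single token sequence."""
    tokens = ["<lottie>", quantize(doc.get("fr", 30)), quantize(doc.get("op", 60))]
    for layer in doc.get("layers", []):
        tokens.append(f"<layer_{layer.get('ty', 0)}>")
        for name, prop in layer.get("ks", {}).items():   # layer transform block
            tokenize_property(name, prop, tokens)
        for shape in layer.get("shapes", []):
            tokens.append(f"<shape_{shape.get('ty', 'gr')}>")
    tokens.append("</lottie>")
    return tokens

# Tiny example: one shape layer whose position animates between two keyframes.
example = {
    "fr": 30, "op": 60,
    "layers": [{
        "ty": 4,
        "ks": {"p": {"a": 1, "k": [{"t": 0, "s": [100, 100]},
                                   {"t": 30, "s": [200, 100]}]}},
        "shapes": [{"ty": "el"}],
    }],
}
print(tokenize_lottie(example))
```

The open question the rest of this review keeps returning to is whether such a flattening can stay short while remaining losslessly invertible back to valid Lottie JSON.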

If this is right

  • Reduces sequence length while preserving structural fidelity for autoregressive training.
  • Produces coherent, editable vector animations from text or image prompts.
  • Generalizes well to diverse animation styles.
  • Outperforms previous models on SVG generation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Users could generate base animations and then tweak parameters in standard Lottie editors without starting from scratch.
  • The method opens possibilities for AI-assisted creation of interactive web content and mobile app animations.
  • Extending the tokenizer to handle more complex interactions or longer durations could be tested in follow-up work.

Load-bearing premise

That encoding animations into tokens via the Lottie Tokenizer retains enough information for the model to generate animations that remain structurally accurate, motion-coherent, and fully editable.

What would settle it

Running the generated animations in a standard Lottie player and checking for missing layers, incorrect motion paths, or playback failures compared to the original files.
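
A minimal sketch of that check, working on the Lottie JSON structure rather than an actual player, might look like the following; the function names are hypothetical, the "nm"/"ks"/"k" keys follow the public Lottie format as commonly documented, and true playback validation would still require rendering both files in a real Lottie runtime.

```python
import json

def keyframe_times(prop):
    """Keyframe times of an animated property, or [] if the property is static or absent."""
    k = prop.get("k") if isinstance(prop, dict) else None
    if isinstance(k, list) and k and isinstance(k[0], dict):
        return [kf.get("t") for kf in k]
    return []

def compare_lottie(reference_path, generated_path):
    """Structural proxy check: missing layers and diverging motion keyframes vs. a reference file."""
    ref = json.load(open(reference_path))
    gen = json.load(open(generated_path))
    ref_layers = {l.get("nm", f"layer_{i}"): l for i, l in enumerate(ref.get("layers", []))}
    gen_layers = {l.get("nm", f"layer_{i}"): l for i, l in enumerate(gen.get("layers", []))}
    report = {
        "missing_layers": sorted(set(ref_layers) - set(gen_layers)),
        "extra_layers": sorted(set(gen_layers) - set(ref_layers)),
        "keyframe_timing_mismatches": [],
    }
    for name in set(ref_layers) & set(gen_layers):
        for prop_name, prop in ref_layers[name].get("ks", {}).items():
            ref_t = keyframe_times(prop)
            gen_t = keyframe_times(gen_layers[name].get("ks", {}).get(prop_name, {}))
            if ref_t != gen_t:
                report["keyframe_timing_mismatches"].append((name, prop_name, ref_t, gen_t))
    return report
```

Even a clean report here would not rule out visual artifacts, so it is a lower bound on fidelity, not a substitute for playback.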

Figures

Figures reproduced from arXiv: 2604.11792 by Fei Ma, Hao Zhao, Junhao Chen, Kejun Gao, Mingjin Chen, Mingze Sun, Qi Tian, Ruqi Huang, Shaohui Wang, Xiaoxiao Long, Yuehan Cui.

Figure 1. LottieGPT generates editable vector animations from diverse inputs. Unlike existing models that produce fixed-resolution raster videos, we generate resolution-independent vector graphics and animations from text, image, or keyframes. Our outputs scale infinitely and enable direct editing of shapes and motion. These capabilities are impossible with pixel-based methods. view at source ↗
Figure 2. Data curation pipeline. We collected 10M SVG resources and 660K After Effects (AE) animation resources from the internet, then converted them to Lottie JSON format, filtered them using simplification algorithms that do not affect rendering results, and used Qwen-VL to generate text labels for vector graphics and vector animations. view at source ↗
Figure 3. Overview of LottieGPT. LottieGPT is built upon the pre-trained vision-language model Qwen2.5-VL and incorporates a Lottie tokenizer. The model encodes both text and image inputs as prefix tokens, while the Lottie tokenizer encodes vector animation commands into a unified representation space. We first train the model on static Lottie images, followed by training on Lottie animations. view at source ↗
Figure 4. Unlike raster pixel-based videos or frame-by-frame saved SVGs, the Lottie Tokenizer only stores… view at source ↗
Figure 5. Lottie animations generated by LottieGPT. view at source ↗
Figure 6. Performance of OmniSVG and LottieGPT on the text-to-vector graphics generation task. view at source ↗
Figure 7. Performance of OmniSVG, StarVector, and LottieGPT on the image-to-vector graphics generation task. view at source ↗
Figure 8. For the Text-to-Animation task, all LLM baselines were provided with identical 3-shot examples. view at source ↗
Figure 9. Using Text+Image as input to generate animations. Few-shot refers to providing three description-Lottie JSON data pairs. Except… view at source ↗
Figure 10. LottieGPT results on in-the-wild text-only inputs. view at source ↗
Figure 11. LottieGPT results on in-the-wild text-image inputs. view at source ↗
Figure 12. A manually edited Lottie animation where we modified the wing color using LottieLab. view at source ↗
Figure 13. LottieGPT may still generate cases that are renderable… view at source ↗
Figure 14. Representative examples from the four categories in the LottieSVG-10M dataset. From top to bottom: icon-designer, illustrator,… view at source ↗
Figure 15. File size distribution of the LottieSVG-10M dataset in the… view at source ↗
Figure 16. Word cloud visualization of tag distribution in… view at source ↗
Figure 17. The instruction for SVG data captioning. view at source ↗
Figure 18. The instruction for Lottie animation data captioning. view at source ↗
Figure 19. The instruction for text to Lottie image generation. view at source ↗
Figure 20. The instruction for text and image to Lottie image generation. view at source ↗
Figure 21. The instruction for text to Lottie animation generation. view at source ↗
Figure 22. The instruction for text and image to Lottie animation generation. view at source ↗
Figure 23. The instruction for text and video to Lottie animation generation. view at source ↗
Figure 25. Top 8 most common Bézier easing curves in Lottie animations, covering 75.4% of all usage. Each curve is defined by control points P1 = (ox, oy) and P2 = (ix, iy), with fixed endpoints at (0, 0) and (1, 1). Green dotted line shows linear interpolation for comparison. The legend on the right provides dataset statistics and visual element descriptions. The third most common pattern (15.78%) uses control po… view at source ↗ (a small sketch evaluating such an easing curve follows this figure list)
Figure 26. Bounce easing animation demonstrating extreme undershoot. (a) B… view at source ↗
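
Figure 25's description of easing can be made concrete: a Lottie easing curve is a cubic Bézier from (0, 0) to (1, 1) with control points P1 = (ox, oy) and P2 = (ix, iy), and the interpolation weight at a time fraction t is found by solving the x-component for the curve parameter and reading off the y-component. A small sketch follows, with control-point values and a bisection tolerance chosen here purely for illustration.

```python
def cubic_bezier(p1, p2, s):
    """Point at parameter s on a cubic Bézier with endpoints (0, 0), (1, 1) and control points p1, p2."""
    x = 3 * (1 - s) ** 2 * s * p1[0] + 3 * (1 - s) * s ** 2 * p2[0] + s ** 3
    y = 3 * (1 - s) ** 2 * s * p1[1] + 3 * (1 - s) * s ** 2 * p2[1] + s ** 3
    return x, y

def ease(t, p1, p2, tol=1e-6):
    """Weight at time fraction t: solve x(s) = t by bisection (x is monotone for valid easings), return y(s)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if cubic_bezier(p1, p2, mid)[0] < t:
            lo = mid
        else:
            hi = mid
    return cubic_bezier(p1, p2, lo)[1]

# A symmetric ease-in-out-like curve; a purely linear easing corresponds to p1=(1/3, 1/3), p2=(2/3, 2/3).
print(ease(0.5, p1=(0.42, 0.0), p2=(0.58, 1.0)))  # ~0.5 by symmetry
```
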
read the original abstract

Despite rapid progress in video generation, existing models are incapable of producing vector animation, a dominant and highly expressive form of multimedia on the Internet. Vector animations offer resolution-independence, compactness, semantic structure, and editable parametric motion representations, yet current generative models operate exclusively in raster space and thus cannot synthesize them. Meanwhile, recent advances in large multimodal models demonstrate strong capabilities in generating structured data such as slides, 3D meshes, LEGO sequences, and indoor layouts, suggesting that native vector animation generation may be achievable. In this work, we present the first framework for tokenizing and autoregressively generating vector animations. We adopt Lottie, a widely deployed JSON-based animation standard, and design a tailored Lottie Tokenizer that encodes layered geometric primitives, transforms, and keyframe-based motion into a compact and semantically aligned token sequence. To support large-scale training, we also construct LottieAnimation-660K, the largest and most diverse vector animation dataset to date, consisting of 660k real-world Lottie animation and 15M static Lottie image files curated from broad Internet sources. Building upon these components, we finetune Qwen-VL to create LottieGPT, a native multimodal model capable of generating coherent, editable vector animations directly from natural language or visual prompts. Experiments show that our tokenizer dramatically reduces sequence length while preserving structural fidelity, enabling effective autoregressive learning of dynamic vector content. LottieGPT exhibits strong generalization across diverse animation styles and outperforms previous state-of-the-art models on SVG generation (a special case of single-frame vector animation).
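
As a rough illustration of what finetuning a vision-language model on tokenized animations usually entails, the sketch below assembles a next-token-prediction example by concatenating prompt tokens (masked out of the loss) with serialized animation tokens. The -100 ignore index is a common causal-LM convention; encode_prompt, the vocabulary handling, and every name here are hypothetical stand-ins, not the paper's pipeline.

```python
IGNORE_INDEX = -100  # label value conventionally used to exclude prefix tokens from the loss

def encode_prompt(text):
    """Hypothetical stand-in for the VLM's own text/image tokenizer."""
    return [hash(word) % 50_000 for word in text.split()]

def lottie_token_ids(tokens, vocab):
    """Map string tokens (e.g. from a Lottie tokenizer) to integer ids, growing the vocab as needed."""
    return [vocab.setdefault(t, len(vocab)) for t in tokens]

def build_training_example(prompt, lottie_tokens, vocab):
    """Prefix (prompt) + target (animation tokens); only the target contributes to the loss."""
    prefix = encode_prompt(prompt)
    target = lottie_token_ids(lottie_tokens, vocab)
    return {"input_ids": prefix + target,
            "labels": [IGNORE_INDEX] * len(prefix) + target}

vocab = {}
example = build_training_example(
    "a bouncing red ball",
    ["<lottie>", "<layer_4>", "<prop_p>", "<keyframe>", "<num_0>", "</lottie>"],
    vocab,
)
print(len(example["input_ids"]), example["labels"][:6])
```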

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces LottieGPT as the first framework for tokenizing and autoregressively generating vector animations using the Lottie JSON standard. It proposes a custom Lottie Tokenizer to encode layered geometric primitives, transforms, and keyframe-based motion into compact, semantically aligned token sequences; curates the LottieAnimation-660K dataset (660k real-world animations plus 15M static Lottie images); and fine-tunes Qwen-VL to enable generation of coherent, editable animations from natural language or visual prompts. The work claims that the tokenizer reduces sequence length while preserving structural fidelity, supports effective autoregressive learning, exhibits strong generalization across styles, and outperforms prior SOTA on SVG generation (treated as single-frame vector animation).

Significance. If the tokenizer's fidelity claims hold and the reported outperformance is substantiated, the work would mark a meaningful step toward native vector animation generation with LLMs. This could advance beyond raster video models by enabling resolution-independent, compact, and parametrically editable outputs, building on trends in structured data generation (e.g., meshes, layouts) and offering practical value for web multimedia and design applications.

major comments (2)
  1. [Abstract] The central claims that 'experiments show that our tokenizer dramatically reduces sequence length while preserving structural fidelity, enabling effective autoregressive learning of dynamic vector content' and that 'LottieGPT ... outperforms previous state-of-the-art models on SVG generation' are unsupported by any referenced quantitative metrics, baselines, tables, reconstruction errors, or ablation results. This directly affects evaluation of whether the tokenizer preserves all necessary motion and structural information for coherent, editable output.
  2. [Lottie Tokenizer and Experiments] No reconstruction metrics, information-loss bounds, timing-accuracy measures, layer-ordering fidelity scores, or ablations on parametric motion recovery are provided to validate that the tokenizer encodes 'layered geometric primitives, transforms, and keyframe-based motion' without critical loss. If keyframe timing or transform data is discarded, the fine-tuned Qwen-VL model cannot reliably produce valid Lottie outputs, undermining the generalization and editability assertions. (A sketch of the kind of round-trip metrics in question follows this list.)
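
To illustrate the kind of round-trip evidence the referee is asking for, here is a hedged sketch: tokenize an animation, detokenize it, and score the reconstruction for keyframe-timing error, dropped keyframes, and layer ordering. The metric choices and function names are editorial assumptions, not measurements from the paper.

```python
def flatten_keyframes(doc):
    """Yield ((layer_index, property, keyframe_index), time, value) for every keyframe in a Lottie dict."""
    for li, layer in enumerate(doc.get("layers", [])):
        for prop_name, prop in layer.get("ks", {}).items():
            k = prop.get("k") if isinstance(prop, dict) else None
            if isinstance(k, list) and k and isinstance(k[0], dict):
                for ki, kf in enumerate(k):
                    yield (li, prop_name, ki), kf.get("t", 0.0), kf.get("s", [])

def reconstruction_metrics(original, roundtripped):
    """Compare an animation with its tokenize-then-detokenize reconstruction (both parsed Lottie dicts)."""
    orig_kf = {key: t for key, t, _ in flatten_keyframes(original)}
    rec_kf = {key: t for key, t, _ in flatten_keyframes(roundtripped)}
    shared = orig_kf.keys() & rec_kf.keys()
    timing_mae = (sum(abs(orig_kf[k] - rec_kf[k]) for k in shared) / len(shared)) if shared else float("nan")
    missing_rate = 1.0 - len(shared) / max(len(orig_kf), 1)
    orig_order = [l.get("nm") for l in original.get("layers", [])]
    rec_order = [l.get("nm") for l in roundtripped.get("layers", [])]
    order_accuracy = sum(a == b for a, b in zip(orig_order, rec_order)) / max(len(orig_order), 1)
    return {"keyframe_timing_mae": timing_mae,
            "keyframe_missing_rate": missing_rate,
            "layer_order_accuracy": order_accuracy}
```

Aggregated over a held-out set, numbers like these would directly support or undercut the fidelity and editability claims.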

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for stronger quantitative validation of the Lottie Tokenizer. We agree that the current presentation of results can be improved for clarity and completeness. We will revise the manuscript to address these points directly.

read point-by-point responses
  1. Referee: [Abstract] The central claims that 'experiments show that our tokenizer dramatically reduces sequence length while preserving structural fidelity, enabling effective autoregressive learning of dynamic vector content' and that 'LottieGPT ... outperforms previous state-of-the-art models on SVG generation' are unsupported by any referenced quantitative metrics, baselines, tables, reconstruction errors, or ablation results. This directly affects evaluation of whether the tokenizer preserves all necessary motion and structural information for coherent, editable output.

    Authors: We acknowledge that the abstract would be strengthened by explicit references to quantitative results. The Experiments section (Section 4) presents sequence length comparisons and SVG generation benchmarks against baselines, but we agree these should be cited directly in the abstract. In the revision, we will update the abstract to reference specific metrics (e.g., average sequence length reduction and generation quality scores from Tables 1 and 2) and include a brief mention of reconstruction fidelity measures. revision: yes

  2. Referee: [Lottie Tokenizer and Experiments] No reconstruction metrics, information-loss bounds, timing-accuracy measures, layer-ordering fidelity scores, or ablations on parametric motion recovery are provided to validate that the tokenizer encodes 'layered geometric primitives, transforms, and keyframe-based motion' without critical loss. If keyframe timing or transform data is discarded, the fine-tuned Qwen-VL model cannot reliably produce valid Lottie outputs, undermining the generalization and editability assertions.

    Authors: We agree that detailed quantitative validation of the tokenizer is essential. While the manuscript includes qualitative demonstrations of reconstruction and some aggregate metrics in the Experiments section, we recognize the absence of specific ablations on timing accuracy, layer-ordering fidelity, and parametric motion recovery. In the revised version, we will add a dedicated subsection with these metrics, including reconstruction errors for keyframe timing and transforms, layer-ordering scores, and ablations showing the impact of motion parameter encoding. This will substantiate the claims regarding fidelity and editability. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on novel tokenizer design and external fine-tuning

full rationale

The paper's core chain consists of designing a new Lottie Tokenizer, curating an external LottieAnimation-660K dataset from internet sources, and fine-tuning the independent Qwen-VL model. No equation, self-citation, or claim reduces tokenizer fidelity, sequence reduction, or generation performance to fitted parameters or definitions that hold by construction. The abstract and claims treat the tokenizer as an explicit engineering contribution whose information preservation is asserted empirically rather than defined into existence, and evaluation is carried out against external benchmarks rather than quantities the method defines for itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim depends on the tokenizer successfully compressing Lottie structure into tokens that an autoregressive model can learn from, plus the assumption that the curated internet-sourced dataset captures sufficient diversity for generalization.

axioms (1)
  • domain assumption Lottie JSON structure can be tokenized into a compact sequence that preserves layered geometry, transforms, and keyframe motion for autoregressive modeling
    Invoked in the design of the Lottie Tokenizer and the claim that it enables effective learning.
invented entities (2)
  • Lottie Tokenizer no independent evidence
    purpose: Encodes Lottie animations into semantically aligned token sequences
    New component introduced to bridge vector animation to language model input format.
  • LottieAnimation-660K dataset no independent evidence
    purpose: Provides large-scale training data of real-world Lottie animations and static images
    Newly curated collection claimed to be the largest and most diverse for this task.

pith-pipeline@v0.9.0 · 5614 in / 1280 out tokens · 46410 ms · 2026-05-10T15:21:21.144193+00:00 · methodology

discussion (0)

