LottieGPT: Tokenizing Vector Animation for Autoregressive Generation
Pith reviewed 2026-05-10 15:21 UTC · model grok-4.3
The pith
LottieGPT tokenizes Lottie vector animations to support autoregressive generation from language and visual prompts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce LottieGPT, the first framework to tokenize and autoregressively generate vector animations. By designing a Lottie Tokenizer that encodes layered geometric primitives, transforms, and keyframe-based motion into compact token sequences, and curating the LottieAnimation-660K dataset, we finetune Qwen-VL to produce coherent, editable vector animations directly from natural language or visual prompts, with strong generalization across styles and superior performance on SVG generation.
What carries the argument
The Lottie Tokenizer that encodes layered geometric primitives, transforms, and keyframe-based motion into a compact and semantically aligned token sequence.
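The paper does not publish its token vocabulary, but the idea can be illustrated with a minimal sketch. The token names (`LAYER`, `PROP`, `KF`), the quantization range, and the bin count below are all illustrative assumptions, not the paper's actual tokenizer design:

```python
# Hypothetical sketch of flattening a Lottie-style animation into a compact
# token sequence. LAYER/PROP/KF markers and the quantization scheme are
# assumptions for illustration, not the paper's scheme.

def quantize(v, lo=-512.0, hi=512.0, bins=1024):
    """Clamp a float to [lo, hi] and map it to one of `bins` integer buckets."""
    v = max(lo, min(hi, v))
    return int((v - lo) / (hi - lo) * (bins - 1))

def tokenize(animation):
    """Walk layers -> animated properties -> keyframes, emitting one token each."""
    tokens = []
    for layer in animation["layers"]:
        tokens.append(("LAYER", layer["ty"]))
        for prop_name, prop in layer["ks"].items():
            tokens.append(("PROP", prop_name))
            for kf in prop["k"]:
                # One compact token per keyframe instead of a verbose JSON object.
                tokens.append(("KF", kf["t"], *[quantize(x) for x in kf["s"]]))
    return tokens

anim = {
    "layers": [{
        "ty": "shape",
        "ks": {
            "p": {"k": [{"t": 0, "s": [0.0, 0.0]},
                        {"t": 30, "s": [100.0, 50.0]}]},
        },
    }]
}

toks = tokenize(anim)
```

Here a two-keyframe position animation collapses to four tokens, which is the kind of sequence-length reduction the abstract claims, at the cost of quantization error that the fidelity experiments would need to bound.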
If this is right
- Reduces sequence length while preserving structural fidelity for autoregressive training.
- Produces coherent, editable vector animations from text or image prompts.
- Generalizes well to diverse animation styles.
- Outperforms previous models on SVG generation.
Where Pith is reading between the lines
- Users could generate base animations and then tweak parameters in standard Lottie editors without starting from scratch.
- The method opens possibilities for AI-assisted creation of interactive web content and mobile app animations.
- Extending the tokenizer to handle more complex interactions or longer durations could be tested in follow-up work.
Load-bearing premise
That encoding animations into tokens via the Lottie Tokenizer retains enough information for the model to generate animations that remain structurally accurate, motion-coherent, and fully editable.
What would settle it
Running the generated animations in a standard Lottie player and checking for missing layers, incorrect motion paths, or playback failures compared to the original files.
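A cheap proxy for part of that check, assuming both files are plain Lottie JSON with standard `layers`/`ks` fields, is a structural diff between the original and the generated or round-tripped file. A real test would additionally render both in a Lottie player and compare frames:

```python
# Hypothetical structural comparison between an original Lottie file and a
# generated or round-tripped one. Catches missing layers and dropped animated
# properties; it does not catch wrong motion paths, which need rendering.

def structure_summary(lottie):
    """Return (layer type, number of keyframed properties) per layer."""
    summary = []
    for layer in lottie.get("layers", []):
        kf_props = sum(
            1 for prop in layer.get("ks", {}).values()
            if isinstance(prop.get("k"), list)  # animated properties hold keyframe lists
        )
        summary.append((layer.get("ty"), kf_props))
    return summary

def structurally_equivalent(a, b):
    return structure_summary(a) == structure_summary(b)

original = {"layers": [{"ty": 4, "ks": {"p": {"k": [{"t": 0}]}}}]}
generated = {"layers": [{"ty": 4, "ks": {"p": {"k": [{"t": 0}, {"t": 30}]}}}]}
# Same layer/property structure even though the keyframe counts differ.
```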
Original abstract
Despite rapid progress in video generation, existing models are incapable of producing vector animation, a dominant and highly expressive form of multimedia on the Internet. Vector animations offer resolution-independence, compactness, semantic structure, and editable parametric motion representations, yet current generative models operate exclusively in raster space and thus cannot synthesize them. Meanwhile, recent advances in large multimodal models demonstrate strong capabilities in generating structured data such as slides, 3D meshes, LEGO sequences, and indoor layouts, suggesting that native vector animation generation may be achievable. In this work, we present the first framework for tokenizing and autoregressively generating vector animations. We adopt Lottie, a widely deployed JSON-based animation standard, and design a tailored Lottie Tokenizer that encodes layered geometric primitives, transforms, and keyframe-based motion into a compact and semantically aligned token sequence. To support large-scale training, we also construct LottieAnimation-660K, the largest and most diverse vector animation dataset to date, consisting of 660k real-world Lottie animations and 15M static Lottie image files curated from broad Internet sources. Building upon these components, we finetune Qwen-VL to create LottieGPT, a native multimodal model capable of generating coherent, editable vector animations directly from natural language or visual prompts. Experiments show that our tokenizer dramatically reduces sequence length while preserving structural fidelity, enabling effective autoregressive learning of dynamic vector content. LottieGPT exhibits strong generalization across diverse animation styles and outperforms previous state-of-the-art models on SVG generation (a special case of single-frame vector animation).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces LottieGPT as the first framework for tokenizing and autoregressively generating vector animations using the Lottie JSON standard. It proposes a custom Lottie Tokenizer to encode layered geometric primitives, transforms, and keyframe-based motion into compact, semantically aligned token sequences; curates the LottieAnimation-660K dataset (660k real-world animations plus 15M static Lottie images); and fine-tunes Qwen-VL to enable generation of coherent, editable animations from natural language or visual prompts. The work claims that the tokenizer reduces sequence length while preserving structural fidelity, supports effective autoregressive learning, exhibits strong generalization across styles, and outperforms prior SOTA on SVG generation (treated as single-frame vector animation).
Significance. If the tokenizer's fidelity claims hold and the reported outperformance is substantiated, the work would mark a meaningful step toward native vector animation generation with LLMs. This could advance beyond raster video models by enabling resolution-independent, compact, and parametrically editable outputs, building on trends in structured data generation (e.g., meshes, layouts) and offering practical value for web multimedia and design applications.
Major comments (2)
- Abstract: The central claims, that 'experiments show that our tokenizer dramatically reduces sequence length while preserving structural fidelity, enabling effective autoregressive learning of dynamic vector content' and that 'LottieGPT ... outperforms previous state-of-the-art models on SVG generation', are not supported by any referenced quantitative metrics, baselines, tables, reconstruction errors, or ablation results. This directly affects evaluation of whether the tokenizer preserves all the motion and structural information needed for coherent, editable output.
- Lottie Tokenizer and Experiments sections: No reconstruction metrics, information-loss bounds, timing-accuracy measures, layer-ordering fidelity scores, or ablations on parametric motion recovery are provided to validate that the tokenizer encodes 'layered geometric primitives, transforms, and keyframe-based motion' without critical loss. If keyframe timing or transform data is discarded, the fine-tuned Qwen-VL model cannot reliably produce valid Lottie outputs, undermining the generalization and editability claims.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback highlighting the need for stronger quantitative validation of the Lottie Tokenizer. We agree that the current presentation of results can be improved for clarity and completeness. We will revise the manuscript to address these points directly.
Point-by-point responses
-
Referee: Abstract: The central claims, that 'experiments show that our tokenizer dramatically reduces sequence length while preserving structural fidelity, enabling effective autoregressive learning of dynamic vector content' and that 'LottieGPT ... outperforms previous state-of-the-art models on SVG generation', are not supported by any referenced quantitative metrics, baselines, tables, reconstruction errors, or ablation results. This directly affects evaluation of whether the tokenizer preserves all the motion and structural information needed for coherent, editable output.
Authors: We acknowledge that the abstract would be strengthened by explicit references to quantitative results. The Experiments section (Section 4) presents sequence length comparisons and SVG generation benchmarks against baselines, but we agree these should be cited directly in the abstract. In the revision, we will update the abstract to reference specific metrics (e.g., average sequence length reduction and generation quality scores from Tables 1 and 2) and include a brief mention of reconstruction fidelity measures. revision: yes
-
Referee: Lottie Tokenizer and Experiments sections: No reconstruction metrics, information-loss bounds, timing-accuracy measures, layer-ordering fidelity scores, or ablations on parametric motion recovery are provided to validate that the tokenizer encodes 'layered geometric primitives, transforms, and keyframe-based motion' without critical loss. If keyframe timing or transform data is discarded, the fine-tuned Qwen-VL model cannot reliably produce valid Lottie outputs, undermining the generalization and editability claims.
Authors: We agree that detailed quantitative validation of the tokenizer is essential. While the manuscript includes qualitative demonstrations of reconstruction and some aggregate metrics in the Experiments section, we recognize the absence of specific ablations on timing accuracy, layer-ordering fidelity, and parametric motion recovery. In the revised version, we will add a dedicated subsection with these metrics, including reconstruction errors for keyframe timing and transforms, layer-ordering scores, and ablations showing the impact of motion parameter encoding. This will substantiate the claims regarding fidelity and editability. revision: yes
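One of the promised metrics, keyframe-timing reconstruction error, could be as simple as a mean absolute frame offset between matched keyframes. A minimal sketch, assuming keyframes are matched by index and that a count mismatch counts as outright failure:

```python
# Illustrative reconstruction metric (not taken from the paper): mean absolute
# timing error, in frames, between original and reconstructed keyframes.

def timing_mae(original_kfs, reconstructed_kfs):
    """Mean |t_orig - t_recon| over index-matched keyframes."""
    if len(original_kfs) != len(reconstructed_kfs):
        raise ValueError("keyframe count mismatch counts as reconstruction failure")
    errors = [abs(a["t"] - b["t"]) for a, b in zip(original_kfs, reconstructed_kfs)]
    return sum(errors) / len(errors)

orig = [{"t": 0}, {"t": 15}, {"t": 30}]
recon = [{"t": 0}, {"t": 16}, {"t": 29}]
# average |dt| over three keyframes: (0 + 1 + 1) / 3
```

Analogous per-property errors for transforms, plus a rank-correlation score for layer ordering, would cover the remaining metrics the rebuttal commits to.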
Circularity Check
No significant circularity; derivation relies on novel tokenizer design and external fine-tuning
Full rationale
The paper's core chain consists of designing a new Lottie Tokenizer, curating the external LottieAnimation-660K dataset from Internet sources, and fine-tuning the independent Qwen-VL model. No equations, self-citations, or claims reduce tokenizer fidelity, sequence-length reduction, or generation performance to fitted parameters or definitions that are true by construction. The abstract treats the tokenizer as an explicit engineering contribution whose information preservation is asserted empirically rather than defined into existence, and evaluation is conducted against external benchmarks.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Lottie JSON structure can be tokenized into a compact sequence that preserves layered geometry, transforms, and keyframe motion for autoregressive modeling.
Invented entities (2)
- Lottie Tokenizer: no independent evidence
- LottieAnimation-660K dataset: no independent evidence