Beyond Patches: Global-aware Autoregressive Model for Multimodal Few-Shot Font Generation
Pith reviewed 2026-05-21 16:52 UTC · model grok-4.3
The pith
A global-aware autoregressive model generates coherent fonts from few visual examples plus text style descriptions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GAR-Font is an autoregressive framework for multimodal few-shot font generation built around a global-aware tokenizer that jointly encodes local glyph structures and global stylistic patterns, a multimodal style encoder that uses a lightweight language-style adapter for flexible textual control without heavy pretraining, and a final post-refinement pipeline that improves fidelity and coherence.
What carries the argument
The global-aware tokenizer, which replaces conventional patch-level tokenization so the autoregressive model can attend to both local structures and global style dependencies across the entire font.
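To make the contrast with patch-level tokenization concrete, here is a minimal sketch. It is not the paper's tokenizer. It splits a tiny binary glyph into patch tokens and then prepends one whole-glyph summary token. The function names `patch_tokens` and `with_global_token`, and the choice of mean ink density as the global feature, are assumptions made purely for illustration.

```python
def patch_tokens(glyph, patch):
    """Split a 2D glyph (rows of 0/1) into flattened patch tokens, row-major."""
    height, width = len(glyph), len(glyph[0])
    assert height % patch == 0 and width % patch == 0, "glyph must tile evenly"
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append(tuple(
                glyph[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            ))
    return tokens


def with_global_token(glyph, patch):
    """Prepend one summary token computed over the whole glyph.

    Assumption: mean ink density stands in for a learned global feature.
    """
    height, width = len(glyph), len(glyph[0])
    density = sum(sum(row) for row in glyph) / (height * width)
    return [("global", density)] + patch_tokens(glyph, patch)


glyph = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
local = patch_tokens(glyph, 2)        # each token sees only its own 2x2 patch
mixed = with_global_token(glyph, 2)   # first token summarizes the entire glyph
```

Each patch token in `local` carries no information about the rest of the glyph, whereas the first entry of `mixed` depends on every pixel. A learned tokenizer would replace the density statistic with trained features.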
If this is right
- Generated fonts maintain higher global style faithfulness than prior few-shot methods.
- Textual style descriptions can steer the output through the lightweight language-style adapter, without intensive multimodal pretraining.
- The post-refinement pipeline further reduces structural errors that autoregressive generation alone leaves behind.
- The framework works with limited visual references while still producing a coherent glyph set.
Where Pith is reading between the lines
- The same global-tokenization idea could be tested on other tasks that require long-range visual consistency, such as icon or logo sets.
- Adding more language modalities or script families would test whether the lightweight adapter generalizes beyond the evaluated styles.
- If the tokenizer truly encodes global patterns, removing it should produce measurable style drift even when local patches are accurate.
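The style-drift test suggested in the last point could be scored along the following lines. This is a hedged sketch, not a metric from the paper: it assumes each generated glyph has already been mapped to a style embedding by some encoder, and it measures drift as the mean cosine distance of those embeddings from their centroid. The function names and the toy vectors are hypothetical.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def style_drift(embeddings):
    """Mean cosine distance of per-glyph style embeddings from their centroid."""
    count = len(embeddings)
    dim = len(embeddings[0])
    centroid = [sum(e[i] for e in embeddings) / count for i in range(dim)]
    return sum(cosine_distance(e, centroid) for e in embeddings) / count


# Toy embeddings (hypothetical): one font with a consistent style, one that drifts.
coherent = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1]]
drifting = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]

coherent_score = style_drift(coherent)
drifting_score = style_drift(drifting)
```

In an ablation, the same score would be computed for glyph sets generated with and without the global-aware tokenizer. A higher score for the ablated model would support the claim that the tokenizer carries global style.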
Load-bearing premise
The global-aware tokenizer successfully captures both local glyph structures and global stylistic patterns while the lightweight language-style adapter supplies flexible control without needing intensive multimodal pretraining.
What would settle it
A human or automatic evaluation on held-out reference fonts in which GAR-Font outputs show measurably lower global style consistency or structural integrity than strong patch-based baselines.
Original abstract
Manual font design is an intricate process that transforms a stylistic visual concept into a coherent glyph set. This challenge persists in automated Few-shot Font Generation (FFG), where models often struggle to preserve both the structural integrity and stylistic fidelity from limited references. While autoregressive (AR) models have demonstrated impressive generative capabilities, their application to FFG is constrained by conventional patch-level tokenization, which neglects global dependencies crucial for coherent font synthesis. Moreover, existing FFG methods remain within the image-to-image paradigm, relying solely on visual references and overlooking the role of language in conveying stylistic intent during font design. To address these limitations, we propose GAR-Font, a novel AR framework for multimodal few-shot font generation. GAR-Font introduces a global-aware tokenizer that effectively captures both local structures and global stylistic patterns, a multimodal style encoder offering flexible style control through a lightweight language-style adapter without requiring intensive multimodal pretraining, and a post-refinement pipeline that further enhances structural fidelity and style coherence. Extensive experiments show that GAR-Font outperforms existing FFG methods, excelling in maintaining global style faithfulness and achieving higher-quality results with textual stylistic guidance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to introduce GAR-Font, a global-aware autoregressive model for multimodal few-shot font generation. It addresses limitations in existing FFG methods by proposing a global-aware tokenizer that captures local structures and global stylistic patterns, a multimodal style encoder with a lightweight language-style adapter for flexible style control via textual guidance without intensive pretraining, and a post-refinement pipeline to enhance fidelity. Extensive experiments are said to demonstrate outperformance over existing methods in global style faithfulness and quality with textual stylistic guidance.
Significance. If the results hold, this work could significantly impact the field of few-shot font generation by enabling multimodal control and better global coherence in autoregressive models. The lightweight adapter without heavy pretraining is a notable strength for practical deployment. The authors deserve credit for extending AR models beyond patch-level tokenization in this domain.
major comments (2)
- Abstract: The central claims of outperformance and superior global style faithfulness are presented without any quantitative metrics, baseline comparisons, ablation studies, or details on the experimental protocol. This is a load-bearing issue for assessing the validity of the proposed framework's advantages.
- §3.2 (Multimodal Style Encoder): The lightweight language-style adapter is described as providing flexible control without intensive multimodal pretraining. However, the manuscript does not detail the mechanism ensuring alignment between language embeddings and visual glyph features, raising the possibility that any observed improvements may stem primarily from the global-aware tokenizer or post-refinement rather than the multimodal component.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and for recognizing the potential impact of GAR-Font. We address each major comment below and have revised the manuscript accordingly to improve clarity and support for our claims.
- Referee: Abstract: The central claims of outperformance and superior global style faithfulness are presented without any quantitative metrics, baseline comparisons, ablation studies, or details on the experimental protocol. This is a load-bearing issue for assessing the validity of the proposed framework's advantages.
  Authors: We agree that the abstract would be strengthened by including key quantitative results. In the revised manuscript we will update the abstract to report specific metrics (e.g., relative improvements in FID and style-consistency scores versus the strongest baselines) and a concise statement of the evaluation protocol, while preserving its high-level character.
  Revision: yes
- Referee: §3.2 (Multimodal Style Encoder): The lightweight language-style adapter is described as providing flexible control without intensive multimodal pretraining. However, the manuscript does not detail the mechanism ensuring alignment between language embeddings and visual glyph features, raising the possibility that any observed improvements may stem primarily from the global-aware tokenizer or post-refinement rather than the multimodal component.
  Authors: We thank the referee for this observation. The language-style adapter aligns embeddings via learned projection layers and cross-attention trained jointly on paired text-image data; however, we acknowledge that the current description is insufficiently explicit. We will expand §3.2 with a precise account of the alignment procedure, the associated loss terms, and new ablation results that isolate the multimodal adapter’s contribution from the tokenizer and refinement stages.
  Revision: yes
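The rebuttal describes the adapter only as "learned projection layers and cross-attention", so the following is a sketch under stated assumptions rather than the authors' design. It assumes text embeddings are first projected into the visual style dimension, that visual style tokens act as queries over the projected text tokens, and that the attended result is added back residually. All names (`adapt`, `W_proj`, and so on) and the single-head, bias-free form are hypothetical.

```python
import math


def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys, values):
    """Single-head scaled dot-product attention of queries over keys/values."""
    scale = math.sqrt(len(keys[0]))
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        attended.append([
            sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))
        ])
    return attended


def adapt(text_tokens, style_tokens, W_proj):
    """Assumed adapter: project text, cross-attend, add residually to style."""
    projected = [matvec(W_proj, t) for t in text_tokens]
    attended = cross_attend(style_tokens, projected, projected)
    return [[s + a for s, a in zip(st, at)]
            for st, at in zip(style_tokens, attended)]


# Toy example: one 3-d text token, one 2-d visual style token.
W_proj = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0]]
fused = adapt([[1.0, 0.0, 0.0]], [[0.5, 0.5]], W_proj)
```

With a single text token the attention weight is exactly 1, so the output is simply the style token plus the projected text token. An ablation that zeroes the attended term would be one way to isolate the adapter's contribution, which is what the referee asks for.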
Circularity Check
No significant circularity in GAR-Font architectural claims
The paper introduces GAR-Font as a new autoregressive framework consisting of a global-aware tokenizer, a multimodal style encoder with lightweight language-style adapter, and a post-refinement pipeline. These components are presented as novel architectural contributions for multimodal few-shot font generation, with performance claims supported by experiments rather than any closed-form derivation or parameter fitting that reduces to prior inputs by construction. No equations, self-definitional loops, fitted predictions, or load-bearing self-citations that would make the central results equivalent to their own definitions are present in the abstract or described framework. The derivation chain is self-contained as an empirical model proposal.
Axiom & Free-Parameter Ledger
invented entities (2)
- global-aware tokenizer: no independent evidence
- lightweight language-style adapter: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  "global-aware tokenizer (G-Tok) that fuses local features with global perception... hybrid CNN–ViT encoder"
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  "lightweight language-style adapter... aligns textual descriptions with visual style embeddings"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.