Beyond Patches: Global-aware Autoregressive Model for Multimodal Few-Shot Font Generation

Haonan Cai; Yuxuan Luo; Zhouhui Lian

arxiv: 2601.01593 · v2 · pith:LZN4MEDOnew · submitted 2026-01-04 · 💻 cs.CV · cs.MM

Pith reviewed 2026-05-21 16:52 UTC · model grok-4.3

classification 💻 cs.CV cs.MM

keywords: few-shot font generation · autoregressive model · global-aware tokenizer · multimodal font synthesis · style transfer · glyph generation · text-guided generation

The pith

A global-aware autoregressive model generates coherent fonts from few visual examples plus text style descriptions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to show that conventional patch-based tokenization in autoregressive models misses the global stylistic dependencies needed for consistent font synthesis from limited references. It proposes that adding a global-aware tokenizer plus a lightweight language-style adapter lets the model preserve both local glyph structure and overall style while accepting textual guidance. A sympathetic reader would care because font design requires turning a stylistic idea into a full set of matching glyphs, and current automated methods often produce inconsistent or low-fidelity results when given only a handful of examples. The work also adds a post-refinement step to further tighten structural and stylistic coherence.

Core claim

GAR-Font is an autoregressive framework for multimodal few-shot font generation built around a global-aware tokenizer that jointly encodes local glyph structures and global stylistic patterns, a multimodal style encoder that uses a lightweight language-style adapter for flexible textual control without heavy pretraining, and a final post-refinement pipeline that improves fidelity and coherence.
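
As a reading aid, the autoregressive part of this claim can be written in the standard conditional form. The notation is an assumption introduced here, since the abstract gives none: $z_1,\dots,z_T$ are the tokens the tokenizer produces for a target glyph, $c$ identifies the character to draw, and $s$ is the style embedding from the multimodal style encoder.

```latex
p(z_1,\dots,z_T \mid c, s) \;=\; \prod_{t=1}^{T} p\left(z_t \mid z_{<t},\, c,\, s\right)
```

Under this sketch, the tokenizer determines what each $z_t$ can represent, which is why the choice between patch-level and global-aware tokens matters for the whole sequence.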

What carries the argument

The global-aware tokenizer, which replaces conventional patch-level tokenization so the autoregressive model can attend to both local structures and global style dependencies across the entire font.
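
The difference between patch-level and global-aware tokens can be illustrated with a toy sketch. This is not the paper's G-Tok: the function names, the use of a patch mean as the "local feature", and the use of the image mean as the "global summary" are all assumptions chosen to keep the example small.

```python
from statistics import mean

def patch_tokens(image, patch):
    """Split a square 2D grid into non-overlapping patches; each token is the patch mean."""
    n = len(image)
    tokens = []
    for r in range(0, n, patch):
        for c in range(0, n, patch):
            values = [image[i][j] for i in range(r, r + patch) for j in range(c, c + patch)]
            tokens.append(mean(values))
    return tokens

def global_aware_tokens(image, patch):
    """Pair each local patch token with a summary of the whole image (assumed: the mean)."""
    local = patch_tokens(image, patch)
    global_summary = mean(local)
    return [(token, global_summary) for token in local]

# Two hypothetical 4x4 "glyphs" with an identical top-left patch but different ink density.
thin = [[1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
bold = [[1, 0, 1, 1], [0, 0, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1]]

# Patch-level view: the first tokens are indistinguishable.
print(patch_tokens(thin, 2)[0] == patch_tokens(bold, 2)[0])                # True
# Global-aware view: the same patch now carries different style context.
print(global_aware_tokens(thin, 2)[0] == global_aware_tokens(bold, 2)[0])  # False
```

The point of the toy is only that a purely local token cannot tell a thin font from a bold one when the patch contents coincide, whereas a token that also carries global context can.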

If this is right

Generated fonts maintain higher global style faithfulness than prior few-shot methods.
Textual stylistic descriptions can steer the output style through the lightweight adapter, without intensive multimodal pretraining.
The post-refinement pipeline further reduces structural errors that autoregressive generation alone leaves behind.
The framework works with limited visual references while still producing a coherent glyph set.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same global-tokenization idea could be tested on other tasks that require long-range visual consistency, such as icon or logo sets.
Adding more language modalities or script families would test whether the lightweight adapter generalizes beyond the evaluated styles.
If the tokenizer truly encodes global patterns, removing it should produce measurable style drift even when local patches are accurate.
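
One way the style-drift test in the last item could be made concrete is sketched below. The metric is hypothetical and does not come from the paper: it assumes each generated glyph has already been mapped to a style vector by some feature extractor (not specified here) and reports the mean distance of those vectors from their centroid.

```python
from math import sqrt

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]

def style_drift(style_vectors):
    """Mean Euclidean distance of each glyph's style vector from the set centroid."""
    c = centroid(style_vectors)
    distances = [sqrt(sum((a - b) ** 2 for a, b in zip(v, c))) for v in style_vectors]
    return sum(distances) / len(distances)

# Hypothetical 2-d style vectors for three glyphs of one generated font.
consistent = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
drifting = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

print(style_drift(consistent))  # 0.0 when every glyph shares the same style vector
print(style_drift(drifting) > style_drift(consistent))
```

In an ablation, the same score would be computed for fonts generated with and without the global-aware tokenizer; a higher score without it would be consistent with the drift the item predicts.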

Load-bearing premise

The global-aware tokenizer successfully captures both local glyph structures and global stylistic patterns while the lightweight language-style adapter supplies flexible control without needing intensive multimodal pretraining.

What would settle it

Human or automatic evaluation on a held-out set of reference fonts where GAR-Font outputs show measurable drops in global style consistency or structural integrity compared with strong patch-based baselines.

Figures

Figures reproduced from arXiv: 2601.01593 by Haonan Cai, Yuxuan Luo, Zhouhui Lian.

**Figure 2.** The overall architecture of GAR-Font. It comprises a … [PITH_FULL_IMAGE:figures/full_fig_p002_2.png]

**Figure 3.** (a) Overview of the G-Tok architecture, which adopts a hybrid CNN–ViT design. (b) Details of the global ViT encoder and causal … [PITH_FULL_IMAGE:figures/full_fig_p003_3.png]

**Figure 4.** GAR-Font adopts a two-stage training: (a) Visual Pre… [PITH_FULL_IMAGE:figures/full_fig_p004_4.png]

**Figure 5.** Qualitative results on vision-only FFG (UFSC, … [PITH_FULL_IMAGE:figures/full_fig_p006_5.png]

**Figure 6.** Qualitative results on multimodal FFG (UFSC, … [PITH_FULL_IMAGE:figures/full_fig_p008_6.png]

Original abstract

Manual font design is an intricate process that transforms a stylistic visual concept into a coherent glyph set. This challenge persists in automated Few-shot Font Generation (FFG), where models often struggle to preserve both the structural integrity and stylistic fidelity from limited references. While autoregressive (AR) models have demonstrated impressive generative capabilities, their application to FFG is constrained by conventional patch-level tokenization, which neglects global dependencies crucial for coherent font synthesis. Moreover, existing FFG methods remain within the image-to-image paradigm, relying solely on visual references and overlooking the role of language in conveying stylistic intent during font design. To address these limitations, we propose GAR-Font, a novel AR framework for multimodal few-shot font generation. GAR-Font introduces a global-aware tokenizer that effectively captures both local structures and global stylistic patterns, a multimodal style encoder offering flexible style control through a lightweight language-style adapter without requiring intensive multimodal pretraining, and a post-refinement pipeline that further enhances structural fidelity and style coherence. Extensive experiments show that GAR-Font outperforms existing FFG methods, excelling in maintaining global style faithfulness and achieving higher-quality results with textual stylistic guidance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

GAR-Font adds global tokenization and a lightweight text adapter to autoregressive few-shot font generation, but the abstract shows no metrics to back the performance claims.

Here's the quick take on GAR-Font from arXiv:2601.01593. The paper moves autoregressive generation for fonts beyond the usual patch tokenization. It adds a global-aware tokenizer to pick up both local glyph details and overall style patterns. On top of that, it brings in a multimodal style encoder with a lightweight language-style adapter so you can guide the output with text descriptions of the style, without needing heavy pretraining on multimodal data. A post-refinement pipeline cleans up the results for better structure and coherence.

What stands out is how it tries to handle the global dependencies that patch methods miss and opens the door to language input in font design, which prior image-to-image FFG approaches skip. If the experiments hold, this could make few-shot generation more practical for cases where you have a style description in words along with a few reference images. The abstract claims better performance and higher global faithfulness, especially with the text guidance. That sounds promising for digital typography work.

On the downside, the abstract itself has no quantitative results, no baseline numbers, and no ablation details. The soundness looks thin until you see the actual tables and protocols. The lightweight adapter is the key for the multimodal claim, but if it doesn't properly align the language embeddings with the visual features, the text guidance might not deliver the promised lift over visual-only references. That part needs close checking in the full experiments.

This work is aimed at researchers in computer vision who focus on generative models for creative applications like font design. Someone already familiar with autoregressive models or few-shot generation in vision would get the most out of it and could test the new components in their own setups. I think it deserves a serious referee. The ideas are specific enough and the problem is real, so peer review can sort out whether the gains are solid.

Referee Report

2 major / 0 minor

Summary. The paper claims to introduce GAR-Font, a global-aware autoregressive model for multimodal few-shot font generation. It addresses limitations in existing FFG methods by proposing a global-aware tokenizer that captures local structures and global stylistic patterns, a multimodal style encoder with a lightweight language-style adapter for flexible style control via textual guidance without intensive pretraining, and a post-refinement pipeline to enhance fidelity. Extensive experiments are said to demonstrate outperformance over existing methods in global style faithfulness and quality with textual stylistic guidance.

Significance. If the results hold, this work could significantly impact the field of few-shot font generation by enabling multimodal control and better global coherence in autoregressive models. The lightweight adapter without heavy pretraining is a notable strength for practical deployment. The authors deserve credit for extending AR models beyond patch-level tokenization in this domain.

major comments (2)

Abstract: The central claims of outperformance and superior global style faithfulness are presented without any quantitative metrics, baseline comparisons, ablation studies, or details on the experimental protocol. This is a load-bearing issue for assessing the validity of the proposed framework's advantages.
§3.2 (Multimodal Style Encoder): The lightweight language-style adapter is described as providing flexible control without intensive multimodal pretraining. However, the manuscript does not detail the mechanism ensuring alignment between language embeddings and visual glyph features, raising the possibility that any observed improvements may stem primarily from the global-aware tokenizer or post-refinement rather than the multimodal component.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the potential impact of GAR-Font. We address each major comment below and have revised the manuscript accordingly to improve clarity and support for our claims.

Referee: Abstract: The central claims of outperformance and superior global style faithfulness are presented without any quantitative metrics, baseline comparisons, ablation studies, or details on the experimental protocol. This is a load-bearing issue for assessing the validity of the proposed framework's advantages.

Authors: We agree that the abstract would be strengthened by including key quantitative results. In the revised manuscript we will update the abstract to report specific metrics (e.g., relative improvements in FID and style-consistency scores versus the strongest baselines) and a concise statement of the evaluation protocol, while preserving its high-level character.
revision: yes

Referee: §3.2 (Multimodal Style Encoder): The lightweight language-style adapter is described as providing flexible control without intensive multimodal pretraining. However, the manuscript does not detail the mechanism ensuring alignment between language embeddings and visual glyph features, raising the possibility that any observed improvements may stem primarily from the global-aware tokenizer or post-refinement rather than the multimodal component.

Authors: We thank the referee for this observation. The language-style adapter aligns embeddings via learned projection layers and cross-attention trained jointly on paired text-image data; however, we acknowledge that the current description is insufficiently explicit. We will expand §3.2 with a precise account of the alignment procedure, the associated loss terms, and new ablation results that isolate the multimodal adapter’s contribution from the tokenizer and refinement stages.
revision: yes
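
The alignment mechanism named in the simulated rebuttal (a projection followed by cross-attention) can be sketched in a few lines. Everything concrete here is an assumption for illustration: the function names, the dimensions, the fixed weights standing in for learned ones, and the simplification that the visual tokens serve as both keys and values without their own projections. It is not the paper's adapter.

```python
from math import exp, sqrt

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def adapt_text_to_style(text_emb, visual_tokens, w_query):
    """Project a text embedding into the visual space, then attend over visual style tokens."""
    query = matvec(w_query, text_emb)  # projection layer (weights would be learned)
    scale = sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, tok)) / scale for tok in visual_tokens]
    weights = softmax(scores)          # attention over the visual tokens
    dim = len(visual_tokens[0])
    fused = [sum(w * tok[d] for w, tok in zip(weights, visual_tokens)) for d in range(dim)]
    return fused, weights

# Hypothetical values: a 3-d text embedding, a 2x3 projection, two 2-d visual style tokens.
text_emb = [1.0, 0.0, 0.0]
w_query = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
visual_tokens = [[1.0, 0.0], [0.0, 1.0]]

fused, weights = adapt_text_to_style(text_emb, visual_tokens, w_query)
print(weights[0] > weights[1])  # the text query attends more to the first visual token
```

The referee's concern maps onto this sketch directly: if the projection is poorly trained, the attention weights carry little information and the fused vector reduces to an average of the visual tokens, so the text adds nothing over visual-only references.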

Circularity Check

0 steps flagged

No significant circularity in GAR-Font architectural claims

The paper introduces GAR-Font as a new autoregressive framework consisting of a global-aware tokenizer, a multimodal style encoder with lightweight language-style adapter, and a post-refinement pipeline. These components are presented as novel architectural contributions for multimodal few-shot font generation, with performance claims supported by experiments rather than any closed-form derivation or parameter fitting that reduces to prior inputs by construction. No equations, self-definitional loops, fitted predictions, or load-bearing self-citations that would make the central results equivalent to their own definitions are present in the abstract or described framework. The derivation chain is self-contained as an empirical model proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

Review is limited to the abstract; full architectural and training details are unavailable. The model introduces new components whose internal hyperparameters and training assumptions are not specified.

invented entities (2)

global-aware tokenizer (no independent evidence)
purpose: Captures both local glyph structures and global stylistic patterns
Introduced as a core new component to overcome patch-level limitations.

lightweight language-style adapter (no independent evidence)
purpose: Enables flexible textual style control without intensive multimodal pretraining
Proposed as part of the multimodal style encoder.

pith-pipeline@v0.9.0 · 5737 in / 1047 out tokens · 42885 ms · 2026-05-21T16:52:33.294387+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "global-aware tokenizer (G-Tok) that fuses local features with global perception... hybrid CNN–ViT encoder"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "lightweight language-style adapter... aligns textual descriptions with visual style embeddings"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

79 extracted references · 79 canonical work pages · 13 internal anchors

[1]

Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35:23716–23736,

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Men- sch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35:23716–23736,

work page
[2]

Flextok: Resam- pling images into 1d token sequences of flexible length

Roman Bachmann, Jesse Allardice, David Mizrahi, Enrico Fini, O ˘guzhan Fatih Kar, Elmira Amirloo, Alaaeldin El- Nouby, Amir Zamir, and Afshin Dehghan. Flextok: Resam- pling images into 1d token sequences of flexible length. In Forty-second International Conference on Machine Learn- ing, 2025. 2

work page 2025
[3]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 6

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

Word level font-to-font image translation us- ing convolutional recurrent generative adversarial networks

Ankan Kumar Bhunia, Ayan Kumar Bhunia, Prithaj Baner- jee, Aishik Konwer, Abir Bhowmick, Partha Pratim Roy, and Umapada Pal. Word level font-to-font image translation us- ing convolutional recurrent generative adversarial networks. In2018 24th International Conference on Pattern Recogni- tion (ICPR), pages 3645–3650. IEEE, 2018. 2

work page 2018
[5]

Efficient-vqgan: To- wards high-resolution image generation with efficient vision transformers

Shiyue Cao, Yueqin Yin, Lianghua Huang, Yu Liu, Xin Zhao, Deli Zhao, and Kaigi Huang. Efficient-vqgan: To- wards high-resolution image generation with efficient vision transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 7368–7377, 2023. 2

work page 2023
[6]

HunyuanImage 3.0 Technical Report

Siyu Cao, Hangting Chen, Peng Chen, Yiji Cheng, Yutao Cui, Xinchi Deng, Ying Dong, Kipper Gong, Tianpeng Gu, Xiusen Gu, et al. Hunyuanimage 3.0 technical report.arXiv preprint arXiv:2509.23951, 2025. 3

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

Adaclip: Adapting clip with hybrid learnable prompts for zero-shot anomaly de- tection

Yunkang Cao, Jiangning Zhang, Luca Frittoli, Yuqi Cheng, Weiming Shen, and Giacomo Boracchi. Adaclip: Adapting clip with hybrid learnable prompts for zero-shot anomaly de- tection. InEuropean Conference on Computer Vision, pages 55–72. Springer, 2024. 3

work page 2024
[8]

Few-shot composi- tional font generation with dual memory

Junbum Cha, Sanghyuk Chun, Gayoung Lee, Bado Lee, Seonghyeon Kim, and Hwalsuk Lee. Few-shot composi- tional font generation with dual memory. InEuropean con- ference on computer vision, pages 735–751. Springer, 2020. 2

work page 2020
[9]

Gener- ating handwritten chinese characters using cyclegan

Bo Chang, Qiong Zhang, Shenyi Pan, and Lili Meng. Gener- ating handwritten chinese characters using cyclegan. In2018 IEEE winter conference on applications of computer vision (WACV), pages 199–207. IEEE, 2018. 2

work page 2018
[10]

Maskgit: Masked generative image transformer

Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. Maskgit: Masked generative image transformer. InProceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 11315–11325, 2022. 2

work page 2022
[11]

Chinese handwriting imitation with hierarchical generative adversarial network

Jie Chang, Yujun Gu, Ya Zhang, Yan-Feng Wang, and CM Innovation. Chinese handwriting imitation with hierarchical generative adversarial network. InBMVC, page 290, 2018. 2

work page 2018
[12]

Generative pre- training from pixels

Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Hee- woo Jun, David Luan, and Ilya Sutskever. Generative pre- training from pixels. InInternational conference on machine learning, pages 1691–1703. PMLR, 2020. 2

work page 2020
[13]

If-font: Ideo- graphic description sequence-following font generation

Xinping Chen, Xiao Ke, and Wenzhong Guo. If-font: Ideo- graphic description sequence-following font generation. In Advances in Neural Information Processing Systems, pages 14177–14199. Curran Associates, Inc., 2024. 1, 5

work page 2024
[14]

Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus- pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811,

work page internal anchor Pith review Pith/arXiv arXiv
[15]

Emerging Properties in Unified Multimodal Pretraining

Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, and Haoqi Fan. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025. 3

work page internal anchor Pith review Pith/arXiv arXiv 2025
[16]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929, 2020. 3

work page internal anchor Pith review Pith/arXiv arXiv 2010
[17]

Taming transformers for high-resolution image synthesis

Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021. 2, 3

work page 2021
[18]

Generate like experts: Multi-stage font generation by incorporating font transfer process into diffu- sion models

Bin Fu, Fanghua Yu, Anran Liu, Zixuan Wang, Jie Wen, Jun- jun He, and Yu Qiao. Generate like experts: Multi-stage font generation by incorporating font transfer process into diffu- sion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6892–6901,

work page
[19]

Artistic glyph image synthesis via one-stage few- shot learning.ACM Transactions on Graphics (ToG), 38(6): 1–12, 2019

Yue Gao, Yuan Guo, Zhouhui Lian, Yingmin Tang, and Jian- guo Xiao. Artistic glyph image synthesis via one-stage few- shot learning.ACM Transactions on Graphics (ToG), 38(6): 1–12, 2019. 2

work page 2019
[20]

SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation

Yuying Ge, Sijie Zhao, Jinguo Zhu, Yixiao Ge, Kun Yi, Lin Song, Chen Li, Xiaohan Ding, and Ying Shan. Seed-x: Mul- timodal models with unified multi-granularity comprehen- sion and generation.arXiv preprint arXiv:2404.14396, 2024. 3

work page internal anchor Pith review Pith/arXiv arXiv 2024
[21]

Diff-font: Diffusion model for robust one-shot font generation.International Journal of Computer Vision, 132(11):5372–5386, 2024

Haibin He, Xinyuan Chen, Chaoyue Wang, Juhua Liu, Bo Du, Dacheng Tao, and Qiao Yu. Diff-font: Diffusion model for robust one-shot font generation.International Journal of Computer Vision, 132(11):5372–5386, 2024. 1, 2, 5

work page 2024
[22]

Few- shot font generation by learning style difference and similar- ity.IEEE Transactions on Circuits and Systems for Video Technology, 34(9):8013–8025, 2024

Xiao He, Mingrui Zhu, Nannan Wang, and Xinbo Gao. Few- shot font generation by learning style difference and similar- ity.IEEE Transactions on Circuits and Systems for Video Technology, 34(9):8013–8025, 2024. 2

work page 2024
[23]

Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022. 3

work page 2022
[24]

Not all image regions matter: Masked vector quan- tization for autoregressive image generation

Mengqi Huang, Zhendong Mao, Quan Wang, and Yongdong Zhang. Not all image regions matter: Masked vector quan- tization for autoregressive image generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pages 2002–2011, 2023. 2

work page 2002
[25]

Spectralar: Spectral autore- gressive visual generation.arXiv preprint arXiv:2506.10962,

Yuanhui Huang, Weiliang Chen, Wenzhao Zheng, Yueqi Duan, Jie Zhou, and Jiwen Lu. Spectralar: Spectral autore- gressive visual generation.arXiv preprint arXiv:2506.10962,

work page arXiv
[26]

Nfig: Au- toregressive image generation with next-frequency predic- tion.arXiv preprint arXiv:2503.07076, 2025

Zhihao Huang, Xi Qiu, Yukuo Ma, Yifu Zhou, Junjie Chen, Hongyuan Zhang, Chi Zhang, and Xuelong Li. Nfig: Au- toregressive image generation with next-frequency predic- tion.arXiv preprint arXiv:2503.07076, 2025. 2

work page arXiv 2025
[27]

Scfont: Structure-guided chinese font generation via deep stacked networks

Yue Jiang, Zhouhui Lian, Yingmin Tang, and Jianguo Xiao. Scfont: Structure-guided chinese font generation via deep stacked networks. InProceedings of the AAAI conference on artificial intelligence, pages 4015–4022, 2019. 2

work page 2019
[28]

Legacy learning using few-shot font generation models for automatic text design in metaverse content: Cases studies in korean and chinese.arXiv preprint arXiv:2408.16900, 2024

Younghwi Kim, Seok Chan Jeong, and Sunghyun Sim. Legacy learning using few-shot font generation models for automatic text design in metaverse content: Cases studies in korean and chinese.arXiv preprint arXiv:2408.16900, 2024. 2

work page arXiv 2024
[29]

Look closer to supervise better: One-shot font generation via component- based discriminator

Yuxin Kong, Canjie Luo, Weihong Ma, Qiyuan Zhu, Sheng- gao Zhu, Nicholas Yuan, and Lianwen Jin. Look closer to supervise better: One-shot font generation via component- based discriminator. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 13482–13491, 2022. 1

work page 2022
[30]

Fontadapter: Instant font adaptation in visual text generation.arXiv preprint arXiv:2506.05843, 2025

Myungkyu Koo, Subin Kim, Sangkyung Kwak, Jaehyun Nam, Seojin Kim, and Jinwoo Shin. Fontadapter: Instant font adaptation in visual text generation.arXiv preprint arXiv:2506.05843, 2025. 3

work page arXiv 2025
[31]

Autoregressive image generation using residual quantization

Doyup Lee, Chiheon Kim, Saehoon Kim, Minsu Cho, and Wook-Shin Han. Autoregressive image generation using residual quantization. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 11523–11532, 2022. 2

work page 2022
[32]

Llava-med: Training a large language- and-vision assistant for biomedicine in one day

Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao. Llava-med: Training a large language- and-vision assistant for biomedicine in one day. InAdvances in Neural Information Processing Systems, pages 28541– 28564. Curran Associates, Inc., 2023. 2

work page 2023
[33]

Hfh-font: Few-shot chinese font synthesis with higher quality, faster speed, and higher reso- lution.ACM Transactions on Graphics (TOG), 43(6):1–16,

Hua Li and Zhouhui Lian. Hfh-font: Few-shot chinese font synthesis with higher quality, faster speed, and higher reso- lution.ACM Transactions on Graphics (TOG), 43(6):1–16,

work page
[34]

Fstdiff: One-shot font generation via cross-font style transformation learning

Shilin Li and Anna Zhu. Fstdiff: One-shot font generation via cross-font style transformation learning. InInternational Conference on Document Analysis and Recognition, pages 167–182. Springer, 2025. 2

work page 2025
[35]

Imagefolder: Autoregres- sive image generation with folded tokens.arXiv preprint arXiv:2410.01756, 2024

Xiang Li, Kai Qiu, Hao Chen, Jason Kuen, Jiuxiang Gu, Bhiksha Raj, and Zhe Lin. Imagefolder: Autoregres- sive image generation with folded tokens.arXiv preprint arXiv:2410.01756, 2024. 2

work page arXiv 2024
[36]

Cvfont: Synthesizing chinese vector fonts via deep layout inferring

Zhouhui Lian and Yichen Gao. Cvfont: Synthesizing chinese vector fonts via deep layout inferring. InComputer Graphics Forum, pages 212–225. Wiley Online Library, 2022. 2

work page 2022
[37]

Easyfont: a style learning-based system to easily build your large-scale handwriting fonts.ACM Transactions on Graph- ics (TOG), 38(1):1–18, 2018

Zhouhui Lian, Bo Zhao, Xudong Chen, and Jianguo Xiao. Easyfont: a style learning-based system to easily build your large-scale handwriting fonts.ACM Transactions on Graph- ics (TOG), 38(1):1–18, 2018. 2

work page 2018
[38]

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual rep- resentation by alignment before projection.arXiv preprint arXiv:2311.10122, 2023. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023
[39]

Xmp- font: Self-supervised cross-modality pre-training for few- shot font generation

Wei Liu, Fangyue Liu, Fei Ding, Qian He, and Zili Yi. Xmp- font: Self-supervised cross-modality pre-training for few- shot font generation. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 7905–7914, 2022. 2

work page 2022
[40]

Textadapter: Self-supervised domain adapta- tion for cross-domain text recognition.IEEE Transactions on Multimedia, 26:9854–9865, 2024

Xiao-Qian Liu, Peng-Fei Zhang, Xin Luo, Zi Huang, and Xin-Shun Xu. Textadapter: Self-supervised domain adapta- tion for cross-domain text recognition.IEEE Transactions on Multimedia, 26:9854–9865, 2024. 3

work page 2024
[41]

Dualvector: Unsupervised vector font synthesis with dual-part represen- tation

Ying-Tian Liu, Zhifei Zhang, Yuan-Chen Guo, Matthew Fisher, Zhaowen Wang, and Song-Hai Zhang. Dualvector: Unsupervised vector font synthesis with dual-part represen- tation. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, pages 14193–14202,

work page
[42]

Unified-io 2: Scaling autoregressive multimodal models with vision language audio and action

Jiasen Lu, Christopher Clark, Sangho Lee, Zichen Zhang, Savya Khosla, Ryan Marten, Derek Hoiem, and Aniruddha Kembhavi. Unified-io 2: Scaling autoregressive multimodal models with vision language audio and action. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26439–26455, 2024. 2

work page 2024
[43]

Callireader: Contextualizing chinese calligra- phy via an embedding-aligned vision-language model.arXiv preprint arXiv:2503.06472, 2025

Yuxuan Luo, Jiaqi Tang, Chenyi Huang, Feiyang Hao, and Zhouhui Lian. Callireader: Contextualizing chinese calligra- phy via an embedding-aligned vision-language model.arXiv preprint arXiv:2503.06472, 2025. 3

work page arXiv 2025
[44]

Star: Scale-wise text-conditioned autoregressive image generation.arXiv preprint arXiv:2406.10797, 2024

Xiaoxiao Ma, Mohan Zhou, Tao Liang, Yalong Bai, Tiejun Zhao, Biye Li, Huaian Chen, and Yi Jin. Star: Scale-wise text-conditioned autoregressive image generation.arXiv preprint arXiv:2406.10797, 2024. 2

work page arXiv 2024
[45]

Few shot font generation via transferring simi- larity guided global style and quantization local style

Wei Pan, Anna Zhu, Xinyu Zhou, Brian Kenji Iwana, and Shilin Li. Few shot font generation via transferring simi- larity guided global style and quantization local style. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19506–19516, 2023. 2

work page 2023
[46]

Few-shot font generation with localized style representations and factorization

Song Park, Sanghyuk Chun, Junbum Cha, Bado Lee, and Hyunjung Shim. Few-shot font generation with localized style representations and factorization. InProceedings of the AAAI conference on artificial intelligence, pages 2393–2402,

work page
[47]

Multiple heads are better than one: Few- shot font generation with multiple localized experts

Song Park, Sanghyuk Chun, Junbum Cha, Bado Lee, and Hyunjung Shim. Multiple heads are better than one: Few- shot font generation with multiple localized experts. InPro- ceedings of the IEEE/CVF international conference on com- puter vision, pages 13900–13909, 2021. 2

work page 2021
[48]

Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. Image transformer. In International conference on machine learning, pages 4055–4064. PMLR, 2018.

[49]

Ali Razavi, Aaron Van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with vq-vae-2. Advances in neural information processing systems, 32, 2019.

[50]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

[51]

Wenda Shi, Yiren Song, Dengming Zhang, Jiaming Liu, and Xingxing Zou. Fonts: Text rendering with typography and style controls. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 18463–18474,

[52]

Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525, 2024.

[53]

Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Emu: Generative pretraining in multimodality. arXiv preprint arXiv:2307.05222, 2023.

[54]

Licheng Tang, Yiyang Cai, Jiaming Liu, Zhibin Hong, Mingming Gong, Minhu Fan, Junyu Han, Jingtuo Liu, Errui Ding, and Jingdong Wang. Few-shot font generation by learning fine-grained local styles. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7895–7904, 2022.

[55]

Shusen Tang, Zeqing Xia, Zhouhui Lian, Yingmin Tang, and Jianguo Xiao. Fontrnn: Generating large-scale chinese fonts via recurrent neural network. In Computer Graphics Forum, pages 567–577. Wiley Online Library, 2019.

[56]

Vikas Thamizharasan, Difan Liu, Shantanu Agarwal, Matthew Fisher, Michaël Gharbi, Oliver Wang, Alec Jacobson, and Evangelos Kalogerakis. Vecfusion: Vector font generation with diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7943–7952, 2024.

[57]

Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction. Advances in neural information processing systems, 37:84839–84865, 2024.

[58]

Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. Advances in neural information processing systems, 30, 2017.

[59]

Chi Wang, Min Zhou, Tiezheng Ge, Yuning Jiang, Hujun Bao, and Weiwei Xu. Cf-font: Content fusion for few-shot font generation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1858–1867, 2023.

[60]

Junke Wang, Zhi Tian, Xun Wang, Xinyu Zhang, Weilin Huang, Zuxuan Wu, and Yu-Gang Jiang. Simplear: Pushing the frontier of autoregressive visual generation through pretraining, sft, and rl. arXiv preprint arXiv:2504.11455, 2025.

[61]

Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869, 2024.

[62]

Yizhi Wang and Zhouhui Lian. Deepvecfont: synthesizing high-quality vector fonts via dual-modality learning. ACM Transactions on Graphics (TOG), 40(6):1–15, 2021.

[63]

Yuqing Wang, Yizhi Wang, Longhui Yu, Yuesheng Zhu, and Zhouhui Lian. Deepvecfont-v2: Exploiting transformers to synthesize vector fonts with higher quality. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18320–18328, 2023.

[64]

Yuqing Wang, Shuhuai Ren, Zhijie Lin, Yujin Han, Haoyuan Guo, Zhenheng Yang, Difan Zou, Jiashi Feng, and Xihui Liu. Parallelized autoregressive visual generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12955–12965, 2025.

[65]

Ze Wang, Hao Chen, Benran Hu, Jiang Liu, Ximeng Sun, Jialian Wu, Yusheng Su, Xiaodong Yu, Emad Barsoum, and Zicheng Liu. Instella-t2i: Pushing the limits of 1d discrete latent space image generation. arXiv preprint arXiv:2506.21022, 2025.

[66]

Qi Wen, Shuang Li, Bingfeng Han, and Yi Yuan. Zigan: Fine-grained chinese calligraphy font generation via a few-shot style transfer approach. In Proceedings of the 29th ACM international conference on multimedia, pages 621–629, 2021.

[67]

Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report. arXiv preprint arXiv:2508.02324, 2025.

[68]

Zeqing Xia, Bojun Xiong, and Zhouhui Lian. Vecfontsdf: Learning to reconstruct and synthesize high-quality vector fonts via signed distance functions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1848–1857, 2023.

[69]

Junfei Xiao, Zheng Xu, Alan Yuille, Shen Yan, and Boyu Wang. Palm2-vadapter: progressively aligned language model makes a strong vision-language adapter. arXiv preprint arXiv:2402.10896, 2024.

[70]

Yangchen Xie, Xinyuan Chen, Li Sun, and Yue Lu. Dg-font: Deformable generative networks for unsupervised font generation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5130–5140,

[71]

Zhenhua Yang, Dezhi Peng, Yuxin Kong, Yuyi Zhang, Cong Yao, and Lianwen Jin. Fontdiffuser: One-shot font generation via denoising diffusion with multi-scale content aggregation and style contrastive learning. In Proceedings of the AAAI conference on artificial intelligence, pages 6603–6611,

[72]

Mingshuai Yao, Yabo Zhang, Xianhui Lin, Xiaoming Li, and Wangmeng Zuo. Vq-font: Few-shot font generation with structure-aware enhancement and quantization. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 16407–16415, 2024.

[73]

Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, and Yonghui Wu. Vector-quantized image modeling with improved vqgan. arXiv preprint arXiv:2110.04627, 2021.

[74]

Lili Yu, Bowen Shi, Ramakanth Pasunuru, Benjamin Muller, Olga Golovneva, Tianlu Wang, Arun Babu, Binh Tang, Brian Karrer, Shelly Sheynin, et al. Scaling autoregressive multi-modal models: Pretraining and instruction tuning. arXiv preprint arXiv:2309.02591, 2023.

[75]

Qihang Yu, Mark Weber, Xueqing Deng, Xiaohui Shen, Daniel Cremers, and Liang-Chieh Chen. An image is worth 32 tokens for reconstruction and generation. Advances in Neural Information Processing Systems, 37:128940–128966, 2024.

[76]

Qihang Yu, Ju He, Xueqing Deng, Xiaohui Shen, and Liang-Chieh Chen. Randomized autoregressive visual generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 18431–18441, 2025.

[77]

Kaiwen Zha, Lijun Yu, Alireza Fathi, David A Ross, Cordelia Schmid, Dina Katabi, and Xiuye Gu. Language-guided image tokenization for generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 15713–15722, 2025.

[79]

Yexun Zhang, Ya Zhang, and Wenbin Cai. Separating style and content for generalized style transfer. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8447–8455, 2018.

[80]

Anlin Zheng, Haochen Wang, Yucheng Zhao, Weipeng Deng, Tiancai Wang, Xiangyu Zhang, and Xiaojuan Qi. Holistic tokenizer for autoregressive image generation. arXiv preprint arXiv:2507.02358, 2025.

[43] [43]

Callireader: Contextualizing chinese calligra- phy via an embedding-aligned vision-language model.arXiv preprint arXiv:2503.06472, 2025

Yuxuan Luo, Jiaqi Tang, Chenyi Huang, Feiyang Hao, and Zhouhui Lian. Callireader: Contextualizing chinese calligra- phy via an embedding-aligned vision-language model.arXiv preprint arXiv:2503.06472, 2025. 3

work page arXiv 2025

[44] [44]

Star: Scale-wise text-conditioned autoregressive image generation.arXiv preprint arXiv:2406.10797, 2024

Xiaoxiao Ma, Mohan Zhou, Tao Liang, Yalong Bai, Tiejun Zhao, Biye Li, Huaian Chen, and Yi Jin. Star: Scale-wise text-conditioned autoregressive image generation.arXiv preprint arXiv:2406.10797, 2024. 2

work page arXiv 2024

[45] [45]

Few shot font generation via transferring simi- larity guided global style and quantization local style

Wei Pan, Anna Zhu, Xinyu Zhou, Brian Kenji Iwana, and Shilin Li. Few shot font generation via transferring simi- larity guided global style and quantization local style. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19506–19516, 2023. 2

work page 2023

[46] [46]

Few-shot font generation with localized style representations and factorization

Song Park, Sanghyuk Chun, Junbum Cha, Bado Lee, and Hyunjung Shim. Few-shot font generation with localized style representations and factorization. InProceedings of the AAAI conference on artificial intelligence, pages 2393–2402,

work page

[47] [47]

Multiple heads are better than one: Few- shot font generation with multiple localized experts

Song Park, Sanghyuk Chun, Junbum Cha, Bado Lee, and Hyunjung Shim. Multiple heads are better than one: Few- shot font generation with multiple localized experts. InPro- ceedings of the IEEE/CVF international conference on com- puter vision, pages 13900–13909, 2021. 2

work page 2021

[48] [48]

Im- age transformer

Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. Im- age transformer. InInternational conference on machine learning, pages 4055–4064. PMLR, 2018. 2

work page 2018

[49] [49]

Gener- ating diverse high-fidelity images with vq-vae-2.Advances in neural information processing systems, 32, 2019

Ali Razavi, Aaron Van den Oord, and Oriol Vinyals. Gener- ating diverse high-fidelity images with vq-vae-2.Advances in neural information processing systems, 32, 2019. 2

work page 2019

[50] [50]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of math- ematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024. 5

work page internal anchor Pith review Pith/arXiv arXiv 2024

[51] [51]

Fonts: Text rendering with typography and style controls

Wenda Shi, Yiren Song, Dengming Zhang, Jiaming Liu, and Xingxing Zou. Fonts: Text rendering with typography and style controls. InProceedings of the IEEE/CVF Interna- tional Conference on Computer Vision, pages 18463–18474,

work page

[52] [52]

Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation.arXiv preprint arXiv:2406.06525, 2024. 2, 7

work page internal anchor Pith review Pith/arXiv arXiv 2024

[53] [53]

Emu: Generative Pretraining in Multimodality

Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Emu: Generative pretraining in multimodality.arXiv preprint arXiv:2307.05222, 2023. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023

[54] [54]

Few-shot font generation by learning fine-grained local styles

Licheng Tang, Yiyang Cai, Jiaming Liu, Zhibin Hong, Ming- ming Gong, Minhu Fan, Junyu Han, Jingtuo Liu, Errui Ding, and Jingdong Wang. Few-shot font generation by learning fine-grained local styles. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7895–7904, 2022. 2, 3, 4

work page 2022

[55] [55]

Fontrnn: Generating large-scale chinese fonts via recurrent neural network

Shusen Tang, Zeqing Xia, Zhouhui Lian, Yingmin Tang, and Jianguo Xiao. Fontrnn: Generating large-scale chinese fonts via recurrent neural network. InComputer Graphics Forum, pages 567–577. Wiley Online Library, 2019. 2

work page 2019

[56] [56]

Vecfusion: Vector font gen- eration with diffusion

Vikas Thamizharasan, Difan Liu, Shantanu Agarwal, Matthew Fisher, Micha¨el Gharbi, Oliver Wang, Alec Jacob- son, and Evangelos Kalogerakis. Vecfusion: Vector font gen- eration with diffusion. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 7943–7952, 2024. 2

work page 2024

[57] [57]

Visual autoregressive modeling: Scalable image generation via next-scale prediction.Advances in neural in- formation processing systems, 37:84839–84865, 2024

Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Li- wei Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction.Advances in neural in- formation processing systems, 37:84839–84865, 2024. 2

work page 2024

[58] [58]

Neural discrete representation learning.Advances in neural information pro- cessing systems, 30, 2017

Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning.Advances in neural information pro- cessing systems, 30, 2017. 2, 3

work page 2017

[59] [59]

Cf-font: Content fusion for few-shot font generation

Chi Wang, Min Zhou, Tiezheng Ge, Yuning Jiang, Hujun Bao, and Weiwei Xu. Cf-font: Content fusion for few-shot font generation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1858– 1867, 2023. 5

work page 2023

[60] [60]

Simplear: Pushing the frontier of autoregressive visual generation through pretraining, sft, and rl

Junke Wang, Zhi Tian, Xun Wang, Xinyu Zhang, Weilin Huang, Zuxuan Wu, and Yu-Gang Jiang. Simplear: Pushing the frontier of autoregressive visual generation through pre- training, sft, and rl.arXiv preprint arXiv:2504.11455, 2025. 2

work page arXiv 2025

[61]

Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869, 2024.

[62]

Yizhi Wang and Zhouhui Lian. Deepvecfont: synthesizing high-quality vector fonts via dual-modality learning. ACM Transactions on Graphics (TOG), 40(6):1–15, 2021.

[63]

Yuqing Wang, Yizhi Wang, Longhui Yu, Yuesheng Zhu, and Zhouhui Lian. Deepvecfont-v2: Exploiting transformers to synthesize vector fonts with higher quality. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18320–18328, 2023.

[64]

Yuqing Wang, Shuhuai Ren, Zhijie Lin, Yujin Han, Haoyuan Guo, Zhenheng Yang, Difan Zou, Jiashi Feng, and Xihui Liu. Parallelized autoregressive visual generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12955–12965, 2025.

[65]

Ze Wang, Hao Chen, Benran Hu, Jiang Liu, Ximeng Sun, Jialian Wu, Yusheng Su, Xiaodong Yu, Emad Barsoum, and Zicheng Liu. Instella-t2i: Pushing the limits of 1d discrete latent space image generation. arXiv preprint arXiv:2506.21022, 2025.

[66]

Qi Wen, Shuang Li, Bingfeng Han, and Yi Yuan. Zigan: Fine-grained chinese calligraphy font generation via a few-shot style transfer approach. In Proceedings of the 29th ACM international conference on multimedia, pages 621–629, 2021.

[67]

Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report. arXiv preprint arXiv:2508.02324, 2025.

[68]

Zeqing Xia, Bojun Xiong, and Zhouhui Lian. Vecfontsdf: Learning to reconstruct and synthesize high-quality vector fonts via signed distance functions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1848–1857, 2023.

[69]

Junfei Xiao, Zheng Xu, Alan Yuille, Shen Yan, and Boyu Wang. Palm2-vadapter: progressively aligned language model makes a strong vision-language adapter. arXiv preprint arXiv:2402.10896, 2024.

[70]

Yangchen Xie, Xinyuan Chen, Li Sun, and Yue Lu. Dg-font: Deformable generative networks for unsupervised font generation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5130–5140,

[71]

Zhenhua Yang, Dezhi Peng, Yuxin Kong, Yuyi Zhang, Cong Yao, and Lianwen Jin. Fontdiffuser: One-shot font generation via denoising diffusion with multi-scale content aggregation and style contrastive learning. In Proceedings of the AAAI conference on artificial intelligence, pages 6603–6611,

[72]

Mingshuai Yao, Yabo Zhang, Xianhui Lin, Xiaoming Li, and Wangmeng Zuo. Vq-font: Few-shot font generation with structure-aware enhancement and quantization. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 16407–16415, 2024.

[73]

Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, and Yonghui Wu. Vector-quantized image modeling with improved vqgan. arXiv preprint arXiv:2110.04627, 2021.

[74]

Lili Yu, Bowen Shi, Ramakanth Pasunuru, Benjamin Muller, Olga Golovneva, Tianlu Wang, Arun Babu, Binh Tang, Brian Karrer, Shelly Sheynin, et al. Scaling autoregressive multimodal models: Pretraining and instruction tuning. arXiv preprint arXiv:2309.02591, 2023.

[75]

Qihang Yu, Mark Weber, Xueqing Deng, Xiaohui Shen, Daniel Cremers, and Liang-Chieh Chen. An image is worth 32 tokens for reconstruction and generation. Advances in Neural Information Processing Systems, 37:128940–128966, 2024.

[76]

Qihang Yu, Ju He, Xueqing Deng, Xiaohui Shen, and Liang-Chieh Chen. Randomized autoregressive visual generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 18431–18441, 2025.

[77]

Kaiwen Zha, Lijun Yu, Alireza Fathi, David A Ross, Cordelia Schmid, Dina Katabi, and Xiuye Gu. Language-guided image tokenization for generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 15713–15722, 2025.

[78] [79]

Yexun Zhang, Ya Zhang, and Wenbin Cai. Separating style and content for generalized style transfer. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8447–8455, 2018.

[79] [80]

Anlin Zheng, Haochen Wang, Yucheng Zhao, Weipeng Deng, Tiancai Wang, Xiangyu Zhang, and Xiaojuan Qi. Holistic tokenizer for autoregressive image generation. arXiv preprint arXiv:2507.02358, 2025.