Mutual Enhancement Between Global Tokens and Patch Tokens: From Theory to Practice

Jun Zhao; Kang Liu; Xin Jiang; Xiusheng Huang; Yequan Wang

arxiv: 2605.16384 · v1 · pith:GT4NOEMRnew · submitted 2026-05-11 · 💻 cs.CV · cs.AI

Pith reviewed 2026-05-20 22:38 UTC · model grok-4.3

keywords: adaptive image tokenization, global tokens, patch tokens, dynamic token filtering, cumulative conditional entropy, discrete representations, vision transformer efficiency, information density

The pith

Global tokens capture mutual information across image patches while dynamic filtering by cumulative conditional entropy removes redundancy for adaptive tokenization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to establish that fixed-rate compression of images into patch tokens either loses essential details or keeps too much duplicate information because real images vary sharply in information density from region to region. It addresses the first problem by adding global tokens that explicitly model the information shared among patches and the second by introducing a filtering step that drops tokens whose removal changes little in the conditional entropy of the remaining set. A reader would care because this variable allocation promises both higher reconstruction accuracy and much lower compute cost when feeding the tokens into sequence models for generation or understanding. If the approach holds, long or high-resolution image sequences become feasible without proportional growth in token budget or processing time.

Core claim

The authors claim that patch tokens alone are informationally insufficient for faithful image reconstruction and that they contain substantial mutual redundancy. Global tokens are added to represent the information common across patches, creating a mutual enhancement effect, while a Dynamic Token Filtering procedure uses cumulative conditional entropy to decide which patch tokens can be safely discarded. The resulting system allocates tokens according to local information richness rather than a uniform rate, producing more compressed yet more accurate discrete representations.
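
One way to write the two drawbacks in standard information-theoretic notation is sketched below. This is a reading aid, not the paper's own derivation. The symbols are assumptions of the sketch: $X$ is the image, $P = (p_1, \dots, p_n)$ the patch tokens, and $G$ the global tokens.

```latex
% Insufficiency: patch tokens alone leave residual uncertainty about the image.
H(X \mid P) > 0

% Conditioning on global tokens as well cannot increase that uncertainty
% (chain rule for conditional mutual information):
H(X \mid P, G) = H(X \mid P) - I(X; G \mid P) \le H(X \mid P)

% Redundancy: the gap between the summed marginal entropies and the joint
% entropy of the patch tokens (their total correlation) is non-negative.
\sum_{i=1}^{n} H(p_i) - H(p_1, \dots, p_n) \ge 0
```

The middle inequality is strict only when $I(X; G \mid P) > 0$, that is, when the global tokens carry information about the image that the patch tokens do not already provide.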

What carries the argument

Global tokens that model mutual information shared among patch tokens, together with Dynamic Token Filtering that prunes patches according to cumulative conditional entropy.
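
The filtering idea can be illustrated with a minimal sketch. It assumes the per-token conditional entropies are already available and treats "cumulative conditional entropy" as a retention budget: tokens are kept, most informative first, until a chosen fraction of the total is covered. The function name, the `keep_fraction` parameter, and the sort-by-entropy simplification are all hypothetical and are not taken from the paper.

```python
def filter_by_cumulative_entropy(cond_entropies, keep_fraction=0.9):
    """Return sorted indices of tokens to keep.

    Assumption (not from the paper): tokens are ranked by conditional
    entropy and kept until the running sum reaches keep_fraction of the
    total conditional entropy. Low-entropy tokens are treated as redundant.
    """
    total = sum(cond_entropies)
    if total <= 0:
        # Nothing is informative beyond context, so no patch token is kept.
        return []

    # Visit tokens from most to least informative.
    order = sorted(range(len(cond_entropies)),
                   key=lambda i: cond_entropies[i], reverse=True)

    kept, running = [], 0.0
    for i in order:
        if running >= keep_fraction * total:
            break  # budget reached: remaining tokens are dropped
        kept.append(i)
        running += cond_entropies[i]
    return sorted(kept)


# Toy example: two informative tokens and three nearly redundant ones.
kept = filter_by_cumulative_entropy([2.0, 0.1, 1.5, 0.05, 0.35], keep_fraction=0.85)
```

In this toy case the two high-entropy tokens already cover more than 85% of the total, so the other three are dropped. A real implementation would recompute conditional entropies as the conditioning set changes rather than rank a fixed list.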

If this is right

Token count per image becomes variable and proportional to actual information content instead of fixed.
Downstream sequence models receive fewer tokens on average, directly reducing inference latency.
Image generation and reconstruction metrics improve because shared context is explicitly preserved by the global tokens.
Long image sequences fit within practical context lengths without uniform downsampling.
The method supplies a concrete way to trade token budget against reconstruction error on a per-image basis.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same global-plus-filter pattern could be tested on video frames where temporal redundancy varies across scenes.
Approximating the conditional entropy step with a lightweight network might allow real-time adaptive tokenization on edge hardware.
Token budgets in large vision transformers could be made dynamic at the sequence level rather than fixed per layer.
The mutual-enhancement idea suggests exploring bidirectional global-patch interactions inside the encoder itself.

Load-bearing premise

Global tokens must accurately encode the true mutual information present across patch tokens, and the entropy threshold must eliminate only redundant content without discarding task-critical visual details or introducing reconstruction errors.

What would settle it

Apply the tokenization to a test set of images containing fine-grained critical details such as small text or textures, then measure whether the filtered reconstructions show statistically significant drops in pixel-level fidelity or perceptual quality relative to the unfiltered patch-only baseline.
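
The check described above could be set up roughly as follows. This is a sketch under stated assumptions: images are flat lists of pixel values in [0, 255], PSNR stands in for "pixel-level fidelity", and a paired t-statistic stands in for the significance test. The helper names are hypothetical, and the paper does not specify this protocol.

```python
import math
from statistics import mean, stdev


def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    mse = mean((a - b) ** 2 for a, b in zip(reference, reconstruction))
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def paired_t_statistic(baseline_scores, filtered_scores):
    """t-statistic for per-image differences (baseline minus filtered).

    Assumes finite scores and at least two images. A large positive value
    suggests the filtered reconstructions are consistently worse.
    """
    diffs = [b - f for b, f in zip(baseline_scores, filtered_scores)]
    spread = stdev(diffs)
    if spread == 0:
        raise ValueError("all differences are identical; t-statistic undefined")
    return mean(diffs) / (spread / math.sqrt(len(diffs)))


# Illustrative numbers only: per-image PSNR for three test images.
t = paired_t_statistic([30.0, 31.0, 32.0], [29.0, 29.0, 29.0])
```

The t-statistic would then be compared against a t-distribution with n - 1 degrees of freedom. A perceptual metric could be substituted for PSNR without changing the paired structure of the test.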

Figures

Figures reproduced from arXiv: 2605.16384 by Jun Zhao, Kang Liu, Xin Jiang, Xiusheng Huang, Yequan Wang.

**Figure 1.** The comprehensive speed-quality comparison of TaTok and existing approaches for ImageNet 256×256 image generation. The sampling speed is measured with an A100 GPU.

**Figure 2.** The overall process framework of TaTok. The input image is encoded into patch tokens via a vision …

**Figure 4.** The experimental results of reconstruction per…

**Figure 5.** The ablation study results comparing our …

**Figure 6.** Visualization results of ablation experiments.

**Figure 7.** Visualization results of ablation experiments.

**Figure 8.** Visualization results with three types of semantic-feature-free images as inputs.

Original abstract

Accurate and effective discrete image tokenization is crucial for long image sequence processing. However, current methods rigidly compress all content at a fixed rate, ignoring the variable information density of images and leading to either redundancy or information loss. Inspired by information entropy, we propose TaTok, a Theoretically grounded adaptive image Tokenization framework. We rigorously identify two key drawbacks in existing methods: information insufficiency when reconstructing images with patch tokens alone, and information redundancy among patch tokens. To address these, we introduce global tokens that model mutual information across patch tokens, and a Dynamic Token Filtering (DTF) algorithm based on cumulative conditional entropy to eliminate redundancy. Experiments confirm TaTok's state-of-the-art performance, delivering a 1.3x gFID improvement and 8.7x inference speedup. By allocating tokens according to information richness, TaTok enables more compressed yet accurate image tokenization, offering valuable insights for future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes TaTok, a theoretically grounded adaptive image tokenization framework. It identifies two drawbacks in existing fixed-rate methods—inadequate information for reconstruction using patch tokens alone and redundancy among patch tokens—drawing on information entropy. The approach introduces global tokens to capture mutual information across patches and a Dynamic Token Filtering (DTF) algorithm that uses cumulative conditional entropy to prune redundant tokens, with experiments claiming state-of-the-art results including a 1.3x gFID improvement and 8.7x inference speedup.

Significance. If the central claims hold, the work could meaningfully advance efficient discrete tokenization for long image sequences by adapting token allocation to variable information density rather than uniform compression. The explicit linkage of global tokens to mutual information and entropy-based filtering offers a principled alternative to rigid patch tokenization, with potential downstream benefits for multimodal models; the reported speedups are practically relevant if reproducible.

major comments (3)

[Theoretical Analysis] Theoretical Analysis section: the manuscript asserts a 'rigorous' identification of information insufficiency and redundancy but supplies no derivations, explicit mutual-information expressions, or proofs that global tokens reliably capture cross-patch dependencies; this is load-bearing for the claim that the method is theoretically grounded rather than heuristic.
[DTF Algorithm] DTF Algorithm subsection: the cumulative conditional entropy estimator is not specified (e.g., whether it re-uses the tokenizer, employs a separate density model, or Monte-Carlo sampling), nor is invariance of the threshold across image distributions demonstrated; without this, the assertion that DTF removes redundancy without discarding task-critical high-frequency content cannot be evaluated and risks artifacts on textured or edge-case images.
[Experimental Results] Experimental Results section: the 1.3x gFID and 8.7x speedup claims lack detailed baselines, dataset statistics, ablation on the entropy threshold, and per-image artifact analysis; this undermines the SOTA conclusion and leaves open the possibility that gains arise from threshold tuning rather than the proposed mutual-enhancement mechanism.

minor comments (2)

[Notation] Ensure all notation for global tokens, patch tokens, and conditional entropy is defined at first use and used consistently.
[Figures] Figure captions should explicitly state the image distributions and token budgets used so readers can assess generalizability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback, which helps strengthen the presentation of our work. We address each major comment point by point below. Revisions will be made to enhance theoretical clarity, algorithmic specification, and experimental reporting while preserving the core contributions of TaTok.

Referee: [Theoretical Analysis] Theoretical Analysis section: the manuscript asserts a 'rigorous' identification of information insufficiency and redundancy but supplies no derivations, explicit mutual-information expressions, or proofs that global tokens reliably capture cross-patch dependencies; this is load-bearing for the claim that the method is theoretically grounded rather than heuristic.

Authors: The identification of the two drawbacks draws directly from information-theoretic principles: patch tokens alone leave residual mutual information unmodeled (increasing conditional entropy for reconstruction), while fixed-rate allocation creates redundancy when local information density varies. Global tokens are introduced precisely to capture this cross-patch mutual information via joint attention. We acknowledge that the current manuscript presents these arguments at a conceptual level without explicit derivations or proofs. In the revised version we will add a dedicated paragraph in the Theoretical Analysis section containing the relevant mutual-information expressions (e.g., I(G; P) where G denotes global tokens and P the set of patch tokens) together with a short derivation showing how the addition of global tokens strictly reduces the total reconstruction entropy. This will make the theoretical grounding explicit rather than implicit.

Revision: yes

Referee: [DTF Algorithm] DTF Algorithm subsection: the cumulative conditional entropy estimator is not specified (e.g., whether it re-uses the tokenizer, employs a separate density model, or Monte-Carlo sampling), nor is invariance of the threshold across image distributions demonstrated; without this, the assertion that DTF removes redundancy without discarding task-critical high-frequency content cannot be evaluated and risks artifacts on textured or edge-case images.

Authors: We agree that the precise implementation of the cumulative conditional entropy estimator must be stated unambiguously. The estimator re-uses the tokenizer’s own next-token probability distribution to compute conditional entropies sequentially; no separate density model or Monte-Carlo sampling is employed. The threshold is chosen as a fixed percentile of the per-image entropy distribution observed on the training set. In the revision we will expand the DTF subsection with the exact algorithmic steps, pseudocode, and a short paragraph analyzing threshold stability across image distributions (natural scenes, textures, and synthetic data). We will also add qualitative reconstruction examples on high-frequency images to demonstrate that critical detail is retained.

Revision: yes

Referee: [Experimental Results] Experimental Results section: the 1.3x gFID and 8.7x speedup claims lack detailed baselines, dataset statistics, ablation on the entropy threshold, and per-image artifact analysis; this undermines the SOTA conclusion and leaves open the possibility that gains arise from threshold tuning rather than the proposed mutual-enhancement mechanism.

Authors: The reported gains are measured against standard fixed-rate baselines (VQGAN, VQVAE-2, and recent adaptive tokenizers) on ImageNet and COCO validation sets. We recognize that additional transparency is required. The revised Experimental Results section will include: (i) exact hyper-parameter settings and training details for all baselines, (ii) dataset statistics (image counts, resolution distributions), (iii) a dedicated ablation table varying the entropy threshold and reporting both gFID and inference latency, and (iv) per-image qualitative comparisons highlighting artifact patterns (or their absence) on textured and edge-case images. These additions will allow readers to verify that performance improvements originate from the mutual-enhancement design rather than threshold selection alone.

Revision: yes
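
The estimator outlined in the second response (entropies read off the tokenizer's next-token distributions, with a percentile threshold fixed on training data) might look roughly like the sketch below. Everything here is an assumption made for illustration: the function names, the nearest-rank percentile, the use of base-2 logarithms, and the rule that tokens below the threshold are the ones dropped. None of it is confirmed by the manuscript.

```python
import math


def shannon_entropy(probs):
    """Entropy in bits of one next-token distribution (zero terms skipped)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def percentile(values, q):
    """Nearest-rank percentile of a non-empty list, q in [0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def dtf_keep_mask(next_token_dists, threshold):
    """Flag tokens whose conditional entropy meets the threshold.

    next_token_dists[i] is assumed to be the model's distribution over the
    i-th token given the preceding tokens (hypothetical interface).
    """
    entropies = [shannon_entropy(d) for d in next_token_dists]
    return [h >= threshold for h in entropies], entropies


# Threshold fixed once from (made-up) training-set entropies, then reused.
threshold = percentile([0.2, 0.5, 1.0, 1.5], 50)
mask, entropies = dtf_keep_mask(
    [[0.5, 0.5], [1.0, 0.0], [0.25, 0.25, 0.25, 0.25]], threshold
)
```

In this toy run the deterministic second token has zero conditional entropy and is the only one flagged for removal. Whether a single percentile transfers across image distributions is exactly the stability question the referee raises.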

Circularity Check

0 steps flagged

No significant circularity; derivation remains self-contained

The paper frames TaTok as inspired by information entropy and introduces global tokens plus DTF based on cumulative conditional entropy to address identified drawbacks in patch-token methods. No equations or steps are shown that reduce the claimed predictions or performance gains directly to fitted thresholds or self-citations by construction. The central proposal (adaptive allocation via mutual information modeling and entropy-based filtering) retains independent content beyond its inputs, with experiments presented as empirical confirmation rather than definitional. This is the most common honest outcome for a method paper that proposes a new framework without load-bearing self-referential reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The approach rests on standard information-theoretic assumptions about variable image entropy and introduces two new components whose independent validation is not shown in the provided abstract.

axioms (1)

domain assumption: Images possess variable information density that can be quantified and exploited via concepts from information entropy. Explicitly stated as the inspiration for the adaptive tokenization approach.

invented entities (2)

Global tokens (no independent evidence)
purpose: Model mutual information across patch tokens to address information insufficiency.
Introduced to fix the claimed drawback of patch tokens alone.

Dynamic Token Filtering (DTF) algorithm (no independent evidence)
purpose: Eliminate redundancy among patch tokens using cumulative conditional entropy.
Proposed to fix the claimed redundancy drawback.

Reference graph

Works this paper leans on

105 extracted references · 105 canonical work pages · 25 internal anchors

[1]

Aho and Jeffrey D

Alfred V. Aho and Jeffrey D. Ullman , title =. 1972

work page 1972
[2]

Publications Manual , year = "1983", publisher =

work page 1983
[3]

Chandra and Dexter C

Ashok K. Chandra and Dexter C. Kozen and Larry J. Stockmeyer , year = "1981", title =. doi:10.1145/322234.322243

work page doi:10.1145/322234.322243 1981
[4]

Scalable training of

Andrew, Galen and Gao, Jianfeng , booktitle=. Scalable training of

work page
[5]

Dan Gusfield , title =. 1997

work page 1997
[6]

Tetreault , title =

Mohammad Sadegh Rasooli and Joel R. Tetreault , title =. Computing Research Repository , volume =. 2015 , url =

work page 2015
[7]

A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =

Ando, Rie Kubota and Zhang, Tong , Issn =. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =. Journal of Machine Learning Research , Month = dec, Numpages =

work page
[8]

Advances in Neural Information Processing Systems , year=

Attention is all you need , author=. Advances in Neural Information Processing Systems , year=

work page
[9]

Journal of machine learning research , volume=

Exploring the limits of transfer learning with a unified text-to-text transformer , author=. Journal of machine learning research , volume=

work page
[10]

OpenAI blog , volume=

Language models are unsupervised multitask learners , author=. OpenAI blog , volume=

work page
[11]

Language Models are Few-Shot Learners

Language models are few-shot learners , author=. arXiv preprint arXiv:2005.14165 , year=

work page internal anchor Pith review Pith/arXiv arXiv 2005
[12]

OPT: Open Pre-trained Transformer Language Models

Opt: Open pre-trained transformer language models , author=. arXiv preprint arXiv:2205.01068 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[13]

Qwen Technical Report

Qwen technical report , author=. arXiv preprint arXiv:2309.16609 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[14]

Scaling Laws for Neural Language Models

Scaling laws for neural language models , author=. arXiv preprint arXiv:2001.08361 , year=

work page internal anchor Pith review Pith/arXiv arXiv 2001
[15]

Scaling Laws for Autoregressive Generative Modeling

Scaling laws for autoregressive generative modeling , author=. arXiv preprint arXiv:2010.14701 , year=

work page internal anchor Pith review Pith/arXiv arXiv 2010
[16]

Vector-quantized Image Modeling with Improved VQGAN

Vector-quantized image modeling with improved vqgan , author=. arXiv preprint arXiv:2110.04627 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[17]

Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

Scaling autoregressive models for content-rich text-to-image generation , author=. arXiv preprint arXiv:2206.10789 , volume=

work page internal anchor Pith review Pith/arXiv arXiv
[18]

Neurocomputing , volume=

Roformer: Enhanced transformer with rotary position embedding , author=. Neurocomputing , volume=. 2024 , publisher=

work page 2024
[19]

Classifier-Free Diffusion Guidance

Classifier-free diffusion guidance , author=. arXiv preprint arXiv:2207.12598 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[20]

Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

Language Model Beats Diffusion--Tokenizer is Key to Visual Generation , author=. arXiv preprint arXiv:2310.05737 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[21]

LLaMA: Open and Efficient Foundation Language Models

Llama: Open and efficient foundation language models , author=. arXiv preprint arXiv:2302.13971 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[22]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Eyes wide shut? exploring the visual shortcomings of multimodal llms , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page
[23]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

An image is worth 16x16 words: Transformers for image recognition at scale , author=. arXiv preprint arXiv:2010.11929 , year=

work page internal anchor Pith review Pith/arXiv arXiv 2010
[24]

International journal of computer vision , volume=

The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale , author=. International journal of computer vision , volume=. 2020 , publisher=

work page 2020
[25]

Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

Scalable diffusion models with transformers , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

work page
[26]

Journal of Machine Learning Research , volume=

Cascaded diffusion models for high fidelity image generation , author=. Journal of Machine Learning Research , volume=

work page
[27]

Proceedings of the SIGGRAPH Conference

Scaling StyleGAN to Large Diverse Datasets , author=. Proceedings of the SIGGRAPH Conference. ACM , pages=

work page
[28]

Advances in neural information processing systems , volume=

Neural discrete representation learning , author=. Advances in neural information processing systems , volume=

work page
[29]

Advances in neural information processing systems , volume=

Generating diverse high-fidelity images with vq-vae-2 , author=. Advances in neural information processing systems , volume=

work page
[30]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Autoregressive image generation using residual quantization , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page
[31]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Taming transformers for high-resolution image synthesis , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

work page
[32]

Advances in Neural Information Processing Systems , volume=

Movq: Modulating quantized vectors for high-fidelity image generation , author=. Advances in Neural Information Processing Systems , volume=

work page
[33]

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Sdxl: Improving latent diffusion models for high-resolution image synthesis , author=. arXiv preprint arXiv:2307.01952 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[34]

Generating images with sparse representations

Generating images with sparse representations , author=. arXiv preprint arXiv:2103.03841 , year=

work page arXiv
[35]

Advances in neural information processing systems , volume=

Improved precision and recall metric for assessing generative models , author=. Advances in neural information processing systems , volume=

work page
[36]

Advances in neural information processing systems , volume=

Gans trained by a two time-scale update rule converge to a local nash equilibrium , author=. Advances in neural information processing systems , volume=

work page
[37]

Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

Rethinking the inception architecture for computer vision , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

work page
[38]

Advances in neural information processing systems , volume=

Improved techniques for training gans , author=. Advances in neural information processing systems , volume=

work page
[39]

Open-magvit2: An open-source project toward democratizing auto-regressive visual gener- ation

Open-MAGVIT2: An Open-Source Project Toward Democratizing Auto-regressive Visual Generation , author=. arXiv preprint arXiv:2409.04410 , year=

work page arXiv
[40]

International conference on machine learning , pages=

Zero-shot text-to-image generation , author=. International conference on machine learning , pages=. 2021 , organization=

work page 2021
[41]

science , volume=

Reducing the dimensionality of data with neural networks , author=. science , volume=. 2006 , publisher=

work page 2006
[42]

Hierarchical Text-Conditional Image Generation with CLIP Latents

Hierarchical text-conditional image generation with clip latents , author=. arXiv preprint arXiv:2204.06125 , volume=

work page internal anchor Pith review Pith/arXiv arXiv
[43]

Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

Estimating or propagating gradients through stochastic neurons for conditional computation , author=. arXiv preprint arXiv:1308.3432 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[44]

Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

Image-to-image translation with conditional adversarial networks , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

work page
[45]

International conference on machine learning , pages=

Learning transferable visual models from natural language supervision , author=. International conference on machine learning , pages=. 2021 , organization=

work page 2021
[46]

arXiv preprint arXiv:2406.11837 , year=

Scaling the Codebook Size of VQGAN to 100,000 with a Utilization Rate of 99\ author=. arXiv preprint arXiv:2406.11837 , year=

work page arXiv
[47]

Finite Scalar Quantization: VQ-VAE Made Simple

Finite scalar quantization: Vq-vae made simple , author=. arXiv preprint arXiv:2309.15505 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[48]

arXiv preprint arXiv:2406.07550 , year=

An Image is Worth 32 Tokens for Reconstruction and Generation , author=. arXiv preprint arXiv:2406.07550 , year=

work page arXiv
[49]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Bert: Pre-training of deep bidirectional transformers for language understanding , author=. arXiv preprint arXiv:1810.04805 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[50]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Masked autoencoders are scalable vision learners , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

work page
[51]

Improving language understanding by generative pre-training , author=

work page
[52]

International conference on machine learning , pages=

Generative pretraining from pixels , author=. International conference on machine learning , pages=. 2020 , organization=

work page 2020
[53]

2021 , eprint=

High-Resolution Image Synthesis with Latent Diffusion Models , author=. 2021 , eprint=

work page 2021
[54]

Advances in neural information processing systems , volume=

Conditional image generation with pixelcnn decoders , author=. Advances in neural information processing systems , volume=

work page
[55]

2024.doi:10.48550/arXiv.2404.02905

Visual autoregressive modeling: Scalable image generation via next-scale prediction , author=. arXiv preprint arXiv:2404.02905 , year=

work page arXiv
[56]

Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation , author=. arXiv preprint arXiv:2406.06525 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[57]

Advances in neural information processing systems , volume=

Generative adversarial nets , author=. Advances in neural information processing systems , volume=

work page
[58]

Large Scale GAN Training for High Fidelity Natural Image Synthesis

Large Scale GAN Training for High Fidelity Natural Image Synthesis , author=. arXiv preprint arXiv:1809.11096 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[59]

2009 IEEE conference on computer vision and pattern recognition , pages=

Imagenet: A large-scale hierarchical image database , author=. 2009 IEEE conference on computer vision and pattern recognition , pages=. 2009 , organization=

work page 2009
[60]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

A style-based generator architecture for generative adversarial networks , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

work page
[61]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Scaling up gans for text-to-image synthesis , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page
[62]

Proceedings of the 25th international conference on Machine learning , pages=

Extracting and composing robust features with denoising autoencoders , author=. Proceedings of the 25th international conference on Machine learning , pages=

work page
[63]

Advances in neural information processing systems , volume=

Denoising diffusion probabilistic models , author=. Advances in neural information processing systems , volume=

work page
[64]

Advances in neural information processing systems , volume=

Generative modeling by estimating gradients of the data distribution , author=. Advances in neural information processing systems , volume=

work page
[65]

Denoising Diffusion Implicit Models

Denoising diffusion implicit models , author=. arXiv preprint arXiv:2010.02502 , year=

work page internal anchor Pith review Pith/arXiv arXiv 2010
[66]

Advances in neural information processing systems , volume=

Diffusion models beat gans on image synthesis , author=. Advances in neural information processing systems , volume=

work page
[67]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
