pith. sign in

arxiv: 2605.16384 · v1 · pith:GT4NOEMRnew · submitted 2026-05-11 · 💻 cs.CV · cs.AI

Mutual Enhancement Between Global Tokens and Patch Tokens: From Theory to Practice

Pith reviewed 2026-05-20 22:38 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords adaptive image tokenizationglobal tokenspatch tokensdynamic token filteringcumulative conditional entropydiscrete representationsvision transformer efficiencyinformation density
0
0 comments X

The pith

Global tokens capture mutual information across image patches while dynamic filtering by cumulative conditional entropy removes redundancy for adaptive tokenization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to establish that fixed-rate compression of images into patch tokens either loses essential details or keeps too much duplicate information because real images vary sharply in information density from region to region. It addresses the first problem by adding global tokens that explicitly model the information shared among patches and the second by introducing a filtering step that drops tokens whose removal changes little in the conditional entropy of the remaining set. A reader would care because this variable allocation promises both higher reconstruction accuracy and much lower compute cost when feeding the tokens into sequence models for generation or understanding. If the approach holds, long or high-resolution image sequences become feasible without proportional growth in token budget or processing time.

Core claim

The authors claim that patch tokens alone are informationally insufficient for faithful image reconstruction and that they contain substantial mutual redundancy. Global tokens are added to represent the information common across patches, creating a mutual enhancement effect, while a Dynamic Token Filtering procedure uses cumulative conditional entropy to decide which patch tokens can be safely discarded. The resulting system allocates tokens according to local information richness rather than a uniform rate, producing more compressed yet more accurate discrete representations.

What carries the argument

Global tokens that model mutual information shared among patch tokens, together with Dynamic Token Filtering that prunes patches according to cumulative conditional entropy.

If this is right

  • Token count per image becomes variable and proportional to actual information content instead of fixed.
  • Downstream sequence models receive fewer tokens on average, directly reducing inference latency.
  • Image generation and reconstruction metrics improve because shared context is explicitly preserved by the global tokens.
  • Long image sequences fit within practical context lengths without uniform downsampling.
  • The method supplies a concrete way to trade token budget against reconstruction error on a per-image basis.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same global-plus-filter pattern could be tested on video frames where temporal redundancy varies across scenes.
  • Approximating the conditional entropy step with a lightweight network might allow real-time adaptive tokenization on edge hardware.
  • Token budgets in large vision transformers could be made dynamic at the sequence level rather than fixed per layer.
  • The mutual-enhancement idea suggests exploring bidirectional global-patch interactions inside the encoder itself.

Load-bearing premise

Global tokens must accurately encode the true mutual information present across patch tokens, and the entropy threshold must eliminate only redundant content without discarding task-critical visual details or introducing reconstruction errors.

What would settle it

Apply the tokenization to a test set of images containing fine-grained critical details such as small text or textures, then measure whether the filtered reconstructions show statistically significant drops in pixel-level fidelity or perceptual quality relative to the unfiltered patch-only baseline.

Figures

Figures reproduced from arXiv: 2605.16384 by Jun Zhao, Kang Liu, Xin Jiang, Xiusheng Huang, Yequan Wang.

Figure 1
Figure 1. Figure 1: The comprehensive Speed-Quality Compar￾ison of TaTok and Existing Approaches for ImageNet 256×256 Image Generation. The sampling speed is mea￾sured with an A100 GPU. the quantizer discretizes (Radford et al., 2019; Brown, 2020) for the decoder to reconstruct the image. Since reconstruction is trivial with unre￾stricted sequence length (e.g., retaining all RGB pixels) (Zhang et al., 2022; OpenAI, 2022), a h… view at source ↗
Figure 2
Figure 2. Figure 2: The overall process framework of TaTok. The input image is encoded into patch tokens via a vision [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 4
Figure 4. Figure 4: The experimental results of reconstruction per [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: The ablation study results comparing our [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Visualization results of ablation experiments. [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Visualization results of ablation experiments. [PITH_FULL_IMAGE:figures/full_fig_p018_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Visualization results with three types of semantic-feature-free images as inputs. [PITH_FULL_IMAGE:figures/full_fig_p019_8.png] view at source ↗
read the original abstract

Accurate and effective discrete image tokenization is crucial for long image sequence processing. However, current methods rigidly compress all content at a fixed rate, ignoring the variable information density of images and leading to either redundancy or information loss. Inspired by information entropy, we propose TaTok, a Theoretically grounded adaptive image Tokenization framework. We rigorously identify two key drawbacks in existing methods: information insufficiency when reconstructing images with patch tokens alone, and information redundancy among patch tokens. To address these, we introduce global tokens that model mutual information across patch tokens, and a Dynamic Token Filtering (DTF) algorithm based on cumulative conditional entropy to eliminate redundancy. Experiments confirm TaTok's state-of-the-art performance, delivering a 1.3x gFID improvement and 8.7x inference speedup. By allocating tokens according to information richness, TaTok enables more compressed yet accurate image tokenization, offering valuable insights for future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes TaTok, a theoretically grounded adaptive image tokenization framework. It identifies two drawbacks in existing fixed-rate methods—inadequate information for reconstruction using patch tokens alone and redundancy among patch tokens—drawing on information entropy. The approach introduces global tokens to capture mutual information across patches and a Dynamic Token Filtering (DTF) algorithm that uses cumulative conditional entropy to prune redundant tokens, with experiments claiming state-of-the-art results including a 1.3x gFID improvement and 8.7x inference speedup.

Significance. If the central claims hold, the work could meaningfully advance efficient discrete tokenization for long image sequences by adapting token allocation to variable information density rather than uniform compression. The explicit linkage of global tokens to mutual information and entropy-based filtering offers a principled alternative to rigid patch tokenization, with potential downstream benefits for multimodal models; the reported speedups are practically relevant if reproducible.

major comments (3)
  1. [Theoretical Analysis] Theoretical Analysis section: the manuscript asserts a 'rigorous' identification of information insufficiency and redundancy but supplies no derivations, explicit mutual-information expressions, or proofs that global tokens reliably capture cross-patch dependencies; this is load-bearing for the claim that the method is theoretically grounded rather than heuristic.
  2. [DTF Algorithm] DTF Algorithm subsection: the cumulative conditional entropy estimator is not specified (e.g., whether it re-uses the tokenizer, employs a separate density model, or Monte-Carlo sampling), nor is invariance of the threshold across image distributions demonstrated; without this, the assertion that DTF removes redundancy without discarding task-critical high-frequency content cannot be evaluated and risks artifacts on textured or edge-case images.
  3. [Experimental Results] Experimental Results section: the 1.3x gFID and 8.7x speedup claims lack detailed baselines, dataset statistics, ablation on the entropy threshold, and per-image artifact analysis; this undermines the SOTA conclusion and leaves open the possibility that gains arise from threshold tuning rather than the proposed mutual-enhancement mechanism.
minor comments (2)
  1. [Notation] Ensure all notation for global tokens, patch tokens, and conditional entropy is defined at first use and used consistently.
  2. [Figures] Figure captions should explicitly state the image distributions and token budgets used so readers can assess generalizability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback, which helps strengthen the presentation of our work. We address each major comment point by point below. Revisions will be made to enhance theoretical clarity, algorithmic specification, and experimental reporting while preserving the core contributions of TaTok.

read point-by-point responses
  1. Referee: [Theoretical Analysis] Theoretical Analysis section: the manuscript asserts a 'rigorous' identification of information insufficiency and redundancy but supplies no derivations, explicit mutual-information expressions, or proofs that global tokens reliably capture cross-patch dependencies; this is load-bearing for the claim that the method is theoretically grounded rather than heuristic.

    Authors: The identification of the two drawbacks draws directly from information-theoretic principles: patch tokens alone leave residual mutual information unmodeled (increasing conditional entropy for reconstruction), while fixed-rate allocation creates redundancy when local information density varies. Global tokens are introduced precisely to capture this cross-patch mutual information via joint attention. We acknowledge that the current manuscript presents these arguments at a conceptual level without explicit derivations or proofs. In the revised version we will add a dedicated paragraph in the Theoretical Analysis section containing the relevant mutual-information expressions (e.g., I(G; P) where G denotes global tokens and P the set of patch tokens) together with a short derivation showing how the addition of global tokens strictly reduces the total reconstruction entropy. This will make the theoretical grounding explicit rather than implicit. revision: yes

  2. Referee: [DTF Algorithm] DTF Algorithm subsection: the cumulative conditional entropy estimator is not specified (e.g., whether it re-uses the tokenizer, employs a separate density model, or Monte-Carlo sampling), nor is invariance of the threshold across image distributions demonstrated; without this, the assertion that DTF removes redundancy without discarding task-critical high-frequency content cannot be evaluated and risks artifacts on textured or edge-case images.

    Authors: We agree that the precise implementation of the cumulative conditional entropy estimator must be stated unambiguously. The estimator re-uses the tokenizer’s own next-token probability distribution to compute conditional entropies sequentially; no separate density model or Monte-Carlo sampling is employed. The threshold is chosen as a fixed percentile of the per-image entropy distribution observed on the training set. In the revision we will expand the DTF subsection with the exact algorithmic steps, pseudocode, and a short paragraph analyzing threshold stability across image distributions (natural scenes, textures, and synthetic data). We will also add qualitative reconstruction examples on high-frequency images to demonstrate that critical detail is retained. revision: yes

  3. Referee: [Experimental Results] Experimental Results section: the 1.3x gFID and 8.7x speedup claims lack detailed baselines, dataset statistics, ablation on the entropy threshold, and per-image artifact analysis; this undermines the SOTA conclusion and leaves open the possibility that gains arise from threshold tuning rather than the proposed mutual-enhancement mechanism.

    Authors: The reported gains are measured against standard fixed-rate baselines (VQGAN, VQVAE-2, and recent adaptive tokenizers) on ImageNet and COCO validation sets. We recognize that additional transparency is required. The revised Experimental Results section will include: (i) exact hyper-parameter settings and training details for all baselines, (ii) dataset statistics (image counts, resolution distributions), (iii) a dedicated ablation table varying the entropy threshold and reporting both gFID and inference latency, and (iv) per-image qualitative comparisons highlighting artifact patterns (or their absence) on textured and edge-case images. These additions will allow readers to verify that performance improvements originate from the mutual-enhancement design rather than threshold selection alone. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation remains self-contained

full rationale

The paper frames TaTok as inspired by information entropy and introduces global tokens plus DTF based on cumulative conditional entropy to address identified drawbacks in patch-token methods. No equations or steps are shown that reduce the claimed predictions or performance gains directly to fitted thresholds or self-citations by construction. The central proposal (adaptive allocation via mutual information modeling and entropy-based filtering) retains independent content beyond its inputs, with experiments presented as empirical confirmation rather than definitional. This is the most common honest outcome for a method paper that proposes a new framework without load-bearing self-referential reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The approach rests on standard information-theoretic assumptions about variable image entropy and introduces two new components whose independent validation is not shown in the provided abstract.

axioms (1)
  • domain assumption Images possess variable information density that can be quantified and exploited via concepts from information entropy.
    Explicitly stated as the inspiration for the adaptive tokenization approach.
invented entities (2)
  • Global tokens no independent evidence
    purpose: Model mutual information across patch tokens to address information insufficiency.
    Introduced to fix the claimed drawback of patch tokens alone.
  • Dynamic Token Filtering (DTF) algorithm no independent evidence
    purpose: Eliminate redundancy among patch tokens using cumulative conditional entropy.
    Proposed to fix the claimed redundancy drawback.

pith-pipeline@v0.9.0 · 5692 in / 1271 out tokens · 38989 ms · 2026-05-20T22:38:54.074896+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

105 extracted references · 105 canonical work pages · 25 internal anchors

  1. [1]

    Aho and Jeffrey D

    Alfred V. Aho and Jeffrey D. Ullman , title =. 1972

  2. [2]

    Publications Manual , year = "1983", publisher =

  3. [3]

    Chandra and Dexter C

    Ashok K. Chandra and Dexter C. Kozen and Larry J. Stockmeyer , year = "1981", title =. doi:10.1145/322234.322243

  4. [4]

    Scalable training of

    Andrew, Galen and Gao, Jianfeng , booktitle=. Scalable training of

  5. [5]

    Dan Gusfield , title =. 1997

  6. [6]

    Tetreault , title =

    Mohammad Sadegh Rasooli and Joel R. Tetreault , title =. Computing Research Repository , volume =. 2015 , url =

  7. [7]

    A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =

    Ando, Rie Kubota and Zhang, Tong , Issn =. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =. Journal of Machine Learning Research , Month = dec, Numpages =

  8. [8]

    Advances in Neural Information Processing Systems , year=

    Attention is all you need , author=. Advances in Neural Information Processing Systems , year=

  9. [9]

    Journal of machine learning research , volume=

    Exploring the limits of transfer learning with a unified text-to-text transformer , author=. Journal of machine learning research , volume=

  10. [10]

    OpenAI blog , volume=

    Language models are unsupervised multitask learners , author=. OpenAI blog , volume=

  11. [11]

    Language Models are Few-Shot Learners

    Language models are few-shot learners , author=. arXiv preprint arXiv:2005.14165 , year=

  12. [12]

    OPT: Open Pre-trained Transformer Language Models

    Opt: Open pre-trained transformer language models , author=. arXiv preprint arXiv:2205.01068 , year=

  13. [13]

    Qwen Technical Report

    Qwen technical report , author=. arXiv preprint arXiv:2309.16609 , year=

  14. [14]

    Scaling Laws for Neural Language Models

    Scaling laws for neural language models , author=. arXiv preprint arXiv:2001.08361 , year=

  15. [15]

    Scaling Laws for Autoregressive Generative Modeling

    Scaling laws for autoregressive generative modeling , author=. arXiv preprint arXiv:2010.14701 , year=

  16. [16]

    Vector-quantized Image Modeling with Improved VQGAN

    Vector-quantized image modeling with improved vqgan , author=. arXiv preprint arXiv:2110.04627 , year=

  17. [17]

    Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

    Scaling autoregressive models for content-rich text-to-image generation , author=. arXiv preprint arXiv:2206.10789 , volume=

  18. [18]

    Neurocomputing , volume=

    Roformer: Enhanced transformer with rotary position embedding , author=. Neurocomputing , volume=. 2024 , publisher=

  19. [19]

    Classifier-Free Diffusion Guidance

    Classifier-free diffusion guidance , author=. arXiv preprint arXiv:2207.12598 , year=

  20. [20]

    Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

    Language Model Beats Diffusion--Tokenizer is Key to Visual Generation , author=. arXiv preprint arXiv:2310.05737 , year=

  21. [21]

    LLaMA: Open and Efficient Foundation Language Models

    Llama: Open and efficient foundation language models , author=. arXiv preprint arXiv:2302.13971 , year=

  22. [22]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

    Eyes wide shut? exploring the visual shortcomings of multimodal llms , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

  23. [23]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    An image is worth 16x16 words: Transformers for image recognition at scale , author=. arXiv preprint arXiv:2010.11929 , year=

  24. [24]

    International journal of computer vision , volume=

    The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale , author=. International journal of computer vision , volume=. 2020 , publisher=

  25. [25]

    Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

    Scalable diffusion models with transformers , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

  26. [26]

    Journal of Machine Learning Research , volume=

    Cascaded diffusion models for high fidelity image generation , author=. Journal of Machine Learning Research , volume=

  27. [27]

    Proceedings of the SIGGRAPH Conference

    Scaling StyleGAN to Large Diverse Datasets , author=. Proceedings of the SIGGRAPH Conference. ACM , pages=

  28. [28]

    Advances in neural information processing systems , volume=

    Neural discrete representation learning , author=. Advances in neural information processing systems , volume=

  29. [29]

    Advances in neural information processing systems , volume=

    Generating diverse high-fidelity images with vq-vae-2 , author=. Advances in neural information processing systems , volume=

  30. [30]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

    Autoregressive image generation using residual quantization , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

  31. [31]

    Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

    Taming transformers for high-resolution image synthesis , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

  32. [32]

    Advances in Neural Information Processing Systems , volume=

    Movq: Modulating quantized vectors for high-fidelity image generation , author=. Advances in Neural Information Processing Systems , volume=

  33. [33]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    Sdxl: Improving latent diffusion models for high-resolution image synthesis , author=. arXiv preprint arXiv:2307.01952 , year=

  34. [34]

    Generating images with sparse representations

    Generating images with sparse representations , author=. arXiv preprint arXiv:2103.03841 , year=

  35. [35]

    Advances in neural information processing systems , volume=

    Improved precision and recall metric for assessing generative models , author=. Advances in neural information processing systems , volume=

  36. [36]

    Advances in neural information processing systems , volume=

    Gans trained by a two time-scale update rule converge to a local nash equilibrium , author=. Advances in neural information processing systems , volume=

  37. [37]

    Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

    Rethinking the inception architecture for computer vision , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

  38. [38]

    Advances in neural information processing systems , volume=

    Improved techniques for training gans , author=. Advances in neural information processing systems , volume=

  39. [39]

    Open-magvit2: An open-source project toward democratizing auto-regressive visual gener- ation

    Open-MAGVIT2: An Open-Source Project Toward Democratizing Auto-regressive Visual Generation , author=. arXiv preprint arXiv:2409.04410 , year=

  40. [40]

    International conference on machine learning , pages=

    Zero-shot text-to-image generation , author=. International conference on machine learning , pages=. 2021 , organization=

  41. [41]

    science , volume=

    Reducing the dimensionality of data with neural networks , author=. science , volume=. 2006 , publisher=

  42. [42]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Hierarchical text-conditional image generation with clip latents , author=. arXiv preprint arXiv:2204.06125 , volume=

  43. [43]

    Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

    Estimating or propagating gradients through stochastic neurons for conditional computation , author=. arXiv preprint arXiv:1308.3432 , year=

  44. [44]

    Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

    Image-to-image translation with conditional adversarial networks , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

  45. [45]

    International conference on machine learning , pages=

    Learning transferable visual models from natural language supervision , author=. International conference on machine learning , pages=. 2021 , organization=

  46. [46]

    arXiv preprint arXiv:2406.11837 , year=

    Scaling the Codebook Size of VQGAN to 100,000 with a Utilization Rate of 99\ author=. arXiv preprint arXiv:2406.11837 , year=

  47. [47]

    Finite Scalar Quantization: VQ-VAE Made Simple

    Finite scalar quantization: Vq-vae made simple , author=. arXiv preprint arXiv:2309.15505 , year=

  48. [48]

    arXiv preprint arXiv:2406.07550 , year=

    An Image is Worth 32 Tokens for Reconstruction and Generation , author=. arXiv preprint arXiv:2406.07550 , year=

  49. [49]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Bert: Pre-training of deep bidirectional transformers for language understanding , author=. arXiv preprint arXiv:1810.04805 , year=

  50. [50]

    Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

    Masked autoencoders are scalable vision learners , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

  51. [51]

    Improving language understanding by generative pre-training , author=

  52. [52]

    International conference on machine learning , pages=

    Generative pretraining from pixels , author=. International conference on machine learning , pages=. 2020 , organization=

  53. [53]

    2021 , eprint=

    High-Resolution Image Synthesis with Latent Diffusion Models , author=. 2021 , eprint=

  54. [54]

    Advances in neural information processing systems , volume=

    Conditional image generation with pixelcnn decoders , author=. Advances in neural information processing systems , volume=

  55. [55]

    2024.doi:10.48550/arXiv.2404.02905

    Visual autoregressive modeling: Scalable image generation via next-scale prediction , author=. arXiv preprint arXiv:2404.02905 , year=

  56. [56]

    Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

    Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation , author=. arXiv preprint arXiv:2406.06525 , year=

  57. [57]

    Advances in neural information processing systems , volume=

    Generative adversarial nets , author=. Advances in neural information processing systems , volume=

  58. [58]

    Large Scale GAN Training for High Fidelity Natural Image Synthesis

    Large Scale GAN Training for High Fidelity Natural Image Synthesis , author=. arXiv preprint arXiv:1809.11096 , year=

  59. [59]

    2009 IEEE conference on computer vision and pattern recognition , pages=

    Imagenet: A large-scale hierarchical image database , author=. 2009 IEEE conference on computer vision and pattern recognition , pages=. 2009 , organization=

  60. [60]

    Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

    A style-based generator architecture for generative adversarial networks , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

  61. [61]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

    Scaling up gans for text-to-image synthesis , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

  62. [62]

    Proceedings of the 25th international conference on Machine learning , pages=

    Extracting and composing robust features with denoising autoencoders , author=. Proceedings of the 25th international conference on Machine learning , pages=

  63. [63]

    Advances in neural information processing systems , volume=

    Denoising diffusion probabilistic models , author=. Advances in neural information processing systems , volume=

  64. [64]

    Advances in neural information processing systems , volume=

    Generative modeling by estimating gradients of the data distribution , author=. Advances in neural information processing systems , volume=

  65. [65]

    Denoising Diffusion Implicit Models

    Denoising diffusion implicit models , author=. arXiv preprint arXiv:2010.02502 , year=

  66. [66]

    Advances in neural information processing systems , volume=

    Diffusion models beat gans on image synthesis , author=. Advances in neural information processing systems , volume=

  67. [67]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

    Maskgit: Masked generative image transformer , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

  68. [68]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

    Magvit: Masked generative video transformer , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

  69. [69]

    Auto-Encoding Variational Bayes

    Auto-encoding variational bayes , author=. arXiv preprint arXiv:1312.6114 , year=

  70. [70]

    Proceedings of the International Conference on Computer Vision (ICCV) , year=

    Emerging Properties in Self-Supervised Vision Transformers , author=. Proceedings of the International Conference on Computer Vision (ICCV) , year=

  71. [71]

    DINOv2: Learning Robust Visual Features without Supervision

    DINOv2: Learning Robust Visual Features without Supervision , author=. arXiv:2304.07193 , year=

  72. [72]

    Vision Transformers Need Registers

    Vision Transformers Need Registers , author=. arXiv:2309.16588 , year=

  73. [73]

    Improved Baselines with Visual Instruction Tuning , author=

  74. [74]

    Visual Instruction Tuning , author=

  75. [75]

    Runpei Dong and Chunrui Han and Yuang Peng and Zekun Qi and Zheng Ge and Jinrong Yang and Liang Zhao and Jianjian Sun and Hongyu Zhou and Haoran Wei and Xiangwen Kong and Xiangyu Zhang and Kaisheng Ma and Li Yi , booktitle=. Dream. 2024 , url=

  76. [76]

    2024 , eprint=

    Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs , author=. 2024 , eprint=

  77. [77]

    2023 , eprint=

    CogVLM: Visual Expert for Pretrained Language Models , author=. 2023 , eprint=

  78. [78]

    Label- efficient semantic segmentation with diffusion models.arXiv preprint arXiv:2112.03126,

    Label-efficient semantic segmentation with diffusion models , author=. arXiv preprint arXiv:2112.03126 , year=

  79. [79]

    Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

    Feature pyramid networks for object detection , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

  80. [80]

    Is synthetic data from generative models ready for image recognition? arXiv preprint arXiv:2210.07574, 2022

    Is synthetic data from generative models ready for image recognition? , author=. arXiv preprint arXiv:2210.07574 , year=

Showing first 80 references.