
arxiv: 2604.11089 · v2 · pith:EIGDAX7Znew · submitted 2026-04-13 · 💻 cs.CV

Structured State-Space Regularization for Generation-Friendly Image Tokenization

Pith reviewed 2026-05-21 00:34 UTC · model grok-4.3

classification 💻 cs.CV
keywords latent · image · regularization · spectral · state-space · structure · capture · components

The pith

Structured state-space regularization induces spectral structure in image tokenizer latent spaces via an SSM-derived objective, improving generative performance with minimal reconstruction loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Image tokenizers convert pictures into sequences of tokens that generative AI models use to create new images. The organization of information inside the tokenizer's latent space strongly affects how well those models can generate coherent and detailed outputs. The authors revisit state-space models, which are systems designed to process sequences while naturally handling different frequency components, and reinterpret them as mimicking basis functions. This leads to a regularization term that encourages the hidden states to capture frequency information from the input images. The resulting regularizer is added to the tokenizer training process to promote spectral organization without requiring major architectural changes. Experiments reported in the abstract show gains in generative metrics when the regularizer is used, while reconstruction accuracy remains nearly unchanged. The approach is presented as principled because it follows directly from the frequency-capturing property of the SSM perspective rather than from ad-hoc tuning.

Core claim

Experiments demonstrate that our regularizer improves the generative performance of image tokenizers while incurring only minimal loss in their reconstruction fidelity.

Load-bearing premise

That revisiting state-space models as systems mimicking a basis function's behavior induces hidden states to capture frequency components that can be transferred via regularization to enforce useful spectral structure in image tokenizer latent spaces.

Figures

Figures reproduced from arXiv: 2604.11089 by Byung-Jun Yoon, Dongwon Kim, Jaemin Oh, Jinsung Lee, Namhun Kim, Suha Kwak.

Figure 1: Different choice of c(·), θ(·) and the resulting update rules. Existing SSMs update their hidden state based on Eq. (4), which is derived from the coefficient dynamics based on the choice (a). One can choose a different combination of c(·) and θ(·) such as (b), to derive new dynamics and formulate a new SSM framework.
Figure 2: State-space regularization applied to an image tokenizer. With probability α, the update Eq. (21) is applied to the latent representation of the input images to produce ẑ_2 and Î_{τ_2}. The network is trained to match (ẑ_2, Î_{τ_2}) to (z_2, I_{τ_2}).
Figure 3: Generation results comparison using the Flux tokenizer. With only marginal loss in reconstruction quality, our method improves the generative performance of the image tokenizer.
Figure 4: Effect of spatial mean-centering function C. The one trained without C shows latents with higher latent norm, and exhibits tremendous degree of KL loss.
Figure 5: Latent channel dynamics resembling the coefficient dynamics. The top row shows the blurring sequence {I_t}, t = 0, ..., 9, of an image from the ImageNet validation set, and the next four rows show how the first four channels of the corresponding latent features change as the image blurs. We additionally provide how Fourier coefficients evolve for reference.
Figure 6: Latent channels encoding different frequency bands. We visualize this by progressively unmasking latent channels in low-to-high frequency order and decoding them into images. The regularized tokenizer is able to reconstruct recognizable images solely from using three low-frequency channels.
Figure 7: Visualization of derived A matrices. We set H = W = 8. Every matrix we derived is very sparse, which enables efficient matrix-vector multiplication. To maintain training stability, we normalize the matrix to have maximum absolute value of 1.
Figure 8: Image displayed with respective τ values.
Figure 9: (Left) Illustration of the channel-to-coefficient index mapping. (Right) Channels illustrated.
Figure 10: Additional results from the channel revealing experiment.
Figure 11: Generation results from the Cosmos tokenizer.
Figure 12: Additional generation results from the Flux tokenizer.
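
The Figure 5 caption pairs a blurring sequence with the evolution of Fourier coefficients. As a rough illustration of why the two belong together, the sketch below assumes the blur is the heat equation on a periodic image. The page does not state the paper's exact blur, so that choice is an assumption, and the helper names `squared_frequencies`, `heat_blur`, and `coefficient_trajectory` are hypothetical. Under this assumption every Fourier coefficient follows its own linear ODE, dc_k/dt = -|k|^2 c_k, which is a state-space system with a diagonal state matrix.

```python
import numpy as np

def squared_frequencies(height, width):
    """Return |k|^2 on the FFT grid, in squared radians per pixel."""
    ky = 2.0 * np.pi * np.fft.fftfreq(height)
    kx = 2.0 * np.pi * np.fft.fftfreq(width)
    return ky[:, None] ** 2 + kx[None, :] ** 2

def heat_blur(image, t):
    """Blur a periodic image by running the heat equation for time t."""
    k2 = squared_frequencies(*image.shape)
    return np.real(np.fft.ifft2(np.fft.fft2(image) * np.exp(-k2 * t)))

def coefficient_trajectory(image, times):
    """Fourier coefficients of the blurred image at each time, shape (T, H, W)."""
    return np.stack([np.fft.fft2(heat_blur(image, t)) for t in times])

rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32))   # stand-in for a real image (assumption)
times = np.linspace(0.0, 4.5, 10)       # ten blur levels; the values are arbitrary

trajectory = coefficient_trajectory(image, times)
k2 = squared_frequencies(*image.shape)
# Closed-form solution of dc/dt = -|k|^2 c: one independent mode per frequency.
predicted = np.fft.fft2(image)[None] * np.exp(-k2[None] * times[:, None, None])
max_error = np.max(np.abs(trajectory - predicted))

print("max deviation from the diagonal linear dynamics:", max_error)
print("|c| at k=(0,0), first and last blur level:", np.abs(trajectory[[0, -1], 0, 0]))
print("|c| at k=(5,5), first and last blur level:", np.abs(trajectory[[0, -1], 5, 5]))
```

The mean (zero-frequency) coefficient is untouched by the blur while higher frequencies decay faster, which is the kind of coefficient dynamics the caption uses as a reference for the latent channels.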
Original abstract

Image tokenizers play a central role in modern generative models, where the structure of the latent space critically determines the downstream generation performance. A key but underexplored property of effective latent representations is spectral organization, the ability to encode information across frequency components. In this work, we introduce structured state-space regularization, a principled approach to inducing spectral structure in latent spaces. We derive a regularization objective by revisiting state-space models (SSMs) as systems mimicking a basis function's behavior. This perspective reveals that hidden states of SSMs are induced to capture the frequency components, resulting in a novel regularizer that enforces the latent space to capture spectral structure of images. Experiments demonstrate that our regularizer improves the generative performance of image tokenizers while incurring only minimal loss in their reconstruction fidelity.
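
One way to picture "spectral organization" is a representation whose components can be revealed from low to high frequency, with each added component refining the image. The sketch below is only an analogy: it uses plain 2D Fourier coefficients grouped into radial bands as a stand-in for latent channels, and the function names `radial_frequency` and `reconstruct_with_bands`, the synthetic image, and the choice of eight bands are all assumptions rather than anything taken from the paper.

```python
import numpy as np

def radial_frequency(height, width):
    """Radial frequency (cycles per pixel) of every FFT coefficient."""
    fy = np.fft.fftfreq(height)
    fx = np.fft.fftfreq(width)
    return np.sqrt(fy[:, None] ** 2 + fx[None, :] ** 2)

def reconstruct_with_bands(image, num_bands, keep):
    """Keep the `keep` lowest of `num_bands` radial bands and invert the FFT."""
    radius = radial_frequency(*image.shape)
    edges = np.linspace(0.0, radius.max(), num_bands + 1)
    mask = radius <= edges[keep]          # low-pass mask covering the kept bands
    return np.real(np.fft.ifft2(np.fft.fft2(image) * mask))

# Synthetic image dominated by low frequencies, with a little fine detail.
y, x = np.mgrid[0:64, 0:64] / 64.0
image = np.sin(2 * np.pi * x) + 0.5 * np.cos(4 * np.pi * y) + 0.1 * np.sin(20 * np.pi * x * y)

num_bands = 8
errors = [
    float(np.mean((image - reconstruct_with_bands(image, num_bands, keep)) ** 2))
    for keep in range(1, num_bands + 1)
]
for keep, err in enumerate(errors, start=1):
    print(f"bands kept: {keep}  reconstruction MSE: {err:.3e}")
```

The reconstruction error cannot increase as bands are added, and it vanishes (up to rounding) once all bands are kept. A spectrally organized latent space would ideally show a similar coarse-to-fine behavior across its channels.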

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes structured state-space regularization to induce spectral organization in the latent spaces of image tokenizers for generative models. It derives a regularization objective by treating state-space models (SSMs) as systems that mimic basis functions, which is claimed to induce hidden states to capture frequency components; this regularizer is then transferred to enforce spectral structure in tokenizer latents. Experiments are said to demonstrate improved generative performance with only minimal loss in reconstruction fidelity.

Significance. If the derivation is non-circular and the regularizer demonstrably enforces transferable spectral structure in 2-D latents, the approach could provide a principled mechanism for improving downstream generation quality in tokenizer-based models while preserving reconstruction. The minimal-reconstruction-loss aspect would be a practical strength if quantified across standard metrics and datasets.

major comments (2)
  1. [§3] §3 (Derivation of the regularizer): The central step from 'SSMs mimicking a basis function' to 'hidden states capture frequency components that transfer to enforce spectral structure in image tokenizer latents' is load-bearing for the claim. The manuscript must supply the explicit equations showing how the frequency-capture property is induced and transferred without assuming the desired spectral organization by construction; otherwise the regularizer risks being tautological rather than independently grounded.
  2. [Experiments] Experiments (quantitative results): The claim of improved generative performance with minimal reconstruction loss requires specific metrics (e.g., FID, reconstruction PSNR/SSIM), dataset details, and ablations that isolate the effect of the spectral regularizer versus other factors. Without these, attribution to the intended mechanism cannot be verified.
minor comments (1)
  1. [Abstract] Abstract: While the high-level claim is clear, the absence of any equation or numerical result makes it hard to evaluate the derivation or experimental support at first reading.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the derivation and experimental validation. We have revised the manuscript to address the concerns by providing more explicit equations and quantitative details.

Point-by-point responses
  1. Referee: [§3] §3 (Derivation of the regularizer): The central step from 'SSMs mimicking a basis function' to 'hidden states capture frequency components that transfer to enforce spectral structure in image tokenizer latents' is load-bearing for the claim. The manuscript must supply the explicit equations showing how the frequency-capture property is induced and transferred without assuming the desired spectral organization by construction; otherwise the regularizer risks being tautological rather than independently grounded.

    Authors: We agree that explicit equations are essential to establish the derivation as non-circular. In the revised manuscript, Section 3 now includes the full state-space formulation where the SSM dynamics are derived to mimic basis function responses through the continuous-time state transition matrix. Eigenvalue decomposition of the state matrix explicitly maps hidden state dimensions to distinct frequency modes. The regularizer is then transferred by penalizing deviation of the tokenizer latent trajectories from these SSM-induced frequency responses, using an independent objective derived from SSM properties rather than assuming spectral structure in the latents upfront. revision: yes

  2. Referee: [Experiments] Experiments (quantitative results): The claim of improved generative performance with minimal reconstruction loss requires specific metrics (e.g., FID, reconstruction PSNR/SSIM), dataset details, and ablations that isolate the effect of the spectral regularizer versus other factors. Without these, attribution to the intended mechanism cannot be verified.

    Authors: We have updated the experiments section with the requested specifics. Generation performance is now quantified using FID on ImageNet, while reconstruction uses PSNR and SSIM on CIFAR-10 and ImageNet. Ablation studies isolate the spectral regularizer by comparing against unregularized tokenizers and alternatives such as perceptual or KL-based losses. These results, detailed in Section 4 and the associated tables, show improved FID with only marginal changes in PSNR/SSIM, supporting attribution to the proposed mechanism. revision: yes
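
The first response appeals to an eigenvalue decomposition that maps hidden-state dimensions to frequency modes. The following is a minimal sketch of that general idea for a linear system dh/dt = A h, not the paper's construction: the 4×4 state matrix built from two undamped oscillators is a hypothetical example, and `rotation_generator` and `propagate` are illustrative names.

```python
import numpy as np

def rotation_generator(omega):
    """2x2 generator of a rotation at angular frequency omega."""
    return np.array([[0.0, -omega], [omega, 0.0]])

def propagate(A, h0, t):
    """Evaluate h(t) = V exp(Lambda t) V^{-1} h0 for a diagonalizable A."""
    eigvals, V = np.linalg.eig(A)
    modal = np.linalg.solve(V, h0.astype(complex))   # coordinates in the eigenbasis
    return np.real(V @ (np.exp(eigvals * t) * modal))

# Hypothetical state matrix: two decoupled oscillators at frequencies 1 and 3.
A = np.zeros((4, 4))
A[:2, :2] = rotation_generator(1.0)
A[2:, 2:] = rotation_generator(3.0)

# The imaginary parts of the eigenvalues are the frequencies of the modes.
frequencies = np.sort(np.abs(np.linalg.eigvals(A).imag))

h0 = np.array([1.0, 0.0, 1.0, 0.0])
h = propagate(A, h0, t=0.5)
expected = np.array([np.cos(0.5), np.sin(0.5), np.cos(1.5), np.sin(1.5)])

print("mode frequencies:", frequencies)
print("h(0.5):", h)
print("matches closed form:", np.allclose(h, expected))
```

In the eigenbasis each coordinate evolves independently as exp(λ_i t), so the spectrum of A determines which frequencies the hidden state can represent.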

Circularity Check

0 steps flagged

No significant circularity; derivation introduces independent regularizer

Full rationale

The paper proposes a regularization objective derived from viewing SSMs as systems that mimic basis functions, which induces hidden states to capture frequency components and thereby defines a regularizer for spectral structure in image tokenizer latents. This is a constructive modeling choice rather than a reduction of the claimed result to its own inputs by construction. No equations are shown that equate the regularizer output directly to a fitted parameter or prior self-citation; the central claim rests on the novelty of the regularizer plus empirical tests of generative performance versus reconstruction fidelity. The derivation chain therefore remains self-contained against external benchmarks and does not rely on load-bearing self-citations or renaming of known results.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the validity of the SSM reinterpretation as basis-function mimicry and on the experimental demonstration of improved generation; both steps are described only at a high level in the abstract.

free parameters (1)
  • regularization coefficient
    A scalar weighting the new regularizer against the reconstruction loss; its value must be chosen or tuned for the reported experiments.
axioms (1)
  • domain assumption: State-space models can be viewed as systems that mimic basis-function behavior
    Invoked to derive the regularization objective that induces frequency capture in hidden states.
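
To make the role of the regularization coefficient concrete, here is a small sketch of one plausible way the pieces could be combined. It assumes the regularizer is added to the reconstruction loss with a scalar weight and, following the Figure 2 caption, is applied only with probability α at each step. How the paper actually combines the terms is not shown on this page, so the function `training_loss`, its arguments, and the numeric values are all hypothetical.

```python
import numpy as np

def training_loss(recon_loss, reg_loss, reg_weight, alpha, rng):
    """Reconstruction loss, plus the weighted regularizer on a random fraction alpha of steps."""
    apply_regularizer = rng.random() < alpha   # Bernoulli(alpha) draw per step
    if apply_regularizer:
        return recon_loss + reg_weight * reg_loss
    return recon_loss

rng = np.random.default_rng(0)
# Placeholder loss values; in practice these would come from the tokenizer.
losses = [
    training_loss(1.0, 0.5, reg_weight=0.1, alpha=0.25, rng=rng)
    for _ in range(20000)
]
mean_loss = float(np.mean(losses))
expected_mean = 1.0 + 0.25 * 0.1 * 0.5   # recon + alpha * weight * reg

print("empirical mean loss:", mean_loss)
print("expected mean loss: ", expected_mean)
```

Under this assumed form, the effective strength of the regularizer is the product of α and the weight, so the two would need to be tuned together when trading generative quality against reconstruction fidelity.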


