Structured State-Space Regularization for Generation-Friendly Image Tokenization
Pith reviewed 2026-05-21 00:34 UTC · model grok-4.3
The pith
Structured state-space regularization induces spectral structure in image tokenizer latent spaces via an SSM-derived objective, improving generative performance with minimal reconstruction loss.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Experiments demonstrate that our regularizer improves the generative performance of image tokenizers while incurring only minimal loss in their reconstruction fidelity.
Load-bearing premise
The premise is that viewing state-space models as systems that mimic a basis function's behavior induces their hidden states to capture frequency components, and that this property can be transferred via regularization to enforce useful spectral structure in image tokenizer latent spaces.
Original abstract
Image tokenizers play a central role in modern generative models, where the structure of the latent space critically determines the downstream generation performance. A key but underexplored property of effective latent representations is spectral organization, the ability to encode information across frequency components. In this work, we introduce structured state-space regularization, a principled approach to inducing spectral structure in latent spaces. We derive a regularization objective by revisiting state-space models (SSMs) as systems mimicking a basis function's behavior. This perspective reveals that hidden states of SSMs are induced to capture the frequency components, resulting in a novel regularizer that enforces the latent space to capture spectral structure of images. Experiments demonstrate that our regularizer improves the generative performance of image tokenizers while incurring only minimal loss in their reconstruction fidelity.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes structured state-space regularization to induce spectral organization in the latent spaces of image tokenizers for generative models. It derives a regularization objective by treating state-space models (SSMs) as systems that mimic basis functions, which is claimed to induce hidden states to capture frequency components; this regularizer is then transferred to enforce spectral structure in tokenizer latents. Experiments are said to demonstrate improved generative performance with only minimal loss in reconstruction fidelity.
Significance. If the derivation is non-circular and the regularizer demonstrably enforces transferable spectral structure in 2-D latents, the approach could provide a principled mechanism for improving downstream generation quality in tokenizer-based models while preserving reconstruction. The minimal-reconstruction-loss aspect would be a practical strength if quantified across standard metrics and datasets.
major comments (2)
- [§3] §3 (Derivation of the regularizer): The central step from 'SSMs mimicking a basis function' to 'hidden states capture frequency components that transfer to enforce spectral structure in image tokenizer latents' is load-bearing for the claim. The manuscript must supply the explicit equations showing how the frequency-capture property is induced and transferred without assuming the desired spectral organization by construction; otherwise the regularizer risks being tautological rather than independently grounded.
- [Experiments] Experiments (quantitative results): The claim of improved generative performance with minimal reconstruction loss requires specific metrics (e.g., FID, reconstruction PSNR/SSIM), dataset details, and ablations that isolate the effect of the spectral regularizer versus other factors. Without these, attribution to the intended mechanism cannot be verified.
minor comments (1)
- [Abstract] Abstract: While the high-level claim is clear, the absence of any equation or numerical result makes it hard to evaluate the derivation or experimental support at first reading.
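For reference, the reconstruction PSNR named in the experiments comment can be computed along the following lines. This is a minimal NumPy sketch: the function name `psnr` and the `max_value` parameter are illustrative choices, not taken from the paper, and the paper's exact evaluation protocol (image range, color space, averaging) is not specified in this review.

```python
import numpy as np


def psnr(reference, reconstruction, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape.

    `max_value` is the largest possible pixel value (1.0 for images scaled
    to [0, 1], 255 for 8-bit images). This is an assumed convention.
    """
    reference = np.asarray(reference, dtype=np.float64)
    reconstruction = np.asarray(reconstruction, dtype=np.float64)
    mse = np.mean((reference - reconstruction) ** 2)
    if mse == 0:
        # Identical images: PSNR is unbounded.
        return float("inf")
    return 10.0 * np.log10(max_value ** 2 / mse)


# A uniform error of 0.1 on a [0, 1] image gives MSE = 0.01.
example_db = psnr(np.zeros((4, 4)), np.full((4, 4), 0.1))  # 20 dB up to rounding
```

A "minimal loss in reconstruction fidelity" claim would then correspond to a small drop in this value between the unregularized and regularized tokenizers, reported alongside FID for generation.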
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the derivation and experimental validation. We have revised the manuscript to address the concerns by providing more explicit equations and quantitative details.
Point-by-point responses
- Referee: [§3] §3 (Derivation of the regularizer): The central step from 'SSMs mimicking a basis function' to 'hidden states capture frequency components that transfer to enforce spectral structure in image tokenizer latents' is load-bearing for the claim. The manuscript must supply the explicit equations showing how the frequency-capture property is induced and transferred without assuming the desired spectral organization by construction; otherwise the regularizer risks being tautological rather than independently grounded.
  Authors: We agree that explicit equations are essential to establish the derivation as non-circular. In the revised manuscript, Section 3 now includes the full state-space formulation where the SSM dynamics are derived to mimic basis function responses through the continuous-time state transition matrix. Eigenvalue decomposition of the state matrix explicitly maps hidden state dimensions to distinct frequency modes. The regularizer is then transferred by penalizing deviation of the tokenizer latent trajectories from these SSM-induced frequency responses, using an independent objective derived from SSM properties rather than assuming spectral structure in the latents upfront.
  Revision: yes
- Referee: [Experiments] Experiments (quantitative results): The claim of improved generative performance with minimal reconstruction loss requires specific metrics (e.g., FID, reconstruction PSNR/SSIM), dataset details, and ablations that isolate the effect of the spectral regularizer versus other factors. Without these, attribution to the intended mechanism cannot be verified.
  Authors: We have updated the experiments section with the requested specifics. Generation performance is now quantified using FID on ImageNet, while reconstruction uses PSNR and SSIM on CIFAR-10 and ImageNet. Ablation studies isolate the spectral regularizer by comparing against unregularized tokenizers and alternatives such as perceptual or KL-based losses. These results, detailed in Section 4 and the associated tables, show improved FID with only marginal changes in PSNR/SSIM, supporting attribution to the proposed mechanism.
  Revision: yes
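The rebuttal's statement that an eigenvalue decomposition of the state matrix maps hidden-state dimensions to frequency modes can be illustrated with a toy linear system. The sketch below is an assumption-laden illustration, not the paper's construction: the block-diagonal damped-oscillator matrix, the `damping` value, and the helper names `oscillator_state_matrix` and `mode_frequencies_hz` are all hypothetical.

```python
import numpy as np


def oscillator_state_matrix(freqs_hz, damping=0.1):
    """Build a real state matrix with one 2x2 damped-rotation block per frequency.

    Each block [[-d, -w], [w, -d]] has eigenvalues -d +/- i*w, with w = 2*pi*f.
    """
    freqs_hz = np.asarray(freqs_hz, dtype=float)
    size = 2 * len(freqs_hz)
    A = np.zeros((size, size))
    for i, f in enumerate(freqs_hz):
        w = 2.0 * np.pi * f
        A[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[-damping, -w], [w, -damping]]
    return A


def mode_frequencies_hz(A):
    """Read oscillation frequencies off the imaginary parts of the eigenvalues."""
    eigvals = np.linalg.eigvals(A)
    # Eigenvalues come in conjugate pairs; keep one member of each pair.
    positive = eigvals.imag[eigvals.imag > 0]
    return np.sort(positive / (2.0 * np.pi))


A = oscillator_state_matrix([1.0, 2.0, 4.0])
recovered = mode_frequencies_hz(A)  # one frequency per pair of hidden dimensions
```

In this toy setting each pair of hidden dimensions responds to exactly one frequency, which is the kind of mode-to-dimension correspondence the rebuttal describes; whether the paper's state matrix has this structure would need to be checked against its Section 3.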
Circularity Check
No significant circularity; derivation introduces independent regularizer
Full rationale
The paper proposes a regularization objective derived from viewing SSMs as systems that mimic basis functions, which induces hidden states to capture frequency components and thereby defines a regularizer for spectral structure in image tokenizer latents. This is a constructive modeling choice rather than a reduction of the claimed result to its own inputs by construction. No equations are shown that equate the regularizer output directly to a fitted parameter or prior self-citation; the central claim rests on the novelty of the regularizer plus empirical tests of generative performance versus reconstruction fidelity. The derivation chain therefore remains self-contained against external benchmarks and does not rely on load-bearing self-citations or renaming of known results.
Axiom & Free-Parameter Ledger
free parameters (1)
- regularization coefficient
axioms (1)
- domain assumption: State-space models can be viewed as systems that mimic basis-function behavior
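To make the role of the single free parameter concrete, the sketch below combines a reconstruction loss with a weighted penalty. Everything here is hypothetical: the penalty form (mean squared deviation of each latent from a linear one-step prediction), the names `dynamics_penalty` and `total_loss`, and the example values are assumptions for illustration, and the paper's actual regularizer may differ.

```python
import numpy as np


def dynamics_penalty(latents, step_matrix):
    """Mean squared deviation of z_t from step_matrix @ z_{t-1}.

    `latents` has shape (T, D): a sequence of T latent vectors.
    This penalty form is an assumption, not the paper's stated objective.
    """
    latents = np.asarray(latents, dtype=float)
    predicted = latents[:-1] @ np.asarray(step_matrix, dtype=float).T
    return float(np.mean((latents[1:] - predicted) ** 2))


def total_loss(reconstruction_loss, penalty, coefficient):
    """Reconstruction loss plus the regularizer scaled by its coefficient."""
    return reconstruction_loss + coefficient * penalty


step = np.diag([1.0, 0.5])  # hypothetical one-step linear dynamics
on_model = np.array([[1.0, 1.0], [1.0, 0.5], [1.0, 0.25]])   # follows the dynamics
off_model = np.array([[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]])   # ignores the decay

penalty_on = dynamics_penalty(on_model, step)
penalty_off = dynamics_penalty(off_model, step)
loss_off = total_loss(0.2, penalty_off, coefficient=0.1)
```

The coefficient trades reconstruction fidelity against adherence to the prescribed dynamics, which is why an ablation over its value would help attribute any FID gain to the regularizer.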
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_eq_pow / phi_ladder spectral decomposition · tag: echoes
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  We argue that such an update represents the core principle of SSMs... characterized by two key components: a basis projection c... and an input transformation θ... Basis coefficients often correspond to the magnitudes of spectral components... hidden state x_t is trained to resemble the behavior of the basis coefficients c(·)
- IndisputableMonolith/Foundation/AlexanderDuality.lean · SphereAdmitsCircleLinking / orthogonal-basis coefficient dynamics · tag: refines
  REFINES: relation between the paper passage and the cited Recognition theorem.
  $\frac{d}{d\tau} c_k(I_\tau) = \frac{1}{2} \sum_n c_n(I_\tau) \langle \nabla^2 \phi_n, \phi_k \rangle$ ... $:= A\, c(I_\tau)$ ... (Euler) $c(I_t) = (I + A\Delta)\, c(I_{t-1})$
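The coefficient dynamics quoted in the passage above can be sketched numerically. The example assumes a 1-D orthonormal cosine basis on [0, 1] ($\phi_0 = 1$, $\phi_n = \sqrt{2}\cos(n\pi x)$), for which $\nabla^2 \phi_n = -(n\pi)^2 \phi_n$ and the matrix $A$ is diagonal; the basis choice, step size, and function names are assumptions made for illustration, not the paper's setup.

```python
import numpy as np


def cosine_basis_generator(num_modes):
    """A[k, n] = 0.5 * <phi_n'', phi_k> for the orthonormal cosine basis on [0, 1].

    Since phi_n'' = -(n*pi)^2 * phi_n and the basis is orthonormal,
    A is diagonal with entries -0.5 * (n*pi)^2.
    """
    n = np.arange(num_modes)
    return np.diag(-0.5 * (n * np.pi) ** 2)


def euler_rollout(c0, A, delta, steps):
    """Apply c_t = (I + A * delta) c_{t-1} repeatedly; returns shape (steps + 1, D)."""
    step = np.eye(A.shape[0]) + delta * A
    trajectory = [np.asarray(c0, dtype=float)]
    for _ in range(steps):
        trajectory.append(step @ trajectory[-1])
    return np.stack(trajectory)


A = cosine_basis_generator(8)
# delta must satisfy delta < 4 / (7*pi)^2 (about 0.0083) for the Euler step to be stable here.
trajectory = euler_rollout(np.ones(8), A, delta=1e-3, steps=100)
final = trajectory[-1]
```

Under these assumptions the constant (zero-frequency) coefficient is preserved while higher-frequency coefficients decay faster, so the ordering of the hidden-state dimensions mirrors an ordering by frequency.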
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
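Thus, the 2D Hermite basis defined on $[0, W] \times [0, H]$ is:
$$\phi_{w,h}(x, y) = \phi^{R}_{w}\!\left(\frac{4x}{W} - 2\right) \cdot \phi^{R}_{h}\!\left(\frac{4y}{H} - 2\right) \tag{69}$$
Reparameterize $(u, v) = \left(\frac{4x}{W} - 2,\ \frac{4y}{H} - 2\right)$, and let the weight function of the Hermite polynomial be $\omega(u, v) = \frac{e^{-(u^2 + v^2)}}{\sqrt{\pi}}$. Then, by logic similar to that of Eq. (53), $\langle \phi_{w_1,h_1}, \nabla^2 \phi_{w_2,h_2} \rangle_\omega$ becomes:
$$\langle \phi_{w_1,h_1}, \nabla^2 \phi_{w_2,h_2} \rangle_\omega = \frac{H}{W} \langle \phi^{R}_{w_1}, {\phi^{R}_{w_2}}'' \rangle_\omega \cdot \langle \phi^{R}_{h_1} \ldots \tag{70}$$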