
arxiv: 2606.06624 · v2 · pith:UNIRQA77new · submitted 2026-06-04 · 💻 cs.LG

Principles and Practice of Deep Representation Learning: or a Mathematical Theory of Memory

Pith reviewed 2026-06-28 03:04 UTC · model grok-4.3

classification 💻 cs.LG
keywords deep learning, representation learning, optimization, information theory, neural network architectures, interpretability, memory

The pith

Deep neural network architectures follow from optimization and information theory, reducing design to linear algebra and calculus.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that representation learning, the main driver of deep learning's empirical success, can be understood and derived through principles of optimization and information theory. This would allow the internal mechanisms of large models to be opened up for interpretation, reliability, and control rather than remaining opaque black boxes. A sympathetic reader would care because it promises to replace trial-and-error architecture tuning with predictable derivations from undergraduate mathematics.

Core claim

The mechanisms of large deep networks are understood through representation learning, arguably the single most important factor in their empirical power. The design principles of modern neural network architectures are explained through optimization and information theory, so that, once the principles are introduced, architecture development becomes a set of undergraduate-level linear algebra and calculus exercises.

What carries the argument

Representation learning mechanisms derived from optimization and information theory carry the argument: they supply the core internal operations that explain why deep networks work.

If this is right

  • Architecture development reduces to standard undergraduate linear algebra and calculus exercises.
  • New methods and models become efficient, interpretable, and controllable by design while matching or exceeding black-box performance.
  • Problems in various domains can be solved using these derived principles in more systematic ways.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same principles might be extended to derive architectures for generative models without separate empirical search.
  • If the derivations hold, they could provide a route to provable guarantees on model behavior that current empirical approaches lack.
  • Connections to memory mechanisms in the title suggest potential links to classical theories of storage and retrieval in linear systems.

Load-bearing premise

The empirical power of deep learning is driven primarily by representation learning mechanisms that can be fully derived from optimization and information theory without requiring additional empirical tuning or domain-specific assumptions.

What would settle it

A high-performing deep network architecture that cannot be derived or explained using only optimization and information theory, or whose performance requires extra assumptions outside those frameworks, would falsify the central claim.

Figures

Figures reproduced from arXiv: 2606.06624 by Druv Pai, Peng Wang, Sam Buchanan, Yi Ma.

Figure 1.1: Evolution of phylogenetic intelligence: Memory or knowledge of the…
Figure 1.2: Evolution of life, from the ancestor of all life today (LUCA, last…
Figure 1.3: The development of verbal communication and spoken languages…
Figure 1.4: Norbert Wiener’s book “Cybernetics” (1948) […
Figure 1.5: Pioneers of theoretical and computational foundations for intelli…
Figure 1.6: A two-dimensional subspace in a ten-dimensional ambient space.
Figure 1.7: Famous 17 equations that changed the world. Most of these equa…
Figure 1.8: The evolution of the Universe that is, based on the best physics… low-dimensional structures of data in a high-dimensional space come from…
Figure 1.9: Data distributed on a mixture of (orthogonal) subspaces…
Figure 1.10: An image of random noise (left) versus a noisy image (middle) and…
Figure 1.11: Illustration of properties of a low-dimensional (linear) structure: it…
Figure 1.12: A distribution with two principal components.
Figure 1.13: PCA (left) versus ICA (right). Note that PCA finds the…
Figure 1.14: Geometric interpretation of a score function…
Figure 1.15: The first mathematical model of an artificial neuron (right) that…
Figure 1.16: A network with one hidden layer (left) versus a deep network…
Figure 1.17: The Mark I Perceptron machine developed by Frank Rosenblatt in…
Figure 1.18: Origin of convolutional neural networks: the Neocognitron by Ku…
Figure 1.19: The LeNet-5 convolutional neural network designed by Yann LeCun…
Figure 1.20: Architecture of LeNet [LBD+89] versus AlexNet [KSH12]. … networks such as LeNet showed promising performance on small-scale classification problems like digit recognition, yet their design was largely empirical, the available datasets were tiny, and back-propagation was computationally prohibitive for the hardware of the era. These factors led to waning interest and stagnant progress, with only a handfu…
Figure 1.21: AlphaGo: using deep neural networks to model the optimal policy…
Figure 1.22: Illustration of an iterative denoising and compressing process that,…
Figure 1.23: Comparison of two coding schemes. Imagine the true data distri…
Figure 1.24: Transforming the identified low-dimensional data distribution, here…
Figure 1.25: Illustration of the architecture of an autoencoder.
Figure 1.26: Illustration of a closed-loop transcription. Here we use a mapping…
Figure 1.27: Illustration of a learned informative and structured feature repre…
Figure 2.1: Data on a single low-dimensional subspace (left), say…
Figure 2.2: Images of aligned human faces and handwritten digits. Despite…
Figure 2.3: Geometry of PCA. A data point $x$ (red) is projected onto the one-dimensional learned subspace spanned by the unit basis vector $u_1$ (blue arrow). The projection $UU^\top x = u_1 u_1^\top x$ (green) is the denoised version of $x$ using the low-dimensional structure given by $u_1$, and $\varepsilon$ (brown arrow) represents the projection residual or noise. … codes $\{z_i^\star\}_{i=1}^{N}$ it suffices to solve either of the two following equivalent…
Figure 2.4: Left: features tracked on independently moving objects in a scene. Right: image patches associated with different regions of an image. 2.2 A Mixture of Complete Low-Dimensional Subspaces. As we have seen, low-rank signal models are rich enough to provide a full picture of the interplay between low-dimensionality in data and efficient and scalable computational algorithms for representation and recovery u…
Figure 2.5: Maximizing $\ell^4$ norm or minimizing $\ell^1$ norm promotes sparsity (for vectors on the sphere). … hence $\mathbb{E}[\|z\|_2^2] = d$. This modeling assumption implies that the vector of independent components $z$ is typically very sparse: we calculate $\mathbb{E}[\|z\|_0] = d\theta$, which is small when $\theta$ is inversely proportional to $d$. Remark 2.8 (The Orthogonal Assumption). At first sight, the assumption that the dictionary $U$ is orthogonal m…
Figure 2.6: Comparison of learned dictionary atoms for complete (orthogonal)…
Figure 2.7: The graph of the soft thresholding function…
Figure 3.1: Data $X$ are distributed on a mixture of low-dimensional submanifolds $\cup_j \mathcal{M}_j$ in a very high-dimensional ambient space, say $\mathbb{R}^D$. This is a much more general case than the special cases studied in Chapter 2, where the submanifolds were assumed to be piecewise linear and thus admitted parametric forms and analytical solutions. … from the data distribution, rather than prior knowledge of the exact distribution i…
Figure 3.2: Illustration of an iterative denoising process that, starting from an…
Figure 3.3: Diffusing a mixture of Gaussians $x$. From left to right, we observe the evolution of the density $p_t$ of $x_t$ as $t$ grows from 0 to 10, along with some representative samples (red). The plane is colored by the probability density $p_t$; high-density regions are colored darker blue. We observe that the probability mass becomes less concentrated as $t$ increases, signaling that entropy increases. … where $T \in [0, \infty)$ is…
Figure 3.4: Bayes-optimal denoiser and score of a Gaussian mixture model. In the same setting as…
Figure 3.5: Denoising a low-rank mixture of Gaussians. Each figure shows samples from the true data distribution (gray, orange, red) and samples undergoing the denoising process (3.2.82) (light blue). At top left, the process has just started and the noise is very large. As the process continues, the noise is pushed further toward the support of the low-rank data distribution. Finally, in the bottom right, the gener…
Figure 3.6: The denoising iteration from Example 3.5 is not a contraction mapping. Plot of the contraction coefficient products $\prod_{\ell=1}^{L} c_{\ell,L}$ in the case where the data are in $D = 1$, the time horizon $T = 1$, and the data variance $\lambda_1 = a$, for varying values of $a$. If the denoising iteration were (conceptually similar to) a contraction mapping, these products would go to 0, but instead they converge to non-zero values. Ho…
Figure 3.7: Visualizing $x_T$ versus $\mathcal{N}(0, T^2 I)$. Left: A plot of Gaussian mixture model data $x$. Right: A plot of $x$ as well as $x_T$ and an independent sample of $\mathcal{N}(0, T^2 I)$, for $T = 10$. On the right plot, $x$ is plotted in the same colors as the left; however, samples from $x_T$ and $\mathcal{N}(0, T^2 I)$ are both much larger, on average, than samples from $x$, and so it appears much smaller because of the scaling. Despite this large b…
Figure 3.8: Denoising a mixture of Gaussians using the VP diffusion process. We use the same figure setup and data distribution as…
Figure 3.9: Generated samples (left) from a diffusion model trained on 6,000 samples and (right) their closest points in the training dataset (in the Euclidean norm). We observe that the generated samples are essentially pixel-perfect equivalents to the most similar training data. Both sets of images are from [YCK+23]
Figure 3.10: Memorization versus dataset size for a fixed model, on (left) 4,000 training epochs and (right) 40,000 training epochs with smaller datasets. We observe from the left plot that large enough datasets will never be memorized (e.g., 10,000 to 50,000 samples), whereas small datasets (1,000 to 5,000 samples) will be memorized after enough training. This means that as the dataset becomes larger relative to t…
Figure 3.11: The generalization score (left) and training loss (right) for a fixed model and increasing data, from [ZZL+24]. The generalization score is 1 minus the fraction of samples that are memorized, and the different lines correspond to differently-sized denoiser models (UNet-64 is smaller than UNet-128, which in turn is smaller than UNet-256). The figures show that, as the dataset size becomes larger relative to t…
Figure 3.12: Visualizing the action of a well-trained denoiser as a perturbation of the empirical denoiser at small times, when the memorizing denoiser acts like a projection onto the nearest point in the training set. … because of its equivalence in the situations we had then considered, might still be true: finding the distribution with minimal coding rate is a way to learn the distribution. Supposing for the moment…
Figure 1: Left and Middle: The distribution $\mathcal{D}$ of high-dim data $x \in \mathbb{R}^D$ is supported on a manifold $\mathcal{M}$ and its classes on low-dim submanifolds $\mathcal{M}_j$, we learn a map $f(x, \theta)$ such that $z_i = f(x_i, \theta)$ are on a union of maximally uncorrelated subspaces $\{\mathcal{S}_j\}$. Right: Cosine similarity between learned features by our method for the CIFAR10 training dataset. Each class has 5,000 samples and their features span a subspace of over 1…
Figure 4.2: Eight points observed on a line with different geometric configura…
Figure 4.3: Approximations to the optimal solutions for…
Figure 4.4: The approximation of a low-dimensional distribution by ϵ-balls. We can see that as the ϵ parameter shrinks, the union of ϵ-balls approximates the support of the true distribution (black) increasingly well. Furthermore, the associated denoisers (whose input-output mapping is given by the provided arrows) obtained by approximating the true distribution by a mixture of Gaussians, each with covariance (ϵ²/…
Figure 4.5: Covering the region spanned by the data vectors using…
Figure 4.6: Comparison of two lossy coding schemes for data that are distributed…
Figure 4.7: A number of random samples on a 2D plane. Consider an…
Figure 4.8: Top: 358 noisy samples drawn from two lines and one plane in…
Figure 4.9: Image patches with a size of w × w pixels. … of (low-dimensional) Gaussians. Although this seems somewhat idealistic, the measure and algorithm can already be very useful and even powerful in scenarios when the model is (approximately) valid. For example, a natural image typically consists of multiple regions with nearly homogeneous textures. If we take many small windows from each region, they should rese…
Figure 4.10: Segmentation results based on the clustering algorithm applied to…
Figure 4.11: Identifying a low-dimensional distribution with two subspaces (left)…
Figure 4.12: Evolution of penultimate layer outputs of a VGG13 neural network when trained on the CIFAR10…
Figure 4.13: The distribution $\mathcal{D}$ of high-dimensional data $x \in \mathbb{R}^D$ is supported on a manifold $\mathcal{M}$ and its classes on low-dimensional submanifolds $\mathcal{M}_k$. We aim to learn a mapping $f(x, \theta)$ parameterized by $\theta$ such that $z_i = f(x_i, \theta)$ lie on a union of maximally uncorrelated subspaces $\{\mathcal{S}_k\}$. … low-dimensional submanifold, say $\mathcal{M}_k$ with dimension $d_k \ll D$, and the distribution $\mathcal{D}$ of $x$ is supported on the mixture of those submanifolds, M…
Figure 4.14: Comparison between PCA and LDA. Figures adopted from…
Figure 4.15: After identifying the low-dimensional data distribution, we would…
Figure 4.16: Local optimization landscape: According to Theorem 4.2, the global maximum of the rate reduction objective corresponds to a solution with mutually incoherent subspaces. … s.t. $Z = f(X, \theta)$, $\|Z\Pi_k\|_F^2 = N_k$, $k = 1, \dots, K$, $\Pi \in \Omega$. (4.2.17) Compared to (4.2.15), the formulation here allows for the joint optimization of both the group memberships and the network parameters. In particular, when $\Pi$ is fixed to …
Figure 4.17: Global optimization landscape: According to [LSJ+16; SQW15], Theorems 4.3 and 4.4, both global and local maxima of the (regularized) rate reduction objective correspond to a solution with mutually incoherent subspaces. All other critical points are strict saddle points. … (4.2.22) if and only if (a) it satisfies all the above conditions and $\sum_{k=1}^{K} r_k = \min\{N, d\}$, and (b) for all $k \neq l \in [K]$ satisfying N…
Figure 4.18: Evolution of $R_\epsilon$, $R_\epsilon^c$, $\Delta R_\epsilon$ during the training process…
Figure 4.20: Comparison of the principal components of learned features from MCR² versus those from cross entropy…
Figure 4.21: Cosine similarity between learned features by using the MCR² objective (left) and CE loss (right). …ting. Contrary to supervised learning where the class labels are known, in unsupervised learning, the group memberships of the data samples are unknown. In this case, we can apply certain class of augmentations or transformations, say $\tau$ with a distribution $P_\tau$, to each sample. For example, for images, aug…
Figure 8: Visualization of principal components learned for class 2-‘Bird’ and class 8-‘Ship’. For each class $j$, we first compute the top-10 singular vectors of the SVD of the learned features $Z_j$. Then for the $l$-th singular vector of class $j$, $u_j^l$, and for the feature of the $i$-th image of class $j$, $z_j^i$, we calculate the absolute value of inner product, $|\langle z_j^i, u_j^l \rangle|$, then we select the top-10 images according …
Figure 9: Visualization of top-10 “principal” images for each class in the CIFAR10 dataset. (a) For each class $j$, we first compute the top-10 singular vectors of the SVD of the learned features $Z_j$. Then for the $l$-th singular vector of class $j$, $u_j^l$, and for the feature of the $i$-th image of class $j$, $z_j^i$, we calculate the absolute value of inner product, $|\langle z_j^i, u_j^l \rangle|$, then we select the largest one for each si…
Figure 4.23: Evolution of the rates of MCR² in the training process for unsupervised learning on CIFAR-10…
Figure 4.24: Comparison of clustering performance accuracy between training…
Figure 4.25: Visualization of unsupervised segmentation results from DINO ViT-B/16 (row 1), SimDINO ViT-B/16 (row 2) and SimDINO ViT-L/16 (row 3). 4.4 Summary and Notes. Key messages. In this Chapter, we have studied basic concepts and ideas behind how to learn a distribution from finite samples and obtain a computable encoding and decoding scheme for the distribution. For distributions that are continuous and possib…
Figure 5.1: Incremental deformation via gradient flow to both flatten data of…
Figure 5.2: Interpretation of $C_k^\ell$ and $E^\ell$: $C_k^\ell$ compresses each class by contracting the features to a low-dimensional subspace; $E^\ell$ expands all features by contrasting and repelling features across different classes. … on features from the $k$-class and aims to compress them to reduce the coding rate of each class. Then the complete gradient $\frac{\partial \Delta R_\epsilon}{\partial Z}(Z^\ell) \in \mathbb{R}^{d \times N}$ is of the form: $\frac{\partial \Delta R_\epsilon}{\partial Z}(Z^\ell) = \underbrace{E^\ell}_{\text{Expansion}}$ …
Figure 5.3: Network Architectures of the ReduNet and comparison with others. (a): Layer structure of the ReduNet derived from one iteration of gradient ascent for optimizing rate reduction. (b) (left): A layer of ResNet [HZR+16b]; and (b) (right): A layer of ResNeXt [XGD+17]. As we will see in Section 5.1.2, the linear operators $E^\ell$ and $C_k^\ell$ of the ReduNet naturally become (multi-channel) convolutions when shift-in…
Figure 5.4: Left: a mixture of experts (MoE) deep network […
Figure 5.5: Original samples and learned representations for 3D Mixture of Gaussians. We visualize data points $X$ (before mapping $f(\cdot, \theta)$) in (a) and learned features $Z$ (after mapping $f(\cdot, \theta)$) in (b) by scatter plot. In each scatter plot, each color represents one class of samples. In (c), we also show the plots for the progression of values of the objective functions. … instance, shifting an object slightly to the rig…
Figure 5.6: Illustration of the sought representation that is equivariant/invariant…
Figure 5.7: Each input signal $x$ (an image here) can be represented as a superposition of sparse convolutions with multiple kernels $d_c$ in a dictionary $D$. … $\bar{\Sigma}(\bar{z}) \doteq \begin{bmatrix} \mathrm{circ}(z[1]) \\ \vdots \\ \mathrm{circ}(z[C]) \end{bmatrix}$
Figure 5.8: Estimate the sparse code $\bar{z}$ of an input signal $x$ (an image here) by taking convolutions with multiple kernels $k_c$ and then sparsifying. The ReduNet constructed from circulant version of these multi-channel features $\bar{Z} \doteq [\bar{z}_1, \dots, \bar{z}_N] \in \mathbb{R}^{C \times d \times N}$, i.e., $\mathrm{circ}(\bar{Z}) \doteq [\mathrm{circ}(\bar{z}_1), \dots, \mathrm{circ}(\bar{z}_N)] \in \mathbb{R}^{dC \times dN}$, retains the good invariance properties described above: the linear operators, now denoted as…
Figure 5.9: The overall process for classifying multi-class signals with shift in…
Figure 5.10: Examples of rotated images of MNIST digits, each by 18°. (Left) Diagram for polar coordinate representation; (Right) Rotated images of digit ‘0’ and ‘1’. Example 5.2 (Invariant Classification of Digits). We next provide an empirical performance of the ReduNet on learning rotation invariant features on the real 10-class MNIST dataset. We impose a polar grid on the image $x \in \mathbb{R}^{H \times W}$, with its geometric ce…
Figure 5.11: (a)(b) are heatmaps of cosine similarity among rotated training data $X_{\text{rotation}}$ and learned features $\bar{Z}_{\text{rotation}}$ for rotation invariance. (d) visualizes the training/val MCR² losses across layers. 5.2 White-Box Transformers from Unrolled Optimization. As we have seen in the previous section, we use the problem of classification to provide a rigorous interpretation for main architectural characteristics o…
Figure 5.12: Comparison of three sets of representations via rate reduction and sparsity. Each $S_i$ represents one linear subspace, and the number of blue balls represents the difference between the coding rates $\Delta R_\epsilon(Z \mid U_{[K]}) = R_\epsilon(Z) - R_\epsilon^c(Z \mid U_{[K]})$. … will see in…
Figure 5.13: One layer of the CRATE encoder architecture. The full architecture is simply a concatenation of such layers, with some initial tokenizer, pre-processing head, and final task-specific head (i.e., a classification head). … In our implementation, we also add a non-negative constraint to $Z^{\ell+1}$, and solve the corresponding non-negative LASSO: $Z^{\ell+1} \approx \arg\min_{Z \ge 0} \left[ \lambda \|Z\|_1 + \frac{1}{2} \|Z^{\ell+1/2} - D^\ell Z\|_F^2 \right]$. (5.2.19) T…
Figure 5.14: The ‘main loop’ of the CRATE white-box deep network design. After encoding input data as a sequence of tokens $Z^0$, CRATE constructs a deep network that transforms the data to a canonical configuration of low-dimensional subspaces by successive compression against a local model for the distribution, generating $Z^{\ell+1/2}$, and sparsification against a global dictionary, generating $Z^{\ell+1}$. Repeatedly stacking …
Figure 5.15: The roles of forward pass and backward propagation in CRATE. (a) Given fixed subspaces and dictionaries $\{(U_{[K]}^\ell, D^\ell)\}_{\ell=1}^{L}$, each layer incrementally optimizes the sparse rate reduction of the representations in the forward pass; (b) Backpropagation learns subspaces and dictionaries $\{(U_{[K]}^\ell, D^\ell)\}_{\ell=1}^{L}$ from training data. …bution of its input, transforms this distribution towards a more parsimoni…
Figure 5.16: Optimization of compression and sparsity measures through the forward pass of CRATE. We measure the compression $R_\epsilon^c(Z^{\ell+1/2})$ (a) and sparsity $\|Z^{\ell+1}\|_1$ (b) at the output of each layer. We observe that both measures decrease over the course of the forward pass, as predicted by theory, and almost every layer monotonically decreases each measure. (The exception is the last layer’s sparsity, since dense f…
Figure 5.17: Visualizing the attention maps of CRATE: what does the model look at to make a classification? We visualize the last-layer attention maps, which are the last source of token-to-token interaction in the network, and therefore in some sense the most refined description of what sections of the input image the model uses to make a prediction. We observe an extremely interesting behavior: when trained on pur…
Figure 5.18: Denoising performance of the attention-only transformer. Here, we sample initial token representations from a mixture of low-rank Gaussians in Equation (5.3.1). Then, we apply (5.3.2) to update token representations and report the SNR at each layer. … known subsets $C_1, \dots, C_K$, such that: $z_i = \underbrace{U_k a_i}_{\text{signal}} + \underbrace{\sum_{j \neq k}^{K} U_j e_{i,j}}_{\text{noise}}, \ \forall i \in C_k$, (5.3.1) where $a_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, I_{p_k})$ and $e_{i,j}$ …
Figure 5.19: Details of the attention-only transformer architecture. Each layer consists of the MSSA operator and a skip connection. In addition, LayerNorm is included only for language tasks. In practice, backpropagation is applied to train the model parameters using training samples. … signal-to-noise ratio (SNR) for each block of the token representations at the $\ell$-th layer as follows: $\mathrm{SNR}(Z_k^\ell) \doteq \|U_k U_k^T Z_k^\ell\|_F$…
Figure 5.20: One layer $\ell$ of the proposed Token Statistics Transformer (ToST). Notably, the self-attention of ToST transforms tokens $Z^\ell$ efficiently to $Z^{\ell+1}$, via multiplying each row of the projected token by only a scalar. This leads to reduced complexity of the attention: it has $O(p)$ space and $O(pn)$ time complexity, where $p$ is the dimension of the projected tokens of each head, and $n$ is the number of tokens.
Figure 5.21: Time and memory complexity of ToST in practice compared to empirically designed ViT (a) and GPT-2 (b). We observe that ToST only requires linear time and memory, in comparison to quadratic costs of traditional transformers. … Transformer (ToST), visualized in…
Figure 6.1: Learn a good auto-encoding representation for a general distribution…
Figure 6.2: Illustration of a typical autoencoder such as PCA, seeking a low…
Figure 6.3: A depiction of interpolation through manifold flattening on a man…
Figure 6.4: Nonlinear PCA by autoassociative neural networks of depth two for…
Figure 6.5: A depiction of the construction process of the flattening and recon…
Figure 6.6: Illustration of a sparse autoencoder (SAE), compared to that of a…
Figure 6.7: Diagram of the Latent Diffusion Model from the work of […
Figure 6.8: Illustration of learning an autoencoding representation…
Figure 6.9: A Closed-loop Transcription. The encoder $f$ has dual roles: it learns a representation $z$ for the data $x$ via maximizing the rate reduction of $z$ and it is also a “feedback sensor” for any discrepancy between the data $x$ and the decoded $\hat{x}$. The decoder $g$ also has dual roles: it is a “controller” that corrects the discrepancy between $x$ and $\hat{x}$ and it also aims to minimize the overall coding rate for the learned…
Figure 6.10: Embeddings of low-dim submanifolds in a high-dim space. $S_x$ (blue) is the submanifold for the original data $x$; $S_z$ (red) is the image of $S_x$ under the mapping $f$, representing the learned feature $z$; and the green curve is the image of the feature $z$ under the decoding mapping $g$. … (6.2.15) and (6.2.12), a closed-loop notion of “distance” between $X$ and $\hat{X}$ can be computed as an equilibrium point to the following…
Figure 6.11: Visualizing the alignment between $Z$ and $\hat{Z}$: $|Z^\top \hat{Z}|$ in the feature space for (a) MNIST, (b) CIFAR-10, and (c) ImageNet-10-Class. … an error-reducing controller. The remaining question is whether the above framework can indeed learn a good (autoencoding) representation of a given dataset? Before we give some formal theoretical justification (in the next subsection), we present some empirical results. Visua…
Figure 6.12
Figure 6.12. Figure 6.12: Visualizing the auto-encoding property of the learned closed-loop [PITH_FULL_IMAGE:figures/full_fig_p270_6_12.png] view at source ↗
Figure 6.13
Figure 6.13. Figure 6.13: Overall framework of our closed-loop transcription-based incre￾mental learning for a structured LDR memory. Only a single, entirely self￾contained, encoding-decoding network is needed: for a new data class Xnew, a new LDR memory Znew is incrementally learned as a minimax game be￾tween the encoder and decoder subject to the constraint that old memory of past classes Zold is intact through the closed-loop… view at source ↗
Figure 6.14
Figure 6.14. Figure 6.14: Visualizing the auto-encoding property of the learned representation (Xˆ = g ◦ f(X)) [PITH_FULL_IMAGE:figures/full_fig_p276_6_14.png] view at source ↗
Figure 6.15
Figure 6.15. Figure 6.15: Block diagonal structure of |Z ⊤Z| in the feature space for MNIST (left) and CIFAR-10 (right). textures. For a simpler dataset like MNIST, the replayed Xˆ are almost identi￾cal to the input X! This is rather remarkable given: (1) our method does not explicitly enforce xˆ ≈ x for individual samples as most autoencoding methods do, and (2) after having incrementally learned all classes, the generator has … view at source ↗
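The block-diagonal pattern named in the Figure 6.15 caption can be illustrated with a small, purely hypothetical example. The sketch below is not the authors' code: the function name abs_gram and the toy feature vectors are assumptions chosen so that two classes lie in mutually orthogonal subspaces, which is the situation in which |Z⊤Z| has zero cross-class entries.

```python
def abs_gram(features):
    """Return |Z^T Z|, where each feature vector is one column of Z."""
    return [
        [abs(sum(a * b for a, b in zip(zi, zj))) for zj in features]
        for zi in features
    ]

# Hypothetical unit-norm features: class A uses coordinates 0-1,
# class B uses coordinates 2-3, so the two subspaces are orthogonal.
class_a = [[1.0, 0.0, 0.0, 0.0], [0.6, 0.8, 0.0, 0.0]]
class_b = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.6, 0.8]]

gram = abs_gram(class_a + class_b)
for row in gram:
    print(["%.2f" % value for value in row])
```

With samples ordered by class, the within-class entries form the diagonal blocks and every cross-class entry is zero. Real learned features would only approximate this structure.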
Figure 6.16: Visualization of 5 reconstructed x̂ = g(z) from z’s with the closest distance to (top-4) principal components of learned features for MNIST (class ‘4’ and class ‘7’) and CIFAR-10 (class ‘horse’ and ‘ship’). Replay images of samples from principal components. Since features of each class can be modeled as a principal subspace, we further visualize the individual principal components within each of those…
Figure 6.17: Visualization of replayed images x̂_old of class 1-‘airplane’ in CIFAR-10, before (left) and after (right) one reviewing cycle. Sample-wise constraints for unsupervised transcription. To improve discriminative and generative properties of representations learned in the unsupervised setting, we propose two additional mechanisms for the above CTRL-Binary maximin game (6.3.6). For simplicity and uniformity…
Figure 6.18: Overall framework of closed-loop transcription for unsupervised learning. Two additional constraints are imposed on the CTRL-Binary method: 1) self-consistency for sample-wise features z^i and ẑ^i, say z^i ≈ ẑ^i; and 2) invariance/similarity among features of augmented samples z^i and z_a^i, say z^i ≈ z_a^i = f(τ(x^i), θ), where x_a^i = τ(x^i) is an augmentation of sample x^i via some transform…
Figure 6.19: Emergence of block-diagonal structures of |Z⊤Z| in the feature space for CIFAR-10. max_θ R_ϵ(Z) + ΔR_ϵ(Z, Ẑ) − λ1 Σ_{i∈N} ΔR_ϵ(z^i, z_a^i) − λ2 Σ_{i∈N} ΔR_ϵ(z^i, ẑ^i) (6.3.10); min_η R_ϵ(Z) + ΔR_ϵ(Z, Ẑ) + λ1 Σ_{i∈N} ΔR_ϵ(z^i, z_a^i) + λ2 Σ_{i∈N} ΔR_ϵ(z^i, ẑ^i) (6.3.11), where the constraints Σ_{i∈N} ΔR_ϵ(z^i, ẑ^i) = 0 and Σ_{i∈N} ΔR_ϵ(z^i, z_a^i) = 0 in (6.3.9) have been converted (and relaxed) to Lagrangian…
Figure 6.20: t-SNE visualizations of learned features of CIFAR-10 with different models. …tion can be very useful for generative purposes. For example, we can organize the sample features into meaningful clusters, and model them with low-dimensional (Gaussian) distributions or subspaces. By sampling from these compact models, we can conditionally regenerate meaningful samples from computed clusters. This is known as…
Figure 6.21: Unsupervised conditional image generation from each cluster of CIFAR-10, using u-CTRL. Images from different rows mean generation from different principal components of each cluster. … (such as discriminative classification, with AlexNet [KSH12], or generative modeling with GPT architectures [BMR+20]). Works we have featured throughout the chapter, especially the work of [HS06], served as catalysts of re…
Figure 7.1: Inference with low-dimensional distributions. This is the generic picture for this chapter: we have a low-dimensional distribution for x ∈ R^D (here depicted as a union of two 2-dimensional manifolds in R^3) and a measurement model y = h(x) + w ∈ R^d. We want to infer various things about this model, including the conditional distribution of x given y, or the conditional expectation E[x | y], given var…
Figure 7.2: Left: image completion. Right: text prediction. In particular, text…
Figure 7.3: Comparison of three different surrogates for an estimate of…
Figure 7.4: Illustration of the continuation process of enforcing the constraints…
Figure 7.5: Illustration of completing an image as a low-rank matrix with some…
Figure 7.6: Diagram of the overall (masked) autoencoding process. The (image) token representations are transformed iteratively towards a parsimonious (e.g., compressed and sparse) representation by each encoder layer f^ℓ. Furthermore, such representations are transformed back to the original image by the decoder layers g^ℓ. Each encoder layer f^ℓ is meant to be (partially) inverted by a corresponding decoder laye…
Figure 7.7: Diagram of each encoder layer (top) and decoder layer (bottom). Notice that the two layers are highly anti-parallel: each is constructed to do the operations of the other in reverse order. That is, in the decoder layer g^ℓ, the ISTA block of f^{L−ℓ} is partially inverted first using a linear layer, then the MSSA block of f^{L−ℓ} is reversed; this order unravels the transformation done in f^{L−ℓ}. Masked ViT…
Figure 7.8: Autoencoding visualizations of CRATE-Base and ViT-MAE-Base [HCX+22] with 75% patches masked. We observe that the reconstructions from CRATE-Base are on par with the reconstructions from ViT-MAE-Base, despite using < 1/3 of the parameters. … try to estimate the masked part X_m = P_{Ω^c}(X). For realizations (Ξ_v, Ξ_m) of the random variable X = (X_v, X_m), let p_{X_m|X_v}(Ξ_m | Ξ_v) be the conditional distribution of X_m …
Figure 7.9: Sampling visualizations from models trained via ambient diffusion [DSD+23b] with 80% of the pixels masked. Using a similar ratio of masked pixels as in…
Figure 7.10: Statistical dependency diagrams for the conditional sampling pro…
Figure 7.11: Numerical simulation of the conditional sampling setup (…
Figure 7.12: Simplified overview of magnetic resonance imaging (MRI) recon…
Figure 7.13: Visual comparison of MRI reconstruction methods. Three example brain MRI reconstructions showing FISTA-TV, the measurement consistency diffusion approach of Song et al. [SSX+22], and ground truth images on the BraTS MRI reconstruction dataset at 8x undersampling. The diffusion-based approach produces reconstructions that more closely match the ground truth structure compared to FISTA-TV, which can be s…
Figure 7.14: A high-level schematic of training and applying a text-to-image…
Figure 7.15: Relationship between a 3D object/scene and its 2D projections.
Figure 7.16: Inference with distributed measurements. We have a low-dimensional distribution x (here, similarly to…
Figure 8.1: A diagram of the encoder pipeline. Data X ∈ D is fed through the embedding f_θ^emb to get a sequence in (R^d)^*. The embedding is fed through a backbone f_θ^bb to get features Z_θ(X) for each token. We can extract an aggregate feature z_θ(X) using the extraction map f_θ^ext. Finally, to use the aggregate feature in downstream tasks, we can use the task-specific head h_θ.
Figure 8.2: A diagram of the autoencoder pipeline. Data X ∈ D is fed through the embedding f_θ^emb to get a sequence in (R^d)^*. The embedding is fed through an encoder backbone f_θ^bb to get features Z_θ(X) for each token. To decode Z_θ(X), we pass it through a decoder backbone g_η^bb. To map the decoder backbone output back to data space D, we use an unembedding layer g_η^unemb, overall obtaining a reconstructio…
Figure 8.3: Images from ImageNet-1K (left) and CIFAR-10 (right). Notice that the CIFAR-10 images are much lower resolution, generally speaking, reducing the complexity of learning that distribution. For training, we will use the ImageNet-1K and ImageNet-21K datasets. Each sample in the dataset is an RGB image, of varying resolution, and a label indicating the object or scene that the image contains (i.e., the class…
Figure 8.4: Local and global views in DINO. Local views and global views take a rectangular crop of the input image and resize it to a square shape, which is then input into the network for processing. … p = q, and it makes p and q have minimal entropy (i.e., vectors with 1 in one component and 0 elsewhere; these are called one-hot vectors). Overall, the goal of this objective is not just to match p and q but also to…
Figure 8.5: An example of an image turned into 5×5 square patches, which are placed in raster order. Each patch is of the same size, and the grid of patches is of shape (N_H, N_W) = (5, 5). The grid of patches is then unrolled into a sequence of length 5 × 5 = 25 in raster order. Before, collapse was avoided by using tricks to update µ and τ. In our simplification, if we compare the features within the representatio…
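The patching step described in the Figure 8.5 caption (cut the image into a 5×5 grid of equal patches, then unroll the grid in raster order into a sequence of 25) can be sketched as follows. This is a minimal illustration, not the book's implementation: the function name patchify, the 10×10 single-channel toy image, and the 2×2 patch size are all assumptions made for the example.

```python
def patchify(image, patch_h, patch_w):
    """Split a 2D grid (a list of rows) into patches listed in raster order."""
    height, width = len(image), len(image[0])
    if height % patch_h or width % patch_w:
        raise ValueError("image size must be divisible by the patch size")
    n_h, n_w = height // patch_h, width // patch_w
    patches = []
    for i in range(n_h):        # patch rows, top to bottom
        for j in range(n_w):    # patch columns, left to right
            # Each patch is itself unrolled row by row into a flat list.
            patch = [
                image[i * patch_h + r][j * patch_w + c]
                for r in range(patch_h)
                for c in range(patch_w)
            ]
            patches.append(patch)
    return (n_h, n_w), patches

# Hypothetical 10x10 image whose pixel value encodes its (row, column) position.
image = [[10 * r + c for c in range(10)] for r in range(10)]
grid_shape, patches = patchify(image, 2, 2)
print(grid_shape, len(patches))
```

In a transformer pipeline each flat patch would then be linearly projected into the feature space; that projection is outside the scope of this sketch.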
Figure 8.6: The transformer embedding pipeline. Given a sequence of unrolled patches in raster order Xpatch, each unrolled patch is linearly projected into the feature space, and equipped with an (additive) positional encoding and an additional token known as the class token. The output is the first-layer-input feature Z_θ^1(X) = f_θ^emb(X). 1. First, we turn the image data X into a sequence of patches of shape …
Figure 8.7: One layer f_θ^ℓ of the transformer backbone. The input features go through layer-normalization, multi-head self-attention, and multi-layer perceptron blocks in sequence to form the output features of the layer. … during training, and interpolate to get the positional encodings for smaller-sized inputs. Thus, in the end we have f_θ^emb(X) ≐ …
Figure 8.8: The DINO pipeline. Student features and teacher features are computed for each input. The objective attempts to align the student features with the teacher features by projecting both sets of features into a high-dimensional probability simplex and computing a cross-entropy loss. Notably, because of the “stop-grad”, the gradient is only computed w.r.t. the student parameters’ outputs. … where mean(z) = 1 …
Figure 8.9: The SimDINO pipeline. Here, in contrast to the DINO pipeline in…
Figure 8.10: A qualitative comparison of saliency maps generated by DINO (middle row) and by SimDINO (bottom row). For each image, we compute and display the average saliency map in the last layer L. The saliency maps are similar across models, meaning that all models converge to a similar notion of what objects are important. Note that although Xeval is a square image, it is interpolated back into rectangular shape…
Figure 8.11: The original captions (top) and their negative counterparts…
Figure 8.12: Interpretable saliency maps in CRATE with patch size 8. When given images with similar properties (perhaps but not necessarily from the same class), the saliency maps corresponding to different attention heads in the last layer each highlight a specific property. One can observe that the average saliency map (not included) then highlights all relevant objects in the image, showing that it uses all fine-…
Figure 8.13: One layer of the encoder and decoder in a CRATE autoencoder backbone. The encoder and decoder layers both feed their inputs through multi-head subspace self-attention and a dictionary learning or dictionary encoding step. Note that the encoder and decoder layers are symmetrically designed; the conceptual goal of each decoder layer is to invert an encoder layer, so this symmetry is very much by design (s…
Figure 8.14: Saliency maps of CRATE-MAE. Each pair of images consists of the original image (left) and a selected saliency map (right) corresponding to an attention head in the last layer. As is usual for CRATE models, but unusual for general transformer-like models, the saliency maps correspond to the objects in the input image. … similar parameter counts, and also that the feature learning performance (as measured…
Figure 8.15: RAE Reconstruction examples. From left to right: input image, RAE (DINOv2-B), RAE (Siglp-B), RAE (MAE-B), SD-VAE. Zoom in for details. … fine-grained textures, object boundaries, and semantic details as faithfully as the SD-VAE baseline, consistent with its better rFID scores. 8.6.3 Sampling from Learned Representations via Denoising. Once we have learned a consistent and structured representation of the …
Figure 8.16: The three-stage inference pipeline of Stable Diffusion. (1) Text Encoder (e.g., CLIP) converts the input prompt c into semantic embeddings. (2) Generation Model (a time-conditional U-Net ϵ_θ) iteratively denoises a random latent tensor z_T to produce a clean intermediate representation ẑ, conditioned on the text embeddings via cross-attention. (3) Decoder (from a VAE) maps the denoised latent ẑ back t…
Figure 8.17: ControlNet architecture pipeline. A trainable copy of the Stable Diffusion (SD) U-Net encoder and middle blocks, conditioned on a hint image via the Hint Encoder E_hint, injects feature maps into the frozen SD U-Net decoder through zero convolutions to enable spatial control. (Image example from https://github.com/huggingface/diffusers/blob/main/docs/source/en/using-diffusers/controlnet.md). … model from …
Figure 8.18: The training and inference pipeline of DreamBooth. During subject-driven finetuning, the text-to-image model is updated using a combined loss: a reconstruction loss to learn the specific subject (e.g., “[V] cat”) and a prior preservation loss to maintain generic class knowledge (e.g., “a cat”). At inference, the personalized model can generate the specific subject in novel contexts (e.g., “laying on a w…
Figure 8.19: The Gen4Gen pipeline for multi-concept personalization. To overcome the limitations of existing datasets in multi-subject scenarios, Gen4Gen [YCH+24] automates the creation of a high-quality benchmark (MyCanvas). Training on this data significantly boosts the model’s ability to generate complex scenes with multiple specific subjects (e.g., “a [V1] cat and a [V2] dog”). 8.7.4 Extension to Video Gener…
Figure 8.20: Illustrations of different 3D representations for Stanford Bunny. (a) the point cloud; (b) the voxel; (c) the polygon mesh; (d) the implicit function. … provides a brief introduction to 3D data. Readers wishing to learn more details are advised to consult the sources via the cited references. 3D data to which we refer here mainly mean continuous real physical 3D shapes that are typically digitized into di…
Figure 8.21: Alignment-before-generation pipeline. The approach contains two models: the Shape-Image-Text-Aligned Variational Auto-Encoder (SITA-VAE) and the Aligned Shape Latent Diffusion Model (ASLDM). The SITA-VAE consists of four modules: an image encoder, a text encoder, a 3D shape encoder, and a 3D shape decoder. Encoders encode input pairs into an aligned space, and the 3D shape decoder reconstructs 3D shape…
Figure 8.22: Visual results for image-conditioned generation comparison. The figure shows that 3DILG [ZNW22b] generates over-smooth surfaces and lacks details of shapes, whereas 3DS2V [ZTN+23] generates few details with noisy and discontinuous surfaces of shapes. In contrast to baselines, Michelangelo produces smooth surfaces and portrays shape details. Please zoom in for more visual details.
Figure 8.23: Visual results for text-conditioned generation comparison. In the first two rows, the models are tested with abstract texts, and the result shows that only Michelangelo generates a 3D shape that conforms to the target text with a smooth surface and fine details. The last two rows show the results given texts containing detailed descriptions, which further demonstrates that Michelangelo can capture both …
Figure 8.24: Ablation study of the effectiveness of training the generative model in the aligned space. This figure illustrates visual comparisons for ablation studies on the effectiveness of training the generative model in the aligned space. Compared with the lower samples based on the conditional texts, the upper samples are closer to the conditions semantically, which indicates the effectiveness of the training generat…
Figure 8.25: Ablation study of the effectiveness of vision-language models and the impact of learnable query embeddings. This figure shows the ablation study on the effectiveness of the vision-language model and the impact of learnable query embeddings. According to the table, the model with CLIP and 512 learnable query embeddings achieves the best reconstruction and classification performance, indicating its abili…
Figure 8.26: Nearest Neighbor Analysis. We traverse the whole training set to find three nearest neighbors for the generated 3D shapes, and the results reveal that our model could produce novel 3D shapes based on given images instead of memorizing specific ones. Discussion. Though the Michelangelo project has shown that we can achieve excellent results in generating 3D objects, it still has some limitations. First, …
Figure 8.27: Task: to infer the 3D shape of an object with a canonical pose relative to the camera pose from a single image. Left: Input image. Middle: Output recovers a canonical 3D model of the object (with intrinsic properties such as shape and texture) along with the exact camera pose (extrinsic property) that generates the input. Right: Re-rendering the 3D model from the estimated pose faithfully reproduces the…
Figure 8.28: Results for generative 3D reconstruction from a single test image. Given an input image (top left), Cupid estimates camera pose (bottom left) and reconstructs 3D model (bottom right), re-rendering the input (top right). It is robust to changes in scale, placement, and lighting while preserving fine details, and supports component-aligned scene reconstruction (bottom row). All results are produced in sec…
Figure 8.29: The Cupid Two-Stage Generative Reconstruction Pipeline. Given an input image, the first stage (GS) generates a coarse occupancy cube and a UV cube, which encodes 3D-to-2D correspondences. A PnP solver recovers the camera pose P* from these correspondences. The second stage (GL) is conditioned on this recovered pose. It injects pixel-aligned features (sampled from DINOv2 and low-level feature maps) in…
Figure 8.30: Qualitative comparison on input view consistency. We render the input view using its generated camera pose. For view-centric methods (LRM, LaRa), we use ground-truth intrinsics for rendering as they do not model intrinsics. Our method produces the highest-fidelity geometry and appearance; LRM hallucinates incorrect details, LaRa is overly blurry due to 2D diffusion inconsistencies, and 3D generation meth…
Figure 8.31: Qualitative comparison of various pose-aligned conditioning. Our method (e) achieves the best visual quality in terms of color fidelity and detail. … of canonical 3D points x_i^(k) corresponding to pixels u_i ∈ M^(k). To resolve the scale ambiguity inherent in independent generation, we formulate the alignment as a correspondence problem between the generated canonical space and the estimated camera space…
Figure 8.32: Component-aligned scene reconstruction. Each object is reconstructed independently with Cupid; explicit 3D–2D correspondences enable precise placement into a shared frame. Applying these optimal transformations results in a metric-consistent scene composition where all generated components are correctly positioned and scaled relative to one another. Please see…
Figure 8.33: Additional examples of component-aligned scene reconstruction. For each example shown, the panels display: (top left) the input image, (top right or bottom left) the final rendered output, and (bottom) the reconstructed individual components, color-coded for clarity. … S_uv^{(k),t−1} ← Step(S_uv^{(k),t}, d_{uv,t}^{(k)}), (8.9.12) where Step(·) denotes the update rule of the flow sampler. The resulting feature …
Figure 8.34: Multi-view conditioning. Our decoupled joint modeling naturally supports multi-view conditioning. With multiple input views available, we fuse the shared view-agnostic object latent across flow paths (like MultiDiffusion [BYL+23]), enabling object and camera refinement across all views. Top: inputs; Middle: reconstructed 3D object and camera poses; Bottom: rendered images and geometry. In the second …
Figure 8.35: Egocentric Human Motion Estimation. Given head poses from SLAM and egocentric images from a head-mounted device (left), EgoAllo estimates full body pose, height, and hand parameters in the world (allocentric) reference frame (right). The estimated bodies are grounded in the scene, with feet contacting the floor and hands at physically plausible positions. …der rotates in three axes, the elbow primarily i…
Figure 8.36: Invariant Head Motion Conditioning. The conditioning representation must be invariant to both spatial transformations (same motion at different world locations) and temporal shifts (same motion at different absolute times). Per-timestep canonicalization achieves both properties. Architecture. The model takes three inputs: • The noisy motion sequence x_n at diffusion step n, tokenized per-timestep. • The…
Figure 8.37: Overview of components of the EgoAllo framework. The diffusion model is restricted to local body parameters. An invariant parameterization g(·) of SLAM (head) poses is used to condition a diffusion model. These can be placed into the global coordinate frame via global alignment to input poses. When available, egocentric video is used for hand detection, say via HaMeR [PSR+24], which can be incorporated…
Figure 8.38: Body Context Improves Hand Estimation. Blue: monocular hand estimates from HaMeR, which have accurate local pose but incorrect world-frame position due to scale/depth ambiguity. Purple: estimates with body context, correctly grounded in the world frame.
Figure 8.39: Qualitative Results. Estimated body poses from real-world egocentric recordings, visualized in 3D scene reconstructions. The method produces physically plausible poses with appropriate grounding across diverse activities. 8.10.8 Discussion. This application demonstrates how the conditional inference framework from Chapter 7 extends beyond images to structured, articulated outputs like human motion…
Figure 8.40: The process of tokenizing text data using BPE. (Image credit to https://huggingface.co/learn/nlp-course/chapter6/5). (Left) We begin by analyzing the given text corpus and constructing an initial vocabulary that consists of individual characters (or bytes in the case of byte-level BPE). Then, we compute the frequencies of adjacent character pairs in the corpus. This involves scanning the entire text and…
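The first BPE steps described in the Figure 8.40 caption (start from a character-level vocabulary, then count adjacent symbol pairs across the corpus) can be sketched in a few lines. The caption is truncated before the merge step, so the merge_pair function below follows the standard BPE procedure as an assumption rather than the source's exact wording. The function names and the toy word-frequency table are hypothetical.

```python
from collections import Counter

def count_pairs(words):
    """Count adjacent symbol pairs, weighted by each word's frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol (standard BPE step)."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Hypothetical corpus statistics: word -> number of occurrences.
corpus = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
words = {tuple(word): freq for word, freq in corpus.items()}  # character-level start

pairs = count_pairs(words)
best = max(pairs, key=pairs.get)        # most frequent adjacent pair
words_after = merge_pair(words, best)   # vocabulary gains the merged symbol
print(best, pairs[best])
```

Repeating the count-and-merge loop until a target vocabulary size is reached gives the usual BPE training procedure. Byte-level BPE would start from bytes instead of characters.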
Figure 8.41: The loss curve of CRATE-GPT-Base trained on the OpenWebText dataset. Model architecture. We use the GPT-2 tokenizer, which has vocabulary size V = 50257, including a special token for <|pad|>. The context length is N_max = 1024. The backbone model follows the GPT2-Base architecture [RWC+19] with the appropriate alterations to have causal CRATE layers, and we compare against GPT2-Small and GPT2-Base. …
Figure 8.42: One layer of the CRATE-α backbone. The difference from CRATE is that the ISTA_θ^ℓ block is replaced by the ODL_θ^ℓ block, which performs several ISTA steps with an overcomplete dictionary. 8.12 Scaling and Improving White-Box Transformers. In this last section, we will discuss several ways in which various parts of CRATE-type models can be scaled up or made more efficient for certain special tasks while s…
Figure 8.43: Saliency maps from CRATE-α with patch size 8. Each row is a different image and each column corresponds to a different attention head in the last layer. We observe that the saliency maps strongly correspond to the objects in the input image. … 3. The output of the nonlinearity is the sparse codes of the input with respect to the dictionary. In practice, giving up (1) is less tractable for efficiency reaso…
Figure 8.44: One layer of the ToST backbone. Token representations go through layer-norms, the token statistics self-attention (TSSA) operator, and an MLP, in order to form the layer’s output.
Datasets | ToST-T(iny) | ToST-S(mall) | ToST-M(edium) | XCiT-S | XCiT-M | ViT-S | ViT-B(ase)
# parameters | 5.8M | 22.6M | 68.1M | 24.9M | 80.2M | 22.1M | 86.6M
ImageNet | 67.3 | 77.9 | 80.3 | 80.5 | 81.5 | 79.8 | 81.8
ImageNet ReaL | 72.2 | 84.1 | 85.6 | 85.6 | 85.9 | 85.6 | 86.7
…
Figure 8.45: One layer of the AoT backbone. Token representations merely go through a layer-norm and the multi-head (subspace) self-attention operator to form the layer’s output. Notice that there is no token-wise nonlinearity such as MLP or ISTA or ODL. … layer is simply of the form Z_θ^{ℓ+1}(X) = Z_θ^ℓ(X) + MSSA_θ^ℓ(LN_θ^ℓ(Z_θ^ℓ(X))). (8.12.8) In our implementation, we also experimented with using multi-head self-a…
Figure 8.46: Evaluating models on language tasks. We plot the training loss (left) and validation loss (right) of the AoT and GPT-2 models pretrained on OpenWebText. … large parameter sizes can achieve comparable performance to the GPT-2 base model. Moreover, we found that adding MLP layers to AoT does not improve the zero-shot performance. These results highlight the potential of attention-only models to achieve comp…
Figure 9.1: From an open-ended deep network to a closed-loop system.
Figure 9.2: Conjectured architecture of the brain cortex. The cortex is a mas…
Figure 9.3: Three tests for different levels or types of intelligence capabilities: …
Original abstract

In the current era of deep learning and especially generative models, there is significant investment in training very large deep neural networks. Thus far, such models have been "black boxes" that are difficult to understand in the sense that they have opaque internal mechanisms, leading to difficulties in interpretability, reliability, and control. Naturally, this lack of understanding has led to both hype and fear. This book is an attempt to "open the black box" and understand the mechanisms of large deep networks, through the perspective of representation learning, which is a major factor - arguably the single most important one - in the empirical power of deep learning models. A brief outline of this book is as follows. Chapter 1 will summarize the threads that underlie the whole text. Chapters 2, 3, 4, 5, and 6 will explain the design principles of modern neural network architectures through optimization and information theory, reducing the process of architecture development (long having been described as a sort of "alchemy") to undergraduate-level linear algebra and calculus exercises once the underlying principles are introduced. Chapters 7 and 8 will discuss applications of these principles to solve problems in more paradigmatic ways, obtaining new methods and models which are efficient, interpretable, and controllable by design, and yet no less - sometimes even more - powerful than the black-box models they resemble. Chapter 9 will discuss potential future directions for deep learning, the role of representation learning, as well as some open problems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript is a book-length work proposing a mathematical theory of representation learning and memory in deep neural networks. Chapter 1 summarizes underlying threads; chapters 2–6 claim to derive the design principles of modern architectures from optimization and information theory, reducing architecture development to undergraduate linear algebra and calculus exercises; chapters 7–8 apply these principles to obtain new efficient, interpretable models; chapter 9 discusses future directions and open problems.

Significance. If the promised derivations in chapters 2–6 were supplied and shown to be parameter-free reductions grounded only in optimization and information theory, the work would offer a valuable contribution by replacing ad-hoc architecture search with explicit mathematical principles, improving interpretability and control. The framing around representation learning as the core driver of empirical success is a coherent organizing lens. No machine-checked proofs, reproducible code, or falsifiable predictions are present in the supplied text.

major comments (1)
  1. [Abstract] Abstract and chapter outline: the central claim that chapters 2–6 reduce architecture design to undergraduate linear algebra and calculus via optimization and information theory is asserted without any equations, derivations, worked examples, or even high-level proof sketches. This absence makes the load-bearing promise of the manuscript unverifiable from the text.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their review and for identifying the central verifiability issue in the supplied text. We address the major comment below.

  1. Referee: [Abstract] Abstract and chapter outline: the central claim that chapters 2–6 reduce architecture design to undergraduate linear algebra and calculus via optimization and information theory is asserted without any equations, derivations, worked examples, or even high-level proof sketches. This absence makes the load-bearing promise of the manuscript unverifiable from the text.

    Authors: The supplied manuscript text consists of the abstract and chapter outline; the detailed derivations promised for chapters 2–6 are not present. We agree that this renders the central claim unverifiable from the current text. We will revise by expanding the outline in Chapter 1 to incorporate high-level proof sketches and at least two worked examples (one from optimization and one from information theory) that illustrate the claimed reduction to undergraduate linear algebra and calculus. These additions will be placed before the chapter summaries so that readers can assess the approach without needing the full later chapters.
    Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected


The abstract and chapter outline describe a high-level program for deriving neural architecture principles from optimization and information theory, but supply no equations, fitted parameters, self-citations, or uniqueness theorems that could be inspected for reduction to inputs. No load-bearing derivation chain is exhibited in the provided text, so the claimed reduction to undergraduate linear algebra remains an unverified assertion rather than a demonstrated circularity. This is the expected honest non-finding when concrete steps are absent.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No specific free parameters, axioms, or invented entities are described in the abstract; the work appears to rely on standard optimization and information-theoretic concepts from prior literature.

pith-pipeline@v0.9.1-grok · 5802 in / 979 out tokens · 33104 ms · 2026-06-28T03:04:06.576956+00:00 · methodology


Reference graph

Works this paper leans on

126 extracted references · 2 canonical work pages

  1. [1]

    Direct linear transformation from comparator coordinates into object space coordinates in close-range photogrammetry

    [AKH15] Yousset I Abdel-Aziz, Hauck Michael Karara, and Michael Hauck. “Direct linear transformation from comparator coordinates into object space coordinates in close-range photogrammetry” . Pho- togrammetric engineering & remote sensing 81.2 (2015), pp. 103–

  2. [2]

    Learning Sparsely Used Overcomplete Dictio- naries via Alternating Minimization

    [AAJ+16] Alekh Agarwal, Animashree Anandkumar, Prateek Jain, and Pra- neeth Netrapalli. “Learning Sparsely Used Overcomplete Dictio- naries via Alternating Minimization” . SIAM Journal on Opti- mization 26.4 (2016), pp. 2775–2799. eprint: https://doi.org/ 10.1137/140979861. [AEB06] Michal Aharon, Michael Elad, and Alfred Bruckstein. “K-SVD: An algorithm f...

  3. [3]

    Why do deep convolutional networks generalize so poorly to small image transformations?

    Proceedings of Machine Learning Research. Paris, France: PMLR, July 2015, pp. 113–149. [A W18] Aharon Azulay and Yair Weiss. “Why do deep convolutional networks generalize so poorly to small image transformations?” arXiv preprint arXiv:1805.12177 (2018). [BJC85] B. Ans, J. Hérault, and C. Jutten. “Architectures neuromimé- tiques adaptatives : Détection de...

  4. [4]

    A fast iterative shrinkage-thresholding algorithm for linear inverse problems

    [BT09] Amir Beck and Marc Teboulle. “A fast iterative shrinkage-thresholding algorithm for linear inverse problems” . SIAM journal on imaging sciences 2.1 (2009), pp. 183–202. [BHM+19] Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. “Reconciling modern machine-learning practice and the classical bias–variance trade-off” .Proceedings of the Natio...

  5. [5]

    Computation of channel capacity and rate-distortion functions

    arXiv: 2409.20325 [cs.LG] . [Bla72] R. Blahut. “Computation of channel capacity and rate-distortion functions” .IEEE Transactions on Information Theory 18.4 (1972), pp. 460–473. [BRL+23] Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. “Align your latents: High-resolution video synthesis with late...

  6. [6]

    Classifier-Free Guid- ance is a Predictor-Corrector

    [BN24b] Arwen Bradley and Preetum Nakkiran. “Classifier-Free Guid- ance is a Predictor-Corrector” . arXiv [cs.LG] (Aug. 2024). arXiv: 2408.09000 [cs.LG] . [BN20] Guy Bresler and Dheeraj Nagaraj. “Sharp representation theo- rems for relu networks with precise dependence on depth” . Pro- ceedings of the 34th International Conference on Neural Infor- mation ...

  7. [7]

    2020, pp

    Red Hook, NY, USA: Curran Associates Inc., Dec. 2020, pp. 10697–10706. [BB11] Haim Brezis and Haim Brézis. Functional analysis, Sobolev spaces and partial differential equations . Vol

  8. [8]

    Matrix Calculus (for Machine Learning and Beyond)

    [BEJ25] Paige Bright, Alan Edelman, and Steven G Johnson. “Matrix Calculus (for Machine Learning and Beyond)” . arXiv preprint arXiv:2501.14787 (2025). [BDS19] Andrew Brock, Jeff Donahue, and Karen Simonyan. “Large Scale GAN Training for High Fidelity Natural Image Synthesis” . Inter- national Conference on Learning Representations (ICLR)

  9. [9]

    Lan- guage models are few-shot learners

    [BMR+20] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateu...

  10. [10]

    Invariant Scattering Convo- lution Networks

    [BM13] Joan Bruna and Stéphane Mallat. “Invariant Scattering Convo- lution Networks” . IEEE Transactions on Pattern Analysis and Machine Intelligence 35.8 (2013), pp. 1872–1886. §B.3 515 [BGW21] Sam Buchanan, Dar Gilboa, and John Wright. “Deep Networks and the Multiple Manifold Problem” . International Conference on Learning Representations

  11. [11]

    On the edge of memorization in diffusion models

    [BPM+25] Sam Buchanan, Druv Pai, Yi Ma, and Valentin De Bortoli. “On the edge of memorization in diffusion models” . arXiv preprint arXiv:2508.17689 (2025). [CD91] M. Frank Callier and A. Charles Desoer. Linear System Theory . Springer-Verlag,

  12. [12]

    Decoding by linear programming

    [CT05a] E. Candès and T. Tao. “Decoding by linear programming” . IEEE Transactions on Information Theory 51.12 (2005). [CT05b] E. Candès and T. Tao. “Error Correction via Linear Program- ming” .IEEE Symposium on FOCS (2005), pp. 295–308. [CMM+21] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised Lea...

  13. [13]

    Emerging prop- erties in self-supervised vision transformers

    09882 [cs.CV] . [CTM+21] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. “Emerging prop- erties in self-supervised vision transformers” . Proceedings of the IEEE/CVF international conference on computer vision . 2021, pp. 9650–9660. [Cha66] Gregory J. Chaitin. “On the Length of Programs for Compu...

  14. [14]

    Score approximation, estimation and distribution recovery of diffusion models on low-dimensional data

    [CHZ+23] Minshuo Chen, Kaixuan Huang, Tuo Zhao, and Mengdi Wang. “Score approximation, estimation and distribution recovery of diffusion models on low-dimensional data” . International Con- ference on Machine Learning . PMLR. 2023, pp. 4672–4712. [CRB+18] Ricky TQ Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. “Neural ordinary differential...

  15. [15]

    Exploring low-dimensional subspace in diffusion mod- els for controllable image editing

    [CZG+24] Siyi Chen, Huijie Zhang, Minzhe Guo, Yifu Lu, Peng Wang, and Qing Qu. “Exploring low-dimensional subspace in diffusion mod- els for controllable image editing” . Advances in Neural Informa- tion Processing Systems 37 (2024), pp. 27340–27371. [CZL+25] Siyi Chen, Yimeng Zhang, Sijia Liu, and Qing Qu. “The Dual Power of Interpretable Token Embedding...

  16. [16]

    Group equivariant convolutional networks

    [CW16a] Taco Cohen and Max Welling. “Group equivariant convolutional networks” .International Conference on Machine Learning . 2016, pp. 2990–2999. [CW16b] Taco Cohen and Max Welling. “Group equivariant convolutional networks” .International conference on machine learning . PMLR. 2016, pp. 2990–2999. [CW16c] Taco S. Cohen and Max Welling. “Group Equivaria...

  17. [17]

    Support-Vector Networks

    07576. [CV95] Corinna Cortes and Vladimir Vapnik. “Support-Vector Networks” . Mach. Learn. 20.3 (1995), pp. 273–297. [CT91] T. Cover and J. Thomas. Elements of Information Theory . Wiley Series in Telecommunications,

  18. [18]

    Geometrical and Statistical Properties of Sys- tems of Linear Inequalities with Applications in Pattern Recog- nition

    [Cov64] Thomas Cover. “Geometrical and Statistical Properties of Sys- tems of Linear Inequalities with Applications in Pattern Recog- nition” .IEEE TRANSACTIONS ON ELECTRONIC COMPUT- ERS (1964). [Cyb89] George V. Cybenko. “Approximation by superpositions of a sig- moidal function” .Mathematics of Control, Signals and Systems 2 (1989), pp. 303–314. [D D00]...

  19. [19]

    Distributional Diffusion Models with Scoring Rules

    Curran Associates, Inc., 2023, pp. 288–313. [DGG+25] Valentin De Bortoli, Alexandre Galashov, J Swaroop Guntupalli, Guangyao Zhou, Kevin Murphy, Arthur Gretton, and Arnaud Doucet. “Distributional Diffusion Models with Scoring Rules” . arXiv preprint arXiv:2502.02483 (2025). [DCM+23] Aaron Defazio, Ashok Cutkosky, Harsh Mehta, and Konstantin Mishchenko. “O...

  20. [20]

    Gromov– Wasserstein distances between Gaussian distributions

    [DDS22] Julie Delon, Agnes Desolneux, and Antoine Salmona. “Gromov– Wasserstein distances between Gaussian distributions” . Journal of Applied Probability 59.4 (2022), pp. 1178–1198. [DDS+09a] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. “Im- ageNet: A Large-Scale Hierarchical Image Database” . CVPR09

  21. [21]

    ImageNet: A Large-Scale Hierarchical Image Database

    [DDS+09b] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. “ImageNet: A Large-Scale Hierarchical Image Database” . Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) . 2009, pp. 248–255. [DCL+19] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. “Bert: Pre-training of deep bidirec...

  22. [22]

    Sparse components of images and optimal atomic decompositions

    [Don01] D L Donoho. “Sparse components of images and optimal atomic decompositions” . Constructive approximation 17.3 (Jan. 2001), pp. 353–382. [DVD+98] D L Donoho, M Vetterli, R A DeVore, and I Daubechies. “Data compression and harmonic analysis” . IEEE transactions on in- formation theory / Professional Technical Group on Information Theory 44.6 (Oct. 1...

  23. [23]

    Google scanned objects: A high-quality dataset of 3d scanned household items

    [DFK+22] Laura Downs, Anthony Francis, Nate Koenig, Brandon Kinman, Ryan Hickman, Krista Reymann, Thomas B McHugh, and Vin- cent Vanhoucke. “Google scanned objects: A high-quality dataset of 3d scanned household items” . 2022 International Conference on Robotics and Automation (ICRA). IEEE. 2022, pp. 2553–2560. [DSC22] Shiv Ram Dubey, Satish Kumar Singh, ...

  24. [24]

    Toy models of superposition

    [EHO+22a] Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, et al. “Toy models of superposition” . arXiv preprint arXiv:2209.10652 (2022). [EHO+22b] Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatf...

  25. [25]

    Taming trans- formers for high-resolution image synthesis

    [ERO21] Patrick Esser, Robin Rombach, and Bjorn Ommer. “Taming trans- formers for high-resolution image synthesis” . Proceedings of the IEEE/CVF conference on computer vision and pattern recogni- tion. 2021, pp. 12873–12883. [EGW+10] Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. The PASCAL Visual Object Classe...

  26. [26]

    Diffusion models and the man- ifold hypothesis: Log-domain smoothing is geometry adaptive

    arXiv: 0909.5206 [cs.CV] . [FPH+25] Tyler Farghly, Peter Potaptchik, Samuel Howard, George Deli- giannidis, and Jakiw Pidstrigach. “Diffusion models and the man- ifold hypothesis: Log-domain smoothing is geometry adaptive” . arXiv preprint arXiv:2510.02305 (2025). [FZS22] William Fedus, Barret Zoph, and Noam Shazeer. “Switch trans- formers: scaling to tri...

  27. [27]

    Implicit learning dynamics in stackelberg games: Equilibria characteriza- tion, convergence analysis, and empirical study

    §B.3 521 [FCR20] Tanner Fiez, Benjamin Chasnov, and Lillian Ratliff. “Implicit learning dynamics in stackelberg games: Equilibria characteriza- tion, convergence analysis, and empirical study” . International Conference on Machine Learning . PMLR. 2020, pp. 3133–3144. [FCR19] Tanner Fiez, Benjamin Chasnov, and Lillian J Ratliff. “Conver- gence of learning...

  28. [28]

    Scaling and evaluating sparse autoencoders

    arXiv: 2304.14108 [cs.CV] . [GTT+25] Leo Gao, Tom Dupre la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, and Jeffrey Wu. “Scaling and evaluating sparse autoencoders” . The Thirteenth In- ternational Conference on Learning Representations

  29. [29]

    Handbook of conver- gence theorems for (stochastic) gradient methods

    [GG23] Guillaume Garrigos and Robert M Gower. “Handbook of conver- gence theorems for (stochastic) gradient methods” . arXiv preprint arXiv:2301.11235 (2023). [GWX+25] Zheng Geng, Nan Wang, Shaocong Xu, Chongjie Ye, Bohan Li, Zhaoxi Chen, Sida Peng, and Hao Zhao. “One View, Many Worlds: Single-Image to 3D Object Meets Generative Domain Random- ization for...

  30. [30]

    Generative adversarial nets

    [GPM+14b] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. “Generative adversarial nets” . Advances in neural infor- mation processing systems . 2014, pp. 2672–2680. [GDG+17] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tul...

  31. [31]

    Should Penalized Least Squares Regression be Interpreted as Maximum A Posteriori Estimation?

    [Gri11] Rémi Gribonval. “Should Penalized Least Squares Regression be Interpreted as Maximum A Posteriori Estimation?” IEEE trans- actions on signal processing: a publication of the IEEE Signal Processing Society 59.5 (May 2011), pp. 2405–2410. §B.3 523 [GJB15] Remi Gribonval, Rodolphe Jenatton, and Francis Bach. “Sparse and spurious: Dictionary learning ...

  32. [32]

    Competitive Learning: From Interactive Ac- tivation to Adaptive Resonance

    [Gro87] Stephen Grossberg. “Competitive Learning: From Interactive Ac- tivation to Adaptive Resonance” . Cogn. Sci. 11 (1987), pp. 23–

  33. [33]

    On memorization in diffusion models

    [GDP+23] Xiangming Gu, Chao Du, Tianyu Pang, Chongxuan Li, Min Lin, and Ye Wang. “On memorization in diffusion models” . arXiv preprint arXiv:2310.02664 (2023). [GYR+23] Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. “Animatediff: Animate your personalized text-to-image diffusion models wi...

  34. [34]

    Masked autoencoders are scalable vision learners

    [HCX+22] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dol- lár, and Ross Girshick. “Masked autoencoders are scalable vision learners” .Proceedings of the IEEE/CVF conference on computer vision and pattern recognition . 2022, pp. 16000–16009. [HFW+19] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Gir- shick. “Momentum contrast for unsup...

  35. [35]

    Gans trained by a two time- scale update rule converge to a local nash equilibrium

    [HRU+17b] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bern- hard Nessler, and Sepp Hochreiter. “Gans trained by a two time- scale update rule converge to a local nash equilibrium” . Advances in neural information processing systems 30 (2017). [HS06] G. E. Hinton and R. R. Salakhutdinov. “Reducing the Dimen- sionality of Data with Neural Networks” ...

  36. [36]

    Classifier-Free Diffusion Guid- ance

    2020, pp. 6840–6851. [HS21] Jonathan Ho and Tim Salimans. “Classifier-Free Diffusion Guid- ance” .NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications

  37. [37]

    Classifier-Free Diffusion Guid- ance

    [HS22a] Jonathan Ho and Tim Salimans. “Classifier-Free Diffusion Guid- ance” .arXiv [cs.LG] (July 2022). arXiv: 2207.12598 [cs.LG] . [HS22b] Jonathan Ho and Tim Salimans. “Classifier-Free Diffusion Guid- ance” .NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications

  38. [38]

    Long Short-term Mem- ory

    [HS97] Sepp Hochreiter and Jürgen Schmidhuber. “Long Short-term Mem- ory” .Neural computation 9 (Dec. 1997), pp. 1735–80. [HSD20] David Hong, Yue Sheng, and Edgar Dobriban. Selecting the num- ber of components in PCA via random signflips

  39. [39]

    Lrm: Large reconstruction model for single image to 3d

    arXiv: 2012.02985 [math.ST] . §B.3 525 [HZG+23] Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, and Hao Tan. “Lrm: Large reconstruction model for single image to 3d” . arXiv preprint arXiv:2311.04400 (2023). [Hot33] H. Hotelling. “Analysis of a Complex of Statistical Variables into Principal Compo...

  40. [40]

    CUPID: Generative 3D Reconstruction via Joint Object and Pose Modeling

    [HDZ+25] Binbin Huang, Haobin Duan, Yiqun Zhao, Zibo Zhao, Yi Ma, and Shenghua Gao. “CUPID: Generative 3D Reconstruction via Joint Object and Pose Modeling” .arXiv preprint arXiv:2510.20776 (2025). [HYH+22] Chun-Hao P. Huang, Hongwei Yi, Markus Höschle, Matvey Safroshkin, Tsvetelina Alexiadis, Senya Polikovsky, Daniel Scharstein, and Michael J. Black. “Ca...

  41. [41]

    eceptive fields of single neurones in the cat’s striate cortex

    IEEE. 1999, pp. 541–547. [HW59] D.H. Hubel and T.N. Wiesel. “eceptive fields of single neurones in the cat’s striate cortex” . J. Physiol. 148.3 (1959), pp. 574–591. [HCS+24] Robert Huben, Hoagy Cunningham, Logan Riggs Smith, Aidan Ewart, and Lee Sharkey. “Sparse Autoencoders Find Highly In- terpretable Features in Language Models” . The Twelfth Interna- ...

  42. [42]

    Stein Latent Optimization for Gen- erative Adversarial Networks

    [HKJ+21] Uiwon Hwang, Heeseung Kim, Dahuin Jung, Hyemi Jang, Hyungyu Lee, and Sungroh Yoon. “Stein Latent Optimization for Gen- erative Adversarial Networks” . arXiv preprint arXiv:2106.05319 (2021). [Hyv05] Aapo Hyvärinen. “Estimation of Non-Normalized Statistical Mod- els by Score Matching” . Journal of Machine Learning Research 6.24 (2005), pp. 695–709...

  43. [43]

    Batch Normalization: Accel- erating Deep Network Training by Reducing Internal Covariate Shift

    [IS15] Sergey Ioffe and Christian Szegedy. “Batch Normalization: Accel- erating Deep Network Training by Reducing Internal Covariate Shift” .ICML. 2015, pp. 448–456. [JGB+21] Andrew Jaegle, Felix Gimeno, Andy Brock, Oriol Vinyals, An- drew Zisserman, and Joao Carreira. “Perceiver: General Percep- tion with Iterative Attention” . Proceedings of the 38th In...

  44. [44]

    Robust Compressed Sensing MRI with Deep Generative Priors

    Proceedings of Machine Learning Re- search. PMLR, 2021, pp. 4651–4664. [JAD+21] Ajil Jalal, Marius Arvinte, Giannis Daras, Eric Price, Alexan- dros G Dimakis, and Jon Tamir. “Robust Compressed Sensing MRI with Deep Generative Priors” . Advances in Neural Informa- tion Processing Systems . Ed. by M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. W...

  45. [45]

    A simple proof of Stirling’s formula for the gamma function

    Curran Associates, Inc., 2021, pp. 14938–14954. [Jam15] G J O Jameson. “A simple proof of Stirling’s formula for the gamma function” .The Mathematical Gazette 99.544 (Mar. 2015), pp. 68–74. [JRR+24] Yibo Jiang, Goutham Rajendran, Pradeep Kumar Ravikumar, Bryon Aragam, and Victor Veitch. “On the Origins of Linear Representations in Large Language Models” ....

  46. [46]

    Lvsm: A large view synthesis model with minimal 3d inductive bias

    PMLR, 2024, pp. 21879– 21911. [JJT+24] Haian Jin, Hanwen Jiang, Hao Tan, Kai Zhang, Sai Bi, Tianyuan Zhang, Fujun Luan, Noah Snavely, and Zexiang Xu. “Lvsm: A large view synthesis model with minimal 3d inductive bias” . arXiv preprint arXiv:2410.17242 (2024). [Jol02] I. Jollife. Principal Component Analysis . 2nd. Springer-Verlag,

  47. [47]

    Muon: An opti- mizer for hidden layers in neural networks, 2024

    [JJB+] Keller Jordan, Yuchen Jin, Vlado Boza, You Jiacheng, Franz Cecista, Laker Newhouse, and Jeremy Bernstein. “Muon: An opti- mizer for hidden layers in neural networks, 2024” .URL https://kellerjordan. github. io/posts/muon 6 (). [JT20] Sheena A. Josselyn and Susumu Tonegawa. “Memory engrams: Recalling the past and imagining the future” . Science 367 ...

  48. [48]

    A new approach to linear filtering and prediction problems

    [Kal60] Rudolph Emil Kalman. “A new approach to linear filtering and prediction problems” (1960). [KG24] Mason Kamb and Surya Ganguli. “An analytic theory of creativ- ity in convolutional diffusion models” .arXiv preprint arXiv:2412.20292 (2024). [KMH+20] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Al...

  49. [49]

    FearNet: Brain-Inspired Model for Incremental Learning

    url: https://www.youtube.com/watch?v=VMj- 3S1tku0 (vis- ited on 08/17/2025). [KK18] Ronald Kemker and Christopher Kanan. “FearNet: Brain-Inspired Model for Incremental Learning” . International Conference on Learning Representations

  50. [50]

    3D Gaussian splatting for real-time radiance field rendering

    [KKL+23] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. “3D Gaussian splatting for real-time radiance field rendering. ”ACM Trans. Graph. 42.4 (2023), pp. 139–1. [KB14] Diederik P Kingma and Jimmy Ba. “Adam: A method for stochas- tic optimization” . arXiv preprint arXiv:1412.6980 (2014). [KW13a] Diederik P Kingma and Max Wellin...

  51. [51]

    Nonlinear principal component analysis us- ing autoassociative neural networks

    [Kra91] Mark A Kramer. “Nonlinear principal component analysis us- ing autoassociative neural networks” .AIChE Journal 37.2 (1991), pp. 233–243. [KH+09] Alex Krizhevsky, Geoffrey Hinton, et al. “Learning multiple layers of features from tiny images” (2009). [KNH14] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. “The CIF AR- 10 dataset” .online: http://...

  52. [52]

    Multi-concept customization of text-to-image diffusion

    [KZZ+23] Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. “Multi-concept customization of text-to-image diffusion” .Proceedings of the IEEE/CVF conference on computer vision and pattern recognition . 2023, pp. 1931–1941. §B.3 529 [Lab24] Black Forest Labs. FLUX. https://github.com/black-forest- labs/flux

  53. [53]

    FLUX.1 Kontext: Flow matching for in-context image generation and edit- ing in latent space

    [LBB+25] Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack En- glish, Zion English, Patrick Esser, Sumith Kulal, Kyle Lacey, Yam Levi, Cheng Li, Dominik Lorenz, Jonas Müller, Dustin Podell, Robin Rombach, Harry Saini, Axel Sauer, and Luke Smith. “FLUX.1 Kontext: Flow matching for in-co...

  54. [54]

    The Nonlinear Statistics of High-Contrast Patches in Natural Images

    [LPM03] Ann Lee, Kim Pedersen, and David Mumford. “The Nonlinear Statistics of High-Contrast Patches in Natural Images” . Interna- tional Journal of Computer Vision 54 (Aug. 2003). [LSJ+16] Jason D Lee, Max Simchowitz, Michael I Jordan, and Benjamin Recht. “Gradient descent only converges to minimizers” . Confer- ence on learning theory . PMLR. 2016, pp. ...

  55. [55]

    O (d/T) convergence theory for diffusion probabilistic models under minimal assumptions

    Proceedings of Machine Learning Research. PMLR, July 2018, pp. 2965–2974. 530 Appendix B [LY24] Gen Li and Yuling Yan. “O (d/T) convergence theory for diffusion probabilistic models under minimal assumptions” . arXiv preprint arXiv:2409.18959 (2024). [LFD+22] Haochuan Li, Farzan Farnia, Subhro Das, and Ali Jadbabaie. “On convergence of gradient descent as...

  56. [56]

    Repair- ing Sparse Low-Rank Texture

    [LRZ+12] Xiao Liang, Xiang Ren, Zhengdong Zhang, and Yi Ma. “Repair- ing Sparse Low-Rank Texture” . Computer Vision – ECCV 2012 . Ed. by Andrew Fitzgibbon, Svetlana Lazebnik, Pietro Perona, Yoichi Sato, and Cordelia Schmid. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, pp. 482–495. [LMB+14] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, ...

  57. [57]

    Multi- scale geometric methods for data sets I: Multiscale SVD, noise and curvature

    [LMR17] Anna V. Little, Mauro Maggioni, and Lorenzo Rosasco. “Multi- scale geometric methods for data sets I: Multiscale SVD, noise and curvature” . Applied and Computational Harmonic Analysis 43.3 (2017), pp. 504–567. [LMZ+24] Jiacheng Liu, Sewon Min, Luke Zettlemoyer, Yejin Choi, and Hannaneh Hajishirzi. “Infini-gram: Scaling unbounded n-gram language m...

  58. [58]

    Decoupled Weight Decay Reg- ularization

    [LH19] Ilya Loshchilov and Frank Hutter. “Decoupled Weight Decay Reg- ularization” .arXiv preprint arXiv:1711.05101 (2019). 532 Appendix B [MYP+10] M. Journée, Y. Nesterov, P. Richtárik, and R. Sepulchre. “Gen- eralized power method for sparse principal component analysis” . Journal of Machine Learning Research 11 (2010), pp. 517–553. [MDH+07a] Y. Ma, H. ...

  59. [59]

    Segmenta- tion of multivariate mixed data via lossy data coding and com- pression

    [MDH+07b] Yi Ma, Harm Derksen, Wei Hong, and John Wright. “Segmenta- tion of multivariate mixed data via lossy data coding and com- pression” . IEEE transactions on pattern analysis and machine intelligence 29.9 (2007), pp. 1546–1562. [MHN13] Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. “Recti- fier nonlinearities improve neural network acoustic models”...

  60. [60]

    AMASS: Archive of motion capture as surface shapes

    [MGT+19] Naureen Mahmood, Nima Ghorbani, Nikolaus F Troje, Gerard Pons-Moll, and Michael J Black. “AMASS: Archive of motion capture as surface shapes” .Proceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition. 2019, pp. 5442–

  61. [61]

    Sparse Modeling for Image and Vision Processing

    [MBP14] Julien Mairal, Francis Bach, and Jean Ponce. “Sparse Modeling for Image and Vision Processing” . Foundations and Trends® in Computer Graphics and Vision 8.2-3 (2014), pp. 85–283. [MSM93] Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. “Building a Large Annotated Corpus of English: The Penn T reebank” . Computational Linguistics...

  62. [62]

    A Proposal for the Dartmouth Summer Research Project on Artificial Intelligence: August 31, 1955

    [MMR+06] John McCarthy, Marvin L. Minsky, Nathaniel Rochester, and Claude E. Shannon. “A Proposal for the Dartmouth Summer Research Project on Artificial Intelligence: August 31, 1955” . AI Mag. 27.4 (Dec. 2006), pp. 12–14. §B.3 533 [MC89] Michael McCloskey and Neal J Cohen. “Catastrophic interfer- ence in connectionist networks: The sequential learning p...

  63. [63]

    A Logical Calculus of the Ideas Immanent in Nervous Activity

    Elsevier, 1989, pp. 109–165. [MP43] Warren McCulloch and Walter Pitts. “A Logical Calculus of the Ideas Immanent in Nervous Activity” . Bulletin of Mathematical Biophysics 5 (1943), pp. 115–133. [MM70] Jerry M. Mendel and Robert W. Mclaren. “Reinforcement-learning control and pattern recognition systems” . In Mendel, J. M. and Fu, K. S., editors, Adaptive...

  64. [64]

    Continuum percolation thresholds in two dimensions

    arXiv: 1609 . 07843 [cs.CL] . [MM12] Stephan Mertens and Cristopher Moore. “Continuum percolation thresholds in two dimensions” . Phys. Rev. E 86 (6 Dec. 2012), p. 061109. [MON+19] Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas Geiger. “Occupancy Networks: Learning 3D Reconstruction in Function Space” . 2019 IEEE/CVF Con...

  65. [65]

    SYMMETRIC GAUGE FUNCTIONS AND UNI- TARILY INV ARIANT NORMS

    [Mir60] L Mirsky. “SYMMETRIC GAUGE FUNCTIONS AND UNI- TARILY INV ARIANT NORMS” .The Quarterly Journal of Math- ematics 11.1 (Jan. 1960), pp. 50–59. [MCS+22] Paritosh Mittal, Yen-Chi Cheng, Maneesh Singh, and Shubham Tulsiani. “Autosdf: Shape priors for 3d completion, reconstruc- tion and generation” . Proceedings of the IEEE/CVF conference on computer vis...

  66. [66]

    Stochastic Models for Generic Images

    [MG99] David Mumford and Basilis Gidas. “Stochastic Models for Generic Images” .Quarterly of Applied Mathematics 59 (July 1999). [MK07] Joseph F Murray and Kenneth Kreutz-Delgado. “Learning sparse overcomplete codes for images” . The Journal of VLSI Signal Pro- cessing Systems for Signal Image and Video Technology 46.1 (Mar. 2007), pp. 1–13. [MLS94] R. Mu...

  67. [67]

    The cosparse analysis model and algorithms

    [NDE+13] S. Nam, M.E. Davies, M. Elad, and R. Gribonval. “The cosparse analysis model and algorithms” . Applied and Computational Har- monic Analysis 34.1 (2013), pp. 30–56. [NGE+20] Charlie Nash, Yaroslav Ganin, SM Ali Eslami, and Peter Battaglia. “Polygen: An autoregressive generative model of 3d meshes” . In- ternational conference on machine learning....

  68. [68]

    Point-e: A system for generating 3d point clouds from complex prompts

    [NJD+22] Alex Nichol, Heewoo Jun, Prafulla Dhariwal, Pamela Mishkin, and Mark Chen. “Point-e: A system for generating 3d point clouds from complex prompts” .arXiv preprint arXiv:2212.08751 (2022). [ND21] Alexander Quinn Nichol and Prafulla Dhariwal. “Improved De- noising Diffusion Probabilistic Models” . International Conference on Machine Learning (ICML)

  69. [69]

    Towards a Mechanistic Explanation of Diffusion Model Generalization

    [NZM+24] Matthew Niedoba, Berend Zwartsenberg, Kevin Murphy, and Frank Wood. “Towards a Mechanistic Explanation of Diffusion Model Generalization”. arXiv preprint arXiv:2411.19339 (2024). [NMM19] Oliver Nina, Jamison Moody, and Clarissa Milligan. “A Decoder-Free Approach for Unsupervised Clustering and Manifold Learning with Random Triplet Mining”. 20...

  70. [70]

    Activation functions: Comparison of trends in practice and research for deep learning

    [NIG+18] Chigozie Nwankpa, Winifred Ijomah, Anthony Gachagan, and Stephen Marshall. “Activation functions: Comparison of trends in practice and research for deep learning”. arXiv preprint arXiv:1811.03378 (2018). [Oja82] Erkki Oja. “A simplified neuron model as a principal component analyzer”. Journal of Mathematical Biology 15 (1982), pp. 267–

  71. [71]

    A statistical theory of contrastive pre-training and multimodal generative AI

    [OLC+25] Kazusato Oko, Licong Lin, Yuhang Cai, and Song Mei. “A statistical theory of contrastive pre-training and multimodal generative AI”. arXiv [cs.LG] (Jan. 2025). arXiv: 2501.04641 [cs.LG]. [OF97] B A Olshausen and D J Field. “Sparse coding with an overcomplete basis set: a strategy employed by V1?” Vision research 37.23 (Dec. 1997), pp. 3311–3...

  72. [72]

    Representation Learning with Contrastive Predictive Coding

    [OLV18] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. “Representation Learning with Contrastive Predictive Coding”. arXiv [cs.LG] (July 2018). arXiv: 1807.03748 [cs.LG]. [OVK17] Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. “Neural discrete representation learning”. arXiv [cs.LG] (Nov. 2017). arXiv: 1711.00937 [cs.LG]. [Ope24] OpenAI...

  73. [73]

    Dinov2: Learning robust visual features without supervision

    [ODM+23] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. “Dinov2: Learning robust visual features without supervision”. arXiv preprint arXiv:2304.07193 (2023). [ODM+24] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Hu...

  74. [74]

    Pursuit of a discriminative representation for multiple subspaces via sequential games

    [PPC+23] Druv Pai, Michael Psenka, Chih-Yuan Chiu, Manxi Wu, Edgar Dobriban, and Yi Ma. “Pursuit of a discriminative representation for multiple subspaces via sequential games”. Journal of the Franklin Institute 360.6 (2023), pp. 4135–4171. [PCY+23] Xiaqing Pan, Nicholas Charron, Yongqian Yang, Scott Peters, Thomas Whelan, Chen Kong, Omkar Parkhi, Rich...

  75. [75]

    Prevalence of Neural Collapse during the terminal phase of deep learning training

    arXiv: 1606.06031 [cs.CL]. [PHD20] Vardan Papyan, XY Han, and David L Donoho. “Prevalence of Neural Collapse during the terminal phase of deep learning training”. arXiv preprint arXiv:2008.08186 (2020). [PRE17] Vardan Papyan, Yaniv Romano, and Michael Elad. “Convolutional neural networks analyzed via convolutional sparse coding”. The Journal of Mach...

  76. [76]

    The Linear Representation Hypothesis and the Geometry of Large Language Models

    [PCV24] Kiho Park, Yo Joong Choe, and Victor Veitch. “The Linear Representation Hypothesis and the Geometry of Large Language Models”. International Conference on Machine Learning. PMLR. 2024, pp. 39643–39666. [Par04] Andrew Parker. In The Blink Of An Eye: How Vision Sparked The Big Bang Of Evolution. Basic Books,

  77. [77]

    On Lines and Planes of Closest Fit to Systems of Points in Space

    [Pea01] K. Pearson. “On Lines and Planes of Closest Fit to Systems of Points in Space”. Philosophical Magazine 2.6 (1901), pp. 559–572. [PX23] William Peebles and Saining Xie. “Scalable diffusion models with transformers”. Proceedings of the IEEE/CVF international conference on computer vision. 2023, pp. 4195–4205. [PV25] Liangzu Peng and René Vidal. “...

  78. [78]

    A self-supervised descriptor for image copy detection

    [PRR+22] Ed Pizzi, Sreya Dutta Roy, Sugosh Nagavara Ravindra, Priya Goyal, and Matthijs Douze. “A self-supervised descriptor for image copy detection”. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022, pp. 14532–14542. [Pla99] S. E. Palmer. Vision Science: Photons to Phenomenology. The MIT Press,

  79. [79]

    Sdxl: Improving latent diffusion models for high-resolution image synthesis

    [PEL+23] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. “Sdxl: Improving latent diffusion models for high-resolution image synthesis”. arXiv preprint arXiv:2307.01952 (2023). [PW22] Yury Polyanskiy and Yihong Wu. Information Theory: From Coding to Learning. Cambridge University Press,

  80. [80]

    Representation Learning via Manifold Flattening and Reconstruction

    [PPR+24] Michael Psenka, Druv Pai, Vishal Raman, Shankar Sastry, and Yi Ma. “Representation Learning via Manifold Flattening and Reconstruction”. Journal of Machine Learning Research 25.132 (2024), pp. 1–47. [QSM+17] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. “Pointnet: Deep learning on point sets for 3d classification and seg...

Showing first 80 references.