Position: Weight Space Should Be a First-Class Generative AI Modality

Kai Wang; Peihao Wang; Zhangyang Wang

arxiv: 2605.18632 · v1 · pith:EQ7YGT5Onew · submitted 2026-05-18 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-20 12:06 UTC · model grok-4.3

keywords: weight space · generative modeling · neural network checkpoints · model synthesis · first-class modality · adapter generation · fine-tuning alternatives · structured weight regions

The pith

Treating neural network checkpoints as a first-class generative modality lets models be synthesized in weight space to match fine-tuning at far lower cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This position paper argues that the millions of existing trained neural network checkpoints form a valuable data resource that should be modeled directly as a generative modality rather than starting from scratch for every new task. The key observation is that high-performing weights occupy low-dimensional structured regions shaped by symmetry, flatness, modularity, and shared subspaces, so generative models can sample new checkpoints on demand. A sympathetic reader would care because this could cut adaptation costs by orders of magnitude while shifting AI development from per-task optimization to sampling from learned weight distributions. The authors organize current methods into a five-stage pipeline, point to practical successes at adapter scale, and identify unrestricted frontier-scale synthesis as the remaining open problem.

Core claim

Neural network checkpoints should be treated as a first-class data modality, and generative modeling in weight space should be standardized as a core machine learning primitive. High-performing models occupy low-dimensional, highly structured regions of weight space shaped by symmetry, flatness, modularity, and shared subspaces, allowing weights to be synthesized on demand that often match fine-tuning performance while reducing adaptation cost by orders of magnitude.
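
Of the structural properties listed above, symmetry is the easiest to check directly. The sketch below is an illustration and not code from the paper: it builds a hypothetical two-layer ReLU network in NumPy (the layer sizes and the helper names `mlp_forward` and `permute_hidden` are assumptions made here) and shows that reordering hidden units changes the weight vector while leaving the computed function unchanged.

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    """Two-layer ReLU MLP: x -> relu(x @ W1 + b1) @ W2 + b2."""
    h = np.maximum(x @ W1 + b1, 0.0)
    return h @ W2 + b2

def permute_hidden(W1, b1, W2, perm):
    """Reorder hidden units: columns of W1, entries of b1, rows of W2."""
    return W1[:, perm], b1[perm], W2[perm, :]

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 4, 8, 3  # hypothetical toy sizes
W1 = rng.normal(size=(d_in, d_hidden))
b1 = rng.normal(size=d_hidden)
W2 = rng.normal(size=(d_hidden, d_out))
b2 = rng.normal(size=d_out)

# A non-identity permutation of the hidden units (cyclic shift by one).
perm = np.roll(np.arange(d_hidden), 1)
W1p, b1p, W2p = permute_hidden(W1, b1, W2, perm)

x = rng.normal(size=(5, d_in))
same_function = np.allclose(mlp_forward(x, W1, b1, W2, b2),
                            mlp_forward(x, W1p, b1p, W2p, b2))
different_weights = not np.allclose(W1, W1p)
print(same_function, different_weights)
```

Many distinct points in weight space therefore encode the same model, so a generative model over raw checkpoints would likely need to align them or be invariant to this reordering.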

What carries the argument

Generative synthesis in weight space: learning distributions over trained checkpoints to sample new weight vectors that inherit the structural properties of high-performing models.

If this is right

New checkpoints can be created for specific tasks without running full fine-tuning or optimization from random initialization.
Adaptation to new domains or architectures becomes feasible at orders-of-magnitude lower compute cost than current practice.
Methods can be standardized into a five-stage pipeline covering data collection, representation learning, distribution modeling, sampling, and evaluation.
Practical deployment is already possible for adapter-scale and conditional generation settings.
AI systems can begin to improve or create other AI systems by sampling directly from learned weight distributions.
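
The five stages named above can be traced end to end on a toy problem. The sketch below is only an assumption-laden illustration of the data flow, not any method from the paper: the "checkpoints" are least-squares fits of a hypothetical linear-regression task, PCA stands in for representation learning, and a diagonal Gaussian stands in for the generative model. Because the toy noise is isotropic, it does not demonstrate low-dimensional structure.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_ckpt, n_train, k = 20, 64, 200, 4  # hypothetical toy sizes

# Stage 1: data collection. Train many small models and keep their weights.
w_true = rng.normal(size=d)

def train_checkpoint():
    X = rng.normal(size=(n_train, d))
    y = X @ w_true + 0.5 * rng.normal(size=n_train)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

checkpoints = np.stack([train_checkpoint() for _ in range(n_ckpt)])

# Stage 2: representation learning. PCA stands in for a learned encoder.
mean_w = checkpoints.mean(axis=0)
_, _, Vt = np.linalg.svd(checkpoints - mean_w, full_matrices=False)
basis = Vt[:k]                              # (k, d) top principal directions
latents = (checkpoints - mean_w) @ basis.T  # (n_ckpt, k) latent codes

# Stage 3: distribution modeling. Diagonal Gaussian over the latent codes.
mu, sigma = latents.mean(axis=0), latents.std(axis=0)

# Stage 4: sampling. Draw latent codes and decode them back to weights.
z = mu + sigma * rng.normal(size=(16, k))
sampled = mean_w + z @ basis                # (16, d) synthesized weights

# Stage 5: evaluation. Held-out error of sampled vs. trained weights.
X_test = rng.normal(size=(1000, d))
y_test = X_test @ w_true

def mse(W):
    """Per-model mean squared error; W has one weight vector per row."""
    return np.mean((X_test @ W.T - y_test[:, None]) ** 2, axis=0)

print("trained mean MSE:", mse(checkpoints).mean())
print("sampled mean MSE:", mse(sampled).mean())
```

In a realistic setting each stand-in would be replaced by a far larger component, and the evaluation stage would compare against fine-tuning on the target task.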

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Model repositories could evolve from static collections into primary training corpora for meta-generative systems.
Conditional control over sampled weights might enable systematic creation of models with targeted properties such as efficiency or robustness.
The same low-dimensional structure could inform new approaches to model merging, compression, and modular composition.

Load-bearing premise

The structural properties observed in recent adapter-scale and conditional generation results will scale to unrestricted frontier-scale checkpoint synthesis without additional fundamental limitations.

What would settle it

An experiment in which generative synthesis from weight distributions fails to reach fine-tuning accuracy on a large new task, or in which no low-dimensional structured regions are found among frontier-model weights.

Figures

Figures reproduced from arXiv: 2605.18632 by Kai Wang, Peihao Wang, Zhangyang Wang.

**Figure 1.** A regime map for weight-space generation (see §3). We frame neural weight generation as conditional sampling from p(W | A, C, R), where W denotes generated weights or weight updates, A specifies the architecture graph and tensor schema, C encodes task or user conditions, and R captures training-recipe and checkpoint-lineage information. The three regimes distinguish whether architecture and conditioning ar…

Original abstract

Neural network checkpoints have quietly become a large-scale data resource: millions of trained weight vectors now exist, each encoding task-, domain-, and architecture-specific knowledge. This position paper argues that model checkpoints should be treated as a first-class data modality, and that generative modeling in weight space should be standardized as a core machine learning primitive. Recent advances demonstrate that neural weights can be synthesized on demand, often matching fine-tuning performance while reducing adaptation cost by orders of magnitude. We contend that these results reflect an underlying structural fact: high-performing models occupy low-dimensional, highly structured regions of weight space shaped by symmetry, flatness, modularity, and shared subspaces. Building on this view, we organize existing methods into a five-stage pipeline, survey applications where the approach is already practical, and clarify current limits: adapter-scale and conditional generation are advancing rapidly, while unrestricted frontier-scale checkpoint synthesis remains open. Our goal is to shift the community's default mindset from optimizing models per task to sampling models from learned weight distributions, accelerating toward an era in which AI systems routinely improve or create other AI systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This position paper frames weight space as a generative modality with a useful pipeline but leaves the key scaling claims to frontier models untested.

Hi,

The main thing to know is that this position paper pushes to treat neural network weights as a first-class generative modality and organizes recent work into a five-stage pipeline. It argues this could let us sample new models instead of fine-tuning per task, with big potential cost savings.

What it does well is synthesize examples from adapter-scale and conditional generation, where synthesis already gets close to fine-tuning performance. The authors credit those results clearly and state outright that unrestricted frontier-scale checkpoint synthesis is still open. That keeps the discussion grounded.

The soft spot is the leap to low-dimensional structure. The claim that high-performing models sit in regions shaped by symmetry, flatness, and modularity rests on cited advances rather than new measurements or scaling analysis here. No effective dimension checks or ablations appear, so the orders-of-magnitude efficiency argument at full scale remains an extrapolation. The stress-test concern about this untested scaling holds up.

This is aimed at researchers working on model merging, hypernetworks, or self-improving systems. A reader who follows parameter-efficient methods will find the pipeline organizes the landscape in a straightforward way. It engages the literature honestly without contradictions or overclaims, so it deserves peer review to get feedback on the framing and what extra evidence would help.

Referee Report

2 major / 2 minor

Summary. The paper is a position paper arguing that neural network checkpoints should be treated as a first-class generative AI modality. It claims that high-performing models occupy low-dimensional, highly structured regions of weight space due to symmetry, flatness, modularity, and shared subspaces. Recent advances in weight synthesis are said to match fine-tuning performance at orders-of-magnitude lower adaptation cost. The authors organize existing methods into a five-stage pipeline, survey practical applications, and note that adapter-scale and conditional generation are advancing while unrestricted frontier-scale checkpoint synthesis remains open. The goal is to shift the community from per-task optimization toward sampling models from learned weight distributions.

Significance. If the position holds, it could drive a paradigm shift in machine learning by standardizing generative modeling over weight distributions, enabling AI systems to create or improve other models with substantially reduced compute. This would build directly on cited advances in adapters and conditional generation to realize large efficiency gains. The significance is tempered by the acknowledged open problem at frontier scale, but the framing as a core primitive could usefully redirect research priorities if the structural assumptions prove robust.

major comments (2)

[Structural fact paragraph] The paragraph beginning 'We contend that these results reflect an underlying structural fact': the central claim that observed synthesis results reflect low-dimensional, symmetric, flat, and modular structure enabling orders-of-magnitude cost reduction is asserted on the basis of adapter-scale and conditional-generation advances. No measurement of effective dimensionality, no scaling relation between manifold dimension and parameter count, and no ablation showing that these properties survive removal of adapters are supplied, leaving the extrapolation to unrestricted frontier-scale synthesis untested and load-bearing for the main thesis.
[Five-stage pipeline section] The section organizing existing methods into a five-stage pipeline: while the pipeline provides a useful taxonomy, the manuscript does not analyze how each stage would scale when the effective dimension of high-performing weight regions grows with model size, nor does it identify capacity limits of current generative models that could prevent the claimed cost reductions at frontier scale.
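
The effective-dimensionality measurement requested in the first major comment could be approached along the lines sketched below. This is an illustrative assumption, not an analysis from the paper: it runs PCA on flattened checkpoints and reports the participation ratio, one common estimator, alongside the number of components needed for 95% of the variance. The data are synthetic, with a planted 5-dimensional subspace, and the function name `effective_dimension` is hypothetical.

```python
import numpy as np

def effective_dimension(checkpoints, var_threshold=0.95):
    """Return (participation ratio, #PCs reaching var_threshold of variance).

    checkpoints: array of shape (n_models, n_params), one flattened model per row.
    """
    centered = checkpoints - checkpoints.mean(axis=0)
    s = np.linalg.svd(centered, compute_uv=False)
    eig = s ** 2  # unnormalized PCA eigenvalues
    participation_ratio = eig.sum() ** 2 / (eig ** 2).sum()
    cumulative = np.cumsum(eig) / eig.sum()
    n_components = int(np.searchsorted(cumulative, var_threshold) + 1)
    return participation_ratio, n_components

# Synthetic stand-in for a checkpoint collection: a planted 5-dimensional
# subspace inside a 1000-parameter space, plus small isotropic noise.
rng = np.random.default_rng(0)
n_models, n_params, planted_rank = 200, 1000, 5
basis = rng.normal(size=(planted_rank, n_params))
coeffs = rng.normal(size=(n_models, planted_rank))
checkpoints = coeffs @ basis + 0.01 * rng.normal(size=(n_models, n_params))

pr, n95 = effective_dimension(checkpoints)
print(f"participation ratio: {pr:.2f}, components for 95% variance: {n95}")
```

On real checkpoints the rows would first need to be aligned for permutation symmetry, and repeating the estimate across model sizes would give the scaling relation between effective dimension and parameter count that the referee asks for.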

minor comments (2)

The abstract states that 'millions of trained weight vectors now exist' without a supporting citation or rough estimate of the current scale of public checkpoints.
[Applications survey] The survey of applications would benefit from explicit cross-references to the specific performance numbers or cost-reduction factors reported in the cited works.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our position paper. We address the major comments below, clarifying our approach as a synthesis of existing work and outlining planned revisions to strengthen the manuscript.

Referee: [Structural fact paragraph] The paragraph beginning 'We contend that these results reflect an underlying structural fact': the central claim that observed synthesis results reflect low-dimensional, symmetric, flat, and modular structure enabling orders-of-magnitude cost reduction is asserted on the basis of adapter-scale and conditional-generation advances. No measurement of effective dimensionality, no scaling relation between manifold dimension and parameter count, and no ablation showing that these properties survive removal of adapters are supplied, leaving the extrapolation to unrestricted frontier-scale synthesis untested and load-bearing for the main thesis.

Authors: As a position paper, our intent is to highlight the implications of recent advances in weight-space generation rather than to conduct new empirical studies. The structural properties are supported by the body of cited work on neural network geometry. We will revise the relevant paragraph to explicitly note that the low-dimensional structure is inferred from adapter-scale results and to emphasize that extension to frontier-scale models is a motivating hypothesis rather than a proven fact. We will also incorporate additional citations on measurements of effective dimensionality in weight spaces to better ground the claim.
revision: partial

Referee: [Five-stage pipeline section] The section organizing existing methods into a five-stage pipeline: while the pipeline provides a useful taxonomy, the manuscript does not analyze how each stage would scale when the effective dimension of high-performing weight regions grows with model size, nor does it identify capacity limits of current generative models that could prevent the claimed cost reductions at frontier scale.

Authors: We agree that a more explicit discussion of scaling would be beneficial. In the revised manuscript, we will add analysis to the pipeline section addressing how the stages might be affected by increasing effective dimensionality and the known limitations of current generative models (e.g., mode collapse or computational intractability in very high dimensions). This will better contextualize why unrestricted frontier-scale synthesis is presented as an open challenge.
revision: yes

Circularity Check

0 steps flagged

No circularity: position paper references external advances without internal reduction

This is a high-level position paper that organizes existing methods into a five-stage pipeline and interprets recent external results as evidence for low-dimensional structure in weight space. No equations, fitted parameters, or derivations appear in the manuscript. The central contention that results 'reflect an underlying structural fact' is presented as an interpretive claim supported by cited prior work rather than a self-referential construction or load-bearing self-citation chain internal to this document. The paper explicitly flags frontier-scale synthesis as open, avoiding any claim that reduces to its own inputs by definition. This is the expected non-finding for a survey-style position statement.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper rests on one central domain assumption about the geometry of weight space and introduces no free parameters or new entities.

axioms (1)

domain assumption: High-performing models occupy low-dimensional, highly structured regions of weight space shaped by symmetry, flatness, modularity, and shared subspaces.
This structural fact is presented as the underlying reason generative modeling in weight space is feasible.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "high-performing models occupy low-dimensional, highly structured regions of weight space shaped by symmetry, flatness, modularity, and shared subspaces"

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "Permutation Symmetries and Quotient Geometry; Flatness and Low Intrinsic Dimension"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
