arxiv: 2605.18632 · v1 · pith:EQ7YGT5Onew · submitted 2026-05-18 · 💻 cs.LG · cs.AI

Position: Weight Space Should Be a First-Class Generative AI Modality

Pith reviewed 2026-05-20 12:06 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords: weight space · generative modeling · neural network checkpoints · model synthesis · first-class modality · adapter generation · fine-tuning alternatives · structured weight regions

The pith

Treating neural network checkpoints as a first-class generative modality would let models be synthesized directly in weight space, often matching fine-tuning at far lower cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This position paper argues that the millions of existing trained neural network checkpoints form a valuable data resource that should be modeled directly as a generative modality, instead of optimizing a model from scratch for every new task. The key observation is that high-performing weights occupy low-dimensional structured regions shaped by symmetry, flatness, modularity, and shared subspaces, so generative models can sample new checkpoints on demand. A sympathetic reader would care because this could cut adaptation costs by orders of magnitude while shifting AI development from per-task optimization to sampling from learned weight distributions. The authors organize current methods into a five-stage pipeline, point to practical successes at adapter scale, and identify unrestricted frontier-scale synthesis as the remaining open problem.

Core claim

Neural network checkpoints should be treated as a first-class data modality, and generative modeling in weight space should be standardized as a core machine learning primitive. High-performing models occupy low-dimensional, highly structured regions of weight space shaped by symmetry, flatness, modularity, and shared subspaces, allowing weights to be synthesized on demand that often match fine-tuning performance while reducing adaptation cost by orders of magnitude.
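
Of the structural properties named in the core claim, symmetry is the easiest to check by hand. The sketch below is an illustration, not code from the paper: it builds a small two-layer ReLU network in NumPy and shows that reordering the hidden units changes the weight tensors but not the function they compute. The helper names `mlp_forward` and `permute_hidden_units` are hypothetical.

```python
import numpy as np

def mlp_forward(x, w1, b1, w2, b2):
    """Two-layer ReLU MLP: x -> relu(x @ w1 + b1) @ w2 + b2."""
    hidden = np.maximum(x @ w1 + b1, 0.0)
    return hidden @ w2 + b2

def permute_hidden_units(w1, b1, w2, perm):
    """Reorder hidden units: columns of w1, entries of b1, and rows of w2 move together."""
    return w1[:, perm], b1[perm], w2[perm, :]

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 4, 8, 3
w1 = rng.normal(size=(d_in, d_hidden))
b1 = rng.normal(size=d_hidden)
w2 = rng.normal(size=(d_hidden, d_out))
b2 = rng.normal(size=d_out)
x = rng.normal(size=(5, d_in))

# A cyclic shift of the hidden units (a non-identity permutation).
perm = np.roll(np.arange(d_hidden), 1)
w1_p, b1_p, w2_p = permute_hidden_units(w1, b1, w2, perm)

out_original = mlp_forward(x, w1, b1, w2, b2)
out_permuted = mlp_forward(x, w1_p, b1_p, w2_p, b2)

same_function = np.allclose(out_original, out_permuted)
weights_differ = not np.array_equal(w1, w1_p)
print("same function:", same_function, "| different weights:", weights_differ)
```

Because many distinct weight vectors implement the same function, a generative model over raw checkpoints has to either account for this redundancy or learn it from data.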

What carries the argument

Generative synthesis in weight space: learning distributions over trained checkpoints to sample new weight vectors that inherit the structural properties of high-performing models.

If this is right

  • New checkpoints can be created for specific tasks without running full fine-tuning or optimization from random initialization.
  • Adaptation to new domains or architectures becomes feasible at orders-of-magnitude lower compute cost than current practice.
  • Methods can be standardized into a five-stage pipeline covering data collection, representation learning, distribution modeling, sampling, and evaluation.
  • Practical deployment is already possible for adapter-scale and conditional generation settings.
  • AI systems can begin to improve or create other AI systems by sampling directly from learned weight distributions.
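
The five stages listed above can be made concrete with a deliberately tiny sketch. It is a toy under stated assumptions, not the authors' pipeline: each "checkpoint" is the weight vector of a linear regression model, the task family is synthetic and built to share a 2-dimensional subspace, the representation is plain PCA, and the distribution model is a single Gaussian. All function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, latent_dim, n_tasks, n_points = 10, 2, 200, 50
# Toy assumption: all task solutions live in one shared low-dimensional subspace.
basis = rng.normal(size=(latent_dim, dim))

def make_task(task_code):
    """Synthetic regression task whose true weights lie in the shared subspace."""
    true_w = task_code @ basis
    x = rng.normal(size=(n_points, dim))
    y = x @ true_w + 0.1 * rng.normal(size=n_points)
    return x, y

# Stage 1: data collection. Train one model per task and keep its weights.
def collect_checkpoints():
    checkpoints = []
    for _ in range(n_tasks):
        x, y = make_task(rng.normal(size=latent_dim))
        w, *_ = np.linalg.lstsq(x, y, rcond=None)
        checkpoints.append(w)
    return np.stack(checkpoints)

# Stage 2: representation learning. PCA codes for each checkpoint.
def fit_representation(checkpoints, k):
    mean = checkpoints.mean(axis=0)
    _, _, vt = np.linalg.svd(checkpoints - mean, full_matrices=False)
    components = vt[:k]
    codes = (checkpoints - mean) @ components.T
    return mean, components, codes

# Stage 3: distribution modeling. A single Gaussian over the codes.
def fit_distribution(codes):
    return codes.mean(axis=0), np.cov(codes, rowvar=False)

# Stage 4: sampling. Draw codes and decode them back to weight vectors.
def sample_weights(mean, components, code_mean, code_cov, n_samples):
    codes = rng.multivariate_normal(code_mean, code_cov, size=n_samples)
    return mean + codes @ components

# Stage 5: evaluation. Score sampled weights on a new task (no gradient steps).
def mse(w, x, y):
    return float(np.mean((x @ w - y) ** 2))

checkpoints = collect_checkpoints()
mean, components, codes = fit_representation(checkpoints, latent_dim)
code_mean, code_cov = fit_distribution(codes)
candidates = sample_weights(mean, components, code_mean, code_cov, n_samples=512)

x_new, y_new = make_task(np.array([1.0, -1.0]))
best_loss = min(mse(w, x_new, y_new) for w in candidates)
zero_loss = mse(np.zeros(dim), x_new, y_new)
print(f"best sampled loss: {best_loss:.3f} | all-zero baseline: {zero_loss:.3f}")
```

The evaluation step here is best-of-N selection among unconditional samples. That works in this toy only because the task family was constructed to be low-dimensional; whether real checkpoint populations behave this way is exactly the premise under discussion.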

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Model repositories could evolve from static collections into primary training corpora for meta-generative systems.
  • Conditional control over sampled weights might enable systematic creation of models with targeted properties such as efficiency or robustness.
  • The same low-dimensional structure could inform new approaches to model merging, compression, and modular composition.

Load-bearing premise

The structural properties observed in recent adapter-scale and conditional generation results will scale to unrestricted frontier-scale checkpoint synthesis without additional fundamental limitations.

What would settle it

An experiment in which generative synthesis from weight distributions fails to reach fine-tuning accuracy on a large new task, or in which no low-dimensional structured regions are found among frontier-model weights.
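
One way such an experiment could quantify "low-dimensional structured regions" is to measure the effective dimensionality of a population of flattened checkpoints. The sketch below is an assumption about how that might be done, not a procedure from the paper: it computes the PCA participation ratio, (Σλ)² / Σλ², and the number of principal components needed to explain 95% of the variance. The data are synthetic, and the function names are hypothetical.

```python
import numpy as np

def pca_variances(weight_matrix):
    """Per-component variances of a (n_models, n_params) matrix, largest first."""
    centered = weight_matrix - weight_matrix.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    return singular_values ** 2 / (weight_matrix.shape[0] - 1)

def participation_ratio(variances):
    """(sum of variances)^2 / sum of squared variances; ranges from 1 to n_params."""
    return float(variances.sum() ** 2 / np.sum(variances ** 2))

def components_for_variance(variances, fraction=0.95):
    """Smallest number of leading components whose variance share reaches `fraction`."""
    cumulative = np.cumsum(variances) / variances.sum()
    return int(np.searchsorted(cumulative, fraction) + 1)

rng = np.random.default_rng(0)
n_models, n_params, true_dim = 300, 50, 3

# Synthetic "structured" population: a 3-dimensional subspace plus small noise.
subspace = rng.normal(size=(true_dim, n_params))
structured = (rng.normal(size=(n_models, true_dim)) @ subspace
              + 0.01 * rng.normal(size=(n_models, n_params)))
# Synthetic "unstructured" population: isotropic noise in all 50 dimensions.
unstructured = rng.normal(size=(n_models, n_params))

pr_structured = participation_ratio(pca_variances(structured))
pr_unstructured = participation_ratio(pca_variances(unstructured))
k_structured = components_for_variance(pca_variances(structured))
k_unstructured = components_for_variance(pca_variances(unstructured))
print(f"structured:   PR={pr_structured:.1f}, components for 95%={k_structured}")
print(f"unstructured: PR={pr_unstructured:.1f}, components for 95%={k_unstructured}")
```

A linear measure like this can overestimate dimensionality when the structure is curved or hidden behind weight-space symmetries, so a negative result from PCA alone would not settle the question.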

Figures

Figures reproduced from arXiv: 2605.18632 by Kai Wang, Peihao Wang, Zhangyang Wang.

Figure 1: A regime map for weight-space generation (see §3). We frame neural weight generation as conditional sampling from p(W | A, C, R), where W denotes generated weights or weight updates, A specifies the architecture graph and tensor schema, C encodes task or user conditions, and R captures training-recipe and checkpoint-lineage information. The three regimes distinguish whether architecture and conditioning ar…
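
The conditional-sampling view in the caption, p(W | A, C, R), can be read as an interface: A fixes the names and shapes of the tensors to produce, while C and R shift the distribution they are drawn from. The sketch below is a hypothetical stand-in, not the paper's model: a linear-Gaussian sampler with an untrained random map, where the encodings of C and R as small feature vectors are assumptions made for illustration.

```python
from dataclasses import dataclass
import numpy as np

@dataclass(frozen=True)
class Conditioning:
    schema: dict        # A: tensor name -> shape (toy tensor schema)
    task: np.ndarray    # C: task or user condition embedding (assumed encoding)
    recipe: np.ndarray  # R: recipe / lineage features (assumed encoding)

class ToyConditionalSampler:
    """Linear-Gaussian stand-in for p(W | A, C, R): W ~ N([C; R] @ M, sigma^2 I)."""

    def __init__(self, schema, cond_dim, sigma=0.01, seed=0):
        self.schema = schema
        self.n_params = sum(int(np.prod(shape)) for shape in schema.values())
        self.sigma = sigma
        self.rng = np.random.default_rng(seed)
        # Untrained random map; a real system would learn it from checkpoints.
        self.cond_to_mean = (self.rng.normal(size=(cond_dim, self.n_params))
                             / np.sqrt(cond_dim))

    def sample(self, cond):
        if cond.schema != self.schema:
            raise ValueError("sampler was built for a different architecture schema")
        features = np.concatenate([cond.task, cond.recipe])
        noise = self.sigma * self.rng.normal(size=self.n_params)
        return self._unflatten(features @ self.cond_to_mean + noise)

    def _unflatten(self, flat):
        """Split a flat parameter vector into named tensors following the schema."""
        tensors, offset = {}, 0
        for name, shape in self.schema.items():
            size = int(np.prod(shape))
            tensors[name] = flat[offset:offset + size].reshape(shape)
            offset += size
        return tensors

schema = {"layer1.weight": (4, 8), "layer1.bias": (8,), "layer2.weight": (8, 2)}
sampler = ToyConditionalSampler(schema, cond_dim=5)
cond = Conditioning(schema=schema,
                    task=np.array([1.0, 0.0, 0.0]),
                    recipe=np.array([0.5, 0.5]))
weights = sampler.sample(cond)
print({name: tensor.shape for name, tensor in weights.items()})
```

This sampler is tied to a single schema, so it only illustrates the setting where the architecture is fixed; handling a varying A would need a model that reads the architecture graph itself.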

Original abstract

Neural network checkpoints have quietly become a large-scale data resource: millions of trained weight vectors now exist, each encoding task-, domain-, and architecture-specific knowledge. This position paper argues that model checkpoints should be treated as a first-class data modality, and that generative modeling in weight space should be standardized as a core machine learning primitive. Recent advances demonstrate that neural weights can be synthesized on demand, often matching fine-tuning performance while reducing adaptation cost by orders of magnitude. We contend that these results reflect an underlying structural fact: high-performing models occupy low-dimensional, highly structured regions of weight space shaped by symmetry, flatness, modularity, and shared subspaces. Building on this view, we organize existing methods into a five-stage pipeline, survey applications where the approach is already practical, and clarify current limits: adapter-scale and conditional generation are advancing rapidly, while unrestricted frontier-scale checkpoint synthesis remains open. Our goal is to shift the community's default mindset from optimizing models per task to sampling models from learned weight distributions, accelerating toward an era in which AI systems routinely improve or create other AI systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. This position paper argues that neural network checkpoints should be treated as a first-class generative AI modality. It claims that high-performing models occupy low-dimensional, highly structured regions of weight space due to symmetry, flatness, modularity, and shared subspaces. Recent advances in weight synthesis are said to often match fine-tuning performance at orders-of-magnitude lower adaptation cost. The authors organize existing methods into a five-stage pipeline, survey practical applications, and note that adapter-scale and conditional generation are advancing while unrestricted frontier-scale checkpoint synthesis remains open. The goal is to shift the community from per-task optimization toward sampling models from learned weight distributions.

Significance. If the position holds, it could drive a paradigm shift in machine learning by standardizing generative modeling over weight distributions, enabling AI systems to create or improve other models with substantially reduced compute. This would build directly on cited advances in adapters and conditional generation to realize large efficiency gains. The significance is tempered by the acknowledged open problem at frontier scale, but the framing as a core primitive could usefully redirect research priorities if the structural assumptions prove robust.

major comments (2)
  1. [Structural fact paragraph] The paragraph beginning 'We contend that these results reflect an underlying structural fact': the central claim that observed synthesis results reflect low-dimensional, symmetric, flat, and modular structure enabling orders-of-magnitude cost reduction is asserted on the basis of adapter-scale and conditional-generation advances. No measurement of effective dimensionality, no scaling relation between manifold dimension and parameter count, and no ablation showing that these properties survive removal of adapters are supplied, leaving the extrapolation to unrestricted frontier-scale synthesis untested and load-bearing for the main thesis.
  2. [Five-stage pipeline section] The section organizing existing methods into a five-stage pipeline: while the pipeline provides a useful taxonomy, the manuscript does not analyze how each stage would scale when the effective dimension of high-performing weight regions grows with model size, nor does it identify capacity limits of current generative models that could prevent the claimed cost reductions at frontier scale.
minor comments (2)
  1. The abstract states that 'millions of trained weight vectors now exist' without a supporting citation or rough estimate of the current scale of public checkpoints.
  2. [Applications survey] The survey of applications would benefit from explicit cross-references to the specific performance numbers or cost-reduction factors reported in the cited works.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our position paper. We address the major comments below, clarifying our approach as a synthesis of existing work and outlining planned revisions to strengthen the manuscript.

point-by-point responses
  1. Referee: [Structural fact paragraph] The paragraph beginning 'We contend that these results reflect an underlying structural fact': the central claim that observed synthesis results reflect low-dimensional, symmetric, flat, and modular structure enabling orders-of-magnitude cost reduction is asserted on the basis of adapter-scale and conditional-generation advances. No measurement of effective dimensionality, no scaling relation between manifold dimension and parameter count, and no ablation showing that these properties survive removal of adapters are supplied, leaving the extrapolation to unrestricted frontier-scale synthesis untested and load-bearing for the main thesis.

    Authors: As a position paper, our intent is to highlight the implications of recent advances in weight-space generation rather than to conduct new empirical studies. The structural properties are supported by the body of cited work on neural network geometry. We will revise the relevant paragraph to explicitly note that the low-dimensional structure is inferred from adapter-scale results and to emphasize that extension to frontier-scale models is a motivating hypothesis rather than a proven fact. We will also incorporate additional citations on measurements of effective dimensionality in weight spaces to better ground the claim.
    revision: partial

  2. Referee: [Five-stage pipeline section] The section organizing existing methods into a five-stage pipeline: while the pipeline provides a useful taxonomy, the manuscript does not analyze how each stage would scale when the effective dimension of high-performing weight regions grows with model size, nor does it identify capacity limits of current generative models that could prevent the claimed cost reductions at frontier scale.

    Authors: We agree that a more explicit discussion of scaling would be beneficial. In the revised manuscript, we will add analysis to the pipeline section addressing how the stages might be affected by increasing effective dimensionality and the known limitations of current generative models (e.g., mode collapse or computational intractability in very high dimensions). This will better contextualize why unrestricted frontier-scale synthesis is presented as an open challenge.
    revision: yes

Circularity Check

0 steps flagged

No circularity: position paper references external advances without internal reduction

full rationale

This is a high-level position paper that organizes existing methods into a five-stage pipeline and interprets recent external results as evidence for low-dimensional structure in weight space. No equations, fitted parameters, or derivations appear in the manuscript. The central contention that results 'reflect an underlying structural fact' is presented as an interpretive claim supported by cited prior work rather than a self-referential construction or load-bearing self-citation chain internal to this document. The paper explicitly flags frontier-scale synthesis as open, avoiding any claim that reduces to its own inputs by definition. This is the expected non-finding for a survey-style position statement.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on one central domain assumption about the geometry of weight space and introduces no free parameters or new entities.

axioms (1)
  • domain assumption: High-performing models occupy low-dimensional, highly structured regions of weight space shaped by symmetry, flatness, modularity, and shared subspaces.
    This structural fact is presented as the underlying reason generative modeling in weight space is feasible.

pith-pipeline@v0.9.0 · 5718 in / 1137 out tokens · 42386 ms · 2026-05-20T12:06:00.190538+00:00 · methodology
