Gradient descent maximizes the margin of homogeneous neural networks
7 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.LG 7
representative citing papers
citing papers explorer
- The Implicit Bias of Depth: From Neural Collapse to Softmax Codes
Depth induces an implicit low-rank bias in deep unconstrained feature models trained with unregularized multiclass cross-entropy, promoting softmax codes over neural collapse via more efficient norm propagation.
- Efficient Techniques for Data Reconstruction, with Finite-Width Recovery Guarantees
A unified data reconstruction attack achieves provable finite-width recovery in random feature networks and efficient subspace-based reconstruction for general models using weight changes.
- Implicit Bias in Deep Linear Discriminant Analysis
Gradient flow on deep diagonal linear LDA networks with balanced initialization converts additive updates to multiplicative updates, automatically conserving the (2/L) quasi-norm.
- The Effect of Mini-Batch Noise on the Implicit Bias of Adam
Mini-batch noise reverses how Adam's β2 controls anti-regularization, making default momentum values suitable for small batches but requiring β1 closer to β2 for large batches to favor flatter minima.
- Prediction horizon shapes representations in predictive learning
Longer prediction horizons in predictive learning interact with model biases to recover the latent geometry of the task.
- Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models
SPIN strengthens a weak LLM by having it generate training data from its own previous versions and then training it to prefer human-annotated responses over those self-generated outputs; on benchmarks it outperforms DPO even when DPO is supplemented with extra GPT-4 preference data.
- The Neural Tangent Kernel for Classification