Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer

Canonical reference. 100% of citing Pith papers cite this work as background.

19 Pith papers citing it
Background 100% of classified citations

citation-role summary

background 6

citation-polarity summary

polarities

background 6

representative citing papers

Deep Delta Learning

cs.LG · 2026-01-01 · unverdicted · novelty 7.0

Deep Delta Learning replaces additive residual updates with a gated delta-rule that selectively overwrites residual content along learned directions, improving language modeling quality over standard ResNet-style accumulation.

Training Deep Learning Models with Norm-Constrained LMOs

cs.LG · 2025-02-11 · unverdicted · novelty 7.0

Scion is a new stochastic LMO-based optimizer family that unifies existing methods, supports unconstrained problems, and delivers hyperparameter transferability plus speedups on nanoGPT training.

Sparse Layers are Critical to Scaling Looped Language Models

cs.LG · 2026-05-09 · unverdicted · novelty 6.0

Looped MoE models scale better than standard transformers because different experts activate on each loop pass, recovering expressivity without extra parameters, and support superior early exits.

OrScale: Orthogonalised Optimization with Layer-Wise Trust-Ratio Scaling

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

OrScale adds a Frobenius-norm trust-ratio layer-wise scaler to Muon’s orthogonalized updates, with per-layer calibration for language models, yielding higher CIFAR-10 accuracy and better language-model pre-training loss than Muon+Moonlight and AdamW.

Feature Starvation as Geometric Instability in Sparse Autoencoders

cs.LG · 2026-05-06 · unverdicted · novelty 6.0

Adaptive elastic net SAEs (AEN-SAEs) mitigate feature starvation in SAEs by combining ℓ2 structural stability with adaptive ℓ1 reweighting, producing a Lipschitz-continuous sparse coding map that recovers global feature support under mild assumptions.

Spectral Condition for μP under Width-Depth Scaling

cs.LG · 2026-02-28 · unverdicted · novelty 6.0

A unified spectral condition for μP under width-depth scaling reveals a transition at k=1 vs k≥2 transformations per residual block and enables stable feature learning for practical architectures like Transformers.

The Falcon Series of Open Language Models

cs.CL · 2023-11-28 · conditional · novelty 6.0

Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.

There Will Be a Scientific Theory of Deep Learning

stat.ML · 2026-04-23 · unverdicted · novelty 2.0

A mechanics of the learning process is emerging in deep learning theory, characterized by dynamics, coarse statistics, and falsifiable predictions across idealized settings, limits, laws, hyperparameters, and universal behaviors.
