Title resolution pending

Wang, M · 2023 · arXiv 2310.00692

3 Pith papers cite this work. Polarity classification is still indexing.

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver will retry the DOI and OpenAlex lookups on its next pass.

representative citing papers

Why Muon Outperforms Adam: A Curvature Perspective

cs.LG · 2026-06-03 · conditional · novelty 7.0

Muon outperforms Adam by reducing curvature penalty via lower Normalized Directional Sharpness, as shown via Taylor approximation on LLM training and proven on stylized quadratic problems with heterogeneous curvature.

How Should LLMs Consume High-Quality Data? Optimal Data Scheduling via Quality-Aware Functional Scaling Laws

cs.LG · 2026-05-25 · unverdicted · novelty 6.0

Extends functional scaling laws with data quality to derive optimal joint scheduling, proposing Drop-Stable-Rampup that improves accuracy by +1.70 over WSD and +2.98 over cosine decay on a 15B MoE model.

Negligible in Size, Significant in Effect: On Scale Vectors in Large Language Models

cs.LG · 2026-05-26 · unverdicted · novelty 5.0

Scale vectors in Pre-Norm LLMs aid optimization via preconditioning on linear layers rather than expressivity, and three lightweight modifications to them reduce terminal loss across model scales.

citing papers explorer

Showing 3 of 3 citing papers after filters.

Why Muon Outperforms Adam: A Curvature Perspective · cs.LG · 2026-06-03 · conditional · none · ref 191
How Should LLMs Consume High-Quality Data? Optimal Data Scheduling via Quality-Aware Functional Scaling Laws · cs.LG · 2026-05-25 · unverdicted · none · ref 2
Negligible in Size, Significant in Effect: On Scale Vectors in Large Language Models · cs.LG · 2026-05-26 · unverdicted · none · ref 41
