The unreasonable effectiveness of recurrent neural networks
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.PF (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
Training Transformers in Cosine Coefficient Space
Training transformers by optimizing only half the DCT coefficients per linear layer achieves validation loss within 0.024 of a dense baseline on Shakespeare character prediction, outperforming matched-parameter LoRA due to preserved rank flexibility.