Multi-token prediction induces a two-stage reverse reasoning process in Transformers via gradient decoupling, improving planning on synthetic and realistic tasks.
Efficient joint prediction of multiple future tokens.arXiv preprint arXiv:2503.21801
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
fields
cs.LG 2representative citing papers
citing papers explorer
-
How Transformers Learn to Plan via Multi-Token Prediction
Multi-token prediction induces a two-stage reverse reasoning process in Transformers via gradient decoupling, improving planning on synthetic and realistic tasks.
- Next-Latent Prediction Transformers Learn Compact World Models