2 Pith papers cite this work. Polarity classification is still indexing.
Fields: cs.CL (2)
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
Citing papers
- NITP: Next Implicit Token Prediction for LLM Pre-training
  NITP adds dense supervision from shallow model layers to predict implicit next-token semantics, yielding consistent downstream gains on 0.5B-9B models with ~2% extra training FLOPs.
- Efficient Pre-Training with Token Superposition
  Token-Superposition Training combines multiple tokens into bags for multi-hot cross-entropy pre-training followed by a recovery phase, yielding up to 2.5x reduction in training time at 10B scale under equal-loss conditions.
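The multi-hot cross-entropy mentioned in the Token Superposition summary can be illustrated with a minimal sketch. It assumes the target spreads probability mass evenly over the tokens in a bag. That normalization and the function names `log_softmax` and `multi_hot_cross_entropy` are illustrative assumptions, not taken from the paper.

```python
import math
from collections import Counter


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def multi_hot_cross_entropy(logits, bag):
    """Cross-entropy against a target that puts equal mass on each token in `bag`.

    Assumption (hypothetical): repeated tokens in the bag count multiple times,
    and the target is normalized to sum to 1.
    """
    if not bag:
        raise ValueError("bag must contain at least one token id")
    counts = Counter(bag)
    total = len(bag)
    log_probs = log_softmax(logits)
    return -sum((c / total) * log_probs[tok] for tok, c in counts.items())


# A bag of two token ids scored against uniform logits over a 4-token vocabulary.
uniform_loss = multi_hot_cross_entropy([0.0, 0.0, 0.0, 0.0], bag=[1, 3])

# A single-token bag reduces to ordinary next-token cross-entropy.
single_loss = multi_hot_cross_entropy([2.0, 0.0], bag=[0])
```

With a bag of size one, this is the standard one-hot loss, which suggests how a recovery phase could return to ordinary next-token training.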