MinGram is a simplified Unigram tokenizer training method that prioritizes token count minimization to deliver higher compression than BPE and standard Unigram while retaining competitive morphological alignment and superior bits-per-byte performance in language model training.
5 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (5)
Verdicts: UNVERDICTED (5)
Representative citing papers:
Repetition of training data produces a systematic eval loss peak at intermediate repeat counts whose location scales with model size, quantifiable as large compute-equivalent loss even at modest repetition fractions.
MIR improves validation loss in repeated-data pretraining and SoftQ fits data-constrained scaling experiments better than additive laws, equating MIR gains to roughly 1.3 times more unique data.
Attractor Models solve for fixed points in transformer embeddings using implicit differentiation to enable stable iterative refinement, delivering better perplexity, accuracy, and efficiency than standard or looped transformers.
GRASP is a scalable method for subset-level data attribution in pretraining that models interactions via a geometry-aware quadratic penalty and claims to double rank correlation while cutting costs.
Citing papers explorer
- MinGram: A Minimalist Unigram Tokenizer with High Compression and Competitive Morphological Alignment
  MinGram is a simplified Unigram tokenizer training method that prioritizes token count minimization to deliver higher compression than BPE and standard Unigram while retaining competitive morphological alignment and superior bits-per-byte performance in language model training.
- Internal Data Repetition Destroys Language Models
  Repetition of training data produces a systematic eval loss peak at intermediate repeat counts whose location scales with model size, quantifiable as large compute-equivalent loss even at modest repetition fractions.
- Data-Constrained Language Model Pretraining: Improved Regularization and Scaling Laws
  MIR improves validation loss in repeated-data pretraining and SoftQ fits data-constrained scaling experiments better than additive laws, equating MIR gains to roughly 1.3 times more unique data.
- Solve the Loop: Attractor Models for Language and Reasoning
  Attractor Models solve for fixed points in transformer embeddings using implicit differentiation to enable stable iterative refinement, delivering better perplexity, accuracy, and efficiency than standard or looped transformers.
- GRASP: Geometry-aware Residual Alignment for Scalable Pretraining Data Attribution
  GRASP is a scalable method for subset-level data attribution in pretraining that models interactions via a geometry-aware quadratic penalty and claims to double rank correlation while cutting costs.