TinyStories: How Small Can Language Models Be and Still Speak Coherent English?

Ronen Eldan, Yuanzhi Li · 2024 · arXiv 2305.07759

Canonical reference. 83% of citing Pith papers cite this work as background.

23 Pith papers citing it

Background 83% of classified citations

citation-role summary

background: 5 · dataset: 1

citation-polarity summary

background: 5 · use dataset: 1

representative citing papers

Quantamination: Dynamic Quantization Leaks Your Data Across the Batch

cs.CR · 2026-04-29 · conditional · novelty 7.0

Dynamic quantization creates side channels allowing partial or full recovery of other users' batched data in at least four popular ML frameworks.

Neural Continuous-Time Markov Chain: Discrete Diffusion via Decoupled Jump Timing and Direction

cs.LG · 2026-04-17 · unverdicted · novelty 7.0 · 2 refs

Neural CTMC decouples jump timing and direction in continuous-time Markov chain diffusion via dedicated heads, achieving lower perplexity on TinyStories (16.36) and OpenWebText than GIDD or MDLM at equivalent training budgets.

How Do Transformers Learn to Associate Tokens: Gradient Leading Terms Bring Mechanistic Interpretability

cs.CL · 2026-01-27 · unverdicted · novelty 7.0

Transformer weights at early training stages are closed-form compositions of bigram, token-interchangeability, and context mappings that directly reflect text-corpus statistics and explain the emergence of semantic associations.

How does the optimizer implicitly bias the model merging loss landscape?

cs.LG · 2025-10-06 · unverdicted · novelty 7.0

Effective noise scale non-monotonically governs model merging success with an optimum, unifying effects of learning rate, weight decay, batch size, and augmentation on the loss landscape.

RACE Attention: A Strictly Linear-Time Attention Layer for Training on Outrageously Large Contexts

cs.LG · 2025-10-05 · unverdicted · novelty 7.0

RACE Attention is a strictly linear-time attention mechanism that approximates softmax attention outputs using Gaussian projections and soft LSH to enable training on contexts up to 12 million tokens.

SeedPrints: Fingerprints Can Even Tell Which Seed Your Large Language Model Was Trained From

cs.CR · 2025-09-30 · unverdicted · novelty 7.0

SeedPrints fingerprints LLMs using persistent biases from initialization seeds for lineage verification across pretraining and adaptation stages.

All is Not Lost: LLM Recovery without Checkpoints

cs.DC · 2025-06-18 · conditional · novelty 7.0

CheckFree recovers intermediate stage failures in pipeline-parallel LLM training via neighbor averaging; CheckFree+ adds out-of-order execution to handle first/last stages by copying neighbors, with small embedding storage, outperforming checkpointing and redundancy at 5-10% failure rates by up to

TF1-EN-3M: Three Million Synthetic Moral Fables for Training Small, Open Language Models

cs.CL · 2025-04-29 · unverdicted · novelty 7.0

The authors generate and publicly release the first large-scale open dataset of three million structured moral fables produced by small open language models together with a reproducible LLM-judge evaluation pipeline.

Towards Human-Level Book-Writing Capability

cs.AI · 2026-05-16 · unverdicted · novelty 6.0

A prompt-to-book training framework that derives hierarchical summaries from public-domain novels and inverts them to supervise long-context models toward human literary prose instead of assistant-style output.

Primal-Dual Guided Decoding for Constrained Discrete Diffusion

cs.AI · 2026-05-10 · unverdicted · novelty 6.0

Primal-dual guided decoding casts constrained discrete diffusion as a KL-regularized optimization solved online with adaptive Lagrangian multipliers to satisfy constraints while staying close to the unconstrained model distribution.

Practical Scaling Laws: Converting Compute into Performance in a Data-Constrained World

cs.LG · 2026-05-09 · conditional · novelty 6.0

A new scaling law L(N, D, T) = E + (L0 - E) h/(1+h) with h = a/N^α + b/T^β + c N^γ/D^δ that decomposes loss into undercapacity, undertraining, and overfitting terms and saturates between E and L0.

TextLDM: Language Modeling with Continuous Latent Diffusion

cs.CL · 2026-05-08 · unverdicted · novelty 6.0

TextLDM applies DiT-style latent diffusion with flow matching to language modeling via a REPA-aligned VAE, outperforming prior diffusion LMs and matching GPT-2 when trained from scratch on OpenWebText2.

BackFlush: Knowledge-Free Backdoor Detection and Elimination with Watermark Preservation in Large Language Models

cs.CR · 2026-04-15 · unverdicted · novelty 6.0

BackFlush detects backdoors via susceptibility amplification and eliminates them with RoPE unlearning to reach 1% ASR and 99% clean accuracy while preserving watermarks.

Latent Planning Emerges with Scale

cs.CL · 2026-04-14 · unverdicted · novelty 6.0

Latent planning ability in LLMs emerges and strengthens with scale, shown through internal features that represent future words and influence token choices on planning and rhyming tasks.

Differences in Text Generated by Diffusion and Autoregressive Language Models

cs.CL · 2026-04-04 · unverdicted · novelty 6.0

DLMs exhibit lower n-gram entropy, higher semantic coherence, and higher semantic diversity than ARMs, primarily due to bidirectional context and remasking decoding strategies.

Two failure modes of deep transformers and how to avoid them: a unified theory of signal propagation at initialisation

stat.ML · 2025-05-30 · unverdicted · novelty 6.0

Analytical theory of signal propagation in deep transformers at initialization yields quantitative prescriptions for weights and residuals to avoid rank and entropy collapse via Random Energy Model analogy.

MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models

cs.CL · 2023-09-21 · conditional · novelty 6.0

Bootstrapping math questions via rewriting creates MetaMathQA; fine-tuning LLaMA-2 on it yields 66.4% on GSM8K for 7B and 82.3% for 70B, beating prior same-size models by large margins.

Textbooks Are All You Need II: phi-1.5 technical report

cs.CL · 2023-09-11 · unverdicted · novelty 6.0

phi-1.5 is a 1.3B parameter model trained on synthetic textbook data that matches the reasoning performance of models five times larger on natural language, math, and basic coding tasks.

Textbooks Are All You Need

cs.CL · 2023-06-20 · unverdicted · novelty 6.0

A 1.3B-parameter code model trained on 7B tokens of curated textbook and synthetic data achieves 50.6% on HumanEval, indicating data quality can enable strong performance at small scale.

Seed Bank, Co-op, Stoop Swap: Metaphors for Governing Language Model Data for Creative Writing

cs.HC · 2026-05-13 · unverdicted · novelty 5.0

Workshops with over 100 creative writers produced metaphors and four themes for language model governance that favor consent-driven, smaller open models encoding community values.

Path Integral Solution for Dissipative Generative Dynamics

cs.LG · 2025-12-30 · unverdicted · novelty 5.0

Language generation requires dissipative quantum dynamics with non-local aggregation, not conservation laws, framing it as dissipative quantum field theory.

Small Language Models (SLMs) Can Still Pack a Punch: A survey (updated 2026)

cs.CL · 2025-01-03 · unverdicted · novelty 2.0

A literature survey of Small Language Models (1-8B parameters) that can perform comparably or better than larger models, covering general-purpose and task-specific approaches plus creation techniques.

Next-Latent Prediction Transformers Learn Compact World Models

cs.LG · 2025-11-08 · unreviewed

citing papers explorer

Showing 23 of 23 citing papers.

Quantamination: Dynamic Quantization Leaks Your Data Across the Batch cs.CR · 2026-04-29 · conditional · none · ref 19 · internal anchor
Neural Continuous-Time Markov Chain: Discrete Diffusion via Decoupled Jump Timing and Direction cs.LG · 2026-04-17 · unverdicted · none · ref 2 · 2 links · internal anchor
How Do Transformers Learn to Associate Tokens: Gradient Leading Terms Bring Mechanistic Interpretability cs.CL · 2026-01-27 · unverdicted · none · ref 6 · internal anchor
How does the optimizer implicitly bias the model merging loss landscape? cs.LG · 2025-10-06 · unverdicted · none · ref 2 · internal anchor
RACE Attention: A Strictly Linear-Time Attention Layer for Training on Outrageously Large Contexts cs.LG · 2025-10-05 · unverdicted · none · ref 13 · internal anchor
SeedPrints: Fingerprints Can Even Tell Which Seed Your Large Language Model Was Trained From cs.CR · 2025-09-30 · unverdicted · none · ref 2 · internal anchor
All is Not Lost: LLM Recovery without Checkpoints cs.DC · 2025-06-18 · conditional · none · ref 8 · internal anchor
TF1-EN-3M: Three Million Synthetic Moral Fables for Training Small, Open Language Models cs.CL · 2025-04-29 · unverdicted · none · ref 4 · internal anchor
Towards Human-Level Book-Writing Capability cs.AI · 2026-05-16 · unverdicted · none · ref 14 · internal anchor
Primal-Dual Guided Decoding for Constrained Discrete Diffusion cs.AI · 2026-05-10 · unverdicted · none · ref 40 · internal anchor
Practical Scaling Laws: Converting Compute into Performance in a Data-Constrained World cs.LG · 2026-05-09 · conditional · none · ref 15 · internal anchor
TextLDM: Language Modeling with Continuous Latent Diffusion cs.CL · 2026-05-08 · unverdicted · none · ref 4 · internal anchor
BackFlush: Knowledge-Free Backdoor Detection and Elimination with Watermark Preservation in Large Language Models cs.CR · 2026-04-15 · unverdicted · none · ref 39 · internal anchor
Latent Planning Emerges with Scale cs.CL · 2026-04-14 · unverdicted · none · ref 2 · internal anchor
Differences in Text Generated by Diffusion and Autoregressive Language Models cs.CL · 2026-04-04 · unverdicted · none · ref 7 · internal anchor
Two failure modes of deep transformers and how to avoid them: a unified theory of signal propagation at initialisation stat.ML · 2025-05-30 · unverdicted · none · ref 6 · internal anchor
MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models cs.CL · 2023-09-21 · conditional · none · ref 18 · internal anchor
Textbooks Are All You Need II: phi-1.5 technical report cs.CL · 2023-09-11 · unverdicted · none · ref 10 · internal anchor
Textbooks Are All You Need cs.CL · 2023-06-20 · unverdicted · none · ref 12 · internal anchor
Seed Bank, Co-op, Stoop Swap: Metaphors for Governing Language Model Data for Creative Writing cs.HC · 2026-05-13 · unverdicted · none · ref 20 · internal anchor
Path Integral Solution for Dissipative Generative Dynamics cs.LG · 2025-12-30 · unverdicted · none · ref 30 · internal anchor
Small Language Models (SLMs) Can Still Pack a Punch: A survey (updated 2026) cs.CL · 2025-01-03 · unverdicted · none · ref 32 · internal anchor
Next-Latent Prediction Transformers Learn Compact World Models cs.LG · 2025-11-08 · unreviewed · ref 7 · internal anchor
