TinyLlama: An Open-Source Small Language Model

Canonical reference. 100% of citing Pith papers cite this work as background.

79 Pith papers citing it
Background 100% of classified citations
abstract

We present TinyLlama, a compact 1.1B language model pretrained on around 1 trillion tokens for approximately 3 epochs. Building on the architecture and tokenizer of Llama 2, TinyLlama leverages various advances contributed by the open-source community (e.g., FlashAttention and Lit-GPT), achieving better computational efficiency. Despite its relatively small size, TinyLlama demonstrates remarkable performance in a series of downstream tasks. It significantly outperforms existing open-source language models with comparable sizes. Our model checkpoints and code are publicly available on GitHub at https://github.com/jzhang38/TinyLlama.

citation-role summary

background 5

citation-polarity summary

background 5

representative citing papers

Explaining Attention with Program Synthesis

cs.LG · 2026-06-17 · unverdicted · novelty 7.0 · 2 refs

Language-model-guided program synthesis can approximate transformer attention heads with over 75% IoU fidelity on held-out data and allow replacing 25% of heads with only 16% average perplexity increase.

Trajectory Geometry of Transformer Representations Across Layers

cs.LG · 2026-06-08 · unverdicted · novelty 7.0

Transformer representations form trajectories showing semantic convergence in middle-to-late layers, higher curvature on reasoning tasks, bifurcation on ambiguous tokens, and a consistent three-phase cosine similarity pattern across GPT-2, TinyLlama, and Qwen2.5.

BLADE: Scalable Bi-level Adaptive Data Selection for LLM Training

cs.LG · 2026-06-17 · unverdicted · novelty 6.0

BLADE converts influence-based bi-level data selection into a Hessian-free penalized objective with a dynamic reference model, proves first-order convergence, and reports better performance than prior methods on LLM training.

Explaining Data Mixing Scaling Laws

cs.LG · 2026-06-06 · unverdicted · novelty 6.0

A framework using capacity competition and noise reduction under an overlapping-skills assumption explains multi-domain loss behaviors and extrapolates optimal mixtures to large scales from small-scale fits with fewer parameters.

De-attribute to Forget for LLM Unlearning

cs.LG · 2026-05-29 · unverdicted · novelty 6.0

DareU reframes LLM unlearning as zeroing data attribution via RL rewards from an LLM classifier approximation, claiming better balance of forget quality and model utility than loss-based baselines.

Strong Teacher Not Needed? On Distillation in LLM Pretraining

cs.LG · 2026-05-22 · unverdicted · novelty 6.0

Even small or undertrained teachers improve larger LLM students via distillation with tuned loss mixing, while stronger teachers can saturate or reverse gains and distillation aids generalization more than in-domain fit.
