TinyLlama: An Open-Source Small Language Model

Peiyuan Zhang, Guangtao Zeng, Tianduo Wang, Wei Lu · 2024 · cs.CL · arXiv 2401.02385

Canonical reference. 100% of citing Pith papers cite this work as background.

79 Pith papers citing it

Background 100% of classified citations

abstract

We present TinyLlama, a compact 1.1B language model pretrained on around 1 trillion tokens for approximately 3 epochs. Building on the architecture and tokenizer of Llama 2, TinyLlama leverages various advances contributed by the open-source community (e.g., FlashAttention and Lit-GPT), achieving better computational efficiency. Despite its relatively small size, TinyLlama demonstrates remarkable performance in a series of downstream tasks. It significantly outperforms existing open-source language models with comparable sizes. Our model checkpoints and code are publicly available on GitHub at https://github.com/jzhang38/TinyLlama.

citation-role summary: background 5

citation-polarity summary: background 5

representative citing papers

CBD: API-Only LLM Black-Box Unlearning through Controlled Behavioral Divergence

cs.LG · 2026-06-26 · unverdicted · novelty 7.0

CBD is an API-only black-box unlearning method for LLMs that creates controlled behavioral divergence with auxiliary models and uses a Fisher-matrix-derived discriminative basis to balance forgetting target data with retained utility.

Explaining Attention with Program Synthesis

cs.LG · 2026-06-17 · unverdicted · novelty 7.0 · 2 refs

Language-model-guided program synthesis can approximate transformer attention heads with over 75% IoU fidelity on held-out data and allow replacing 25% of heads with only 16% average perplexity increase.

Trajectory Geometry of Transformer Representations Across Layers

cs.LG · 2026-06-08 · unverdicted · novelty 7.0

Transformer representations form trajectories showing semantic convergence in middle-to-late layers, higher curvature on reasoning tasks, bifurcation on ambiguous tokens, and a consistent three-phase cosine similarity pattern across GPT-2, TinyLlama, and Qwen2.5.

CollabSim: A CSCW-Grounded Methodology for Investigating Collaborative Competence of LLM Agents through Controlled Multi-Agent Experiments

cs.CL · 2026-06-04 · unverdicted · novelty 7.0

CollabSim is a new CSCW-grounded simulation framework that enables controlled multi-agent experiments to measure collaborative competence in LLM agents.

Representational Capacity: Geometric Limits on Feature Representation in Transformer Language Models

cs.LG · 2026-06-01 · unverdicted · novelty 7.0

Defines representational capacity as the upper bound on distinguishable near-orthogonal directions in transformer latent spaces, derived from embedding similarity distributions and an adjusted Johnson-Lindenstrauss formula dependent on the k/d ratio.

Enjoy Your Layer Normalization with the Computational Efficiency of RMSNorm

cs.LG · 2026-05-14 · conditional · novelty 7.0

A framework to identify and convert foldable layer normalizations to RMSNorm for exact equivalence and faster inference in deep neural networks.

Precision Tracked Transformer via Kalman Filtering, Kriging and Process Noise

cs.LG · 2026-05-12 · unverdicted · novelty 7.0

Bayesian Filtering Transformer reframes attention as precision-weighted kriging and residual connections as Kalman updates, delivering gains on cold-start recommendation and noisy LLM fine-tuning tasks.

PASA: A Principled Embedding-Space Watermarking Approach for LLM-Generated Text under Semantic-Invariant Attacks

cs.CR · 2026-05-09 · unverdicted · novelty 7.0 · 2 refs

PASA is an embedding-space watermarking method for LLM text that uses semantic clusters and synchronized randomness to achieve robustness against paraphrasing while remaining distortion-free.

When the Ruler is Broken: Parsing-Induced Suppression in LLM-Based Security Log Evaluation

cs.CR · 2026-05-08 · conditional · novelty 7.0

Strict regex parsing of LLM security log outputs introduces systematic errors that can make functional models appear non-functional, with a 76-point accuracy gap recovered by fuzzy parsing.

Tempus: A Temporally Scalable Resource-Invariant GEMM Streaming Framework for Versal AI Edge

cs.DC · 2026-05-01 · unverdicted · novelty 7.0

Tempus delivers 607 GOPS at 10.677 W using fixed 16 AIE cores on Versal AI Edge, with 211.2x better platform-aware utility than spatial SOTA ARIES and zero URAM/DSP utilization.

BoostTaxo: Zero-Shot Taxonomy Induction via Boosting-Style Agentic Reasoning and Constraint-Aware Calibration

cs.CL · 2026-04-03 · unverdicted · novelty 7.0

BoostTaxo introduces a boosting-style LLM framework for zero-shot taxonomy induction that uses hybrid candidate selection and constraint-aware calibration to achieve superior or comparable performance to prior methods on WordNet, DBLP, and SemEval-Sci benchmarks.

Test Case Selection for Deep Neural Networks: A Replication Study on LLMs for Code

cs.SE · 2026-06-25 · unverdicted · novelty 6.0

Replication of TCS strategies on 17 LLM instances across three code tasks shows only partial generalization from vision DNN results, with uncertainty features aiding early failure discovery and representation features aiding accuracy estimation.

PRIME: Evaluating Prompt Resolution Under Incompatible Instructions in LLMs

cs.AI · 2026-06-21 · unverdicted · novelty 6.0

PRIME is a new evaluation framework that creates calibrated conflicts in LLM prompts and finds conflict type affects model behavior more than scale.

Tracking Representation Dynamics in Large Language Models with Persistent Homology

cs.LG · 2026-06-17 · unverdicted · novelty 6.0

Persistent homology analysis of LLM activations shows most topological reorganization occurs early in fine-tuning, with a transient peak followed by stabilization and distinct trajectories for different alignment objectives.

ARIADNE: Agnostic Routing for Inference-time Adapter DyNamic sElection

cs.AI · 2026-06-17 · unverdicted · novelty 6.0

ARIADNE routes queries to the best adapter via embedding-space centroid proximity, recovering 97.44% of upper-bound performance on 23 NLP tasks and 89.7% selection accuracy on 44 tasks without training or internal access.

RegMix-D: Dynamic Data Mixing via Proxy Training Trajectories

cs.CL · 2026-06-17 · unverdicted · novelty 6.0

RegMix-D fits regression models to proxy loss trajectories to produce dynamic data mixture schedules that outperform static RegMix and DoReMi on 25B-token Pile pretraining with a 1B model.

BLADE: Scalable Bi-level Adaptive Data Selection for LLM Training

cs.LG · 2026-06-17 · unverdicted · novelty 6.0

BLADE converts influence-based bi-level data selection into a Hessian-free penalized objective with a dynamic reference model, proves first-order convergence, and reports better performance than prior methods on LLM training.

Explaining Data Mixing Scaling Laws

cs.LG · 2026-06-06 · unverdicted · novelty 6.0

A framework using capacity competition and noise reduction under an overlapping-skills assumption explains multi-domain loss behaviors and extrapolates optimal mixtures to large scales from small-scale fits with fewer parameters.

LLM Compression with Jointly Optimizing Architectural and Quantization choices

cs.LG · 2026-06-02 · unverdicted · novelty 6.0

A differentiable NAS framework jointly optimizes LLM architecture and mixed-precision quantization for linear layers, yielding up to 1.4x faster inference or 6% higher accuracy than sequential baselines on reasoning tasks.

MOC: Multi-Order Communication in LLM-based Multi-Agent Systems

cs.AI · 2026-06-01 · unverdicted · novelty 6.0

MOC formalizes a multi-order evidence stream and Semantic-Topological Merging algorithm that improves task performance while cutting communication costs on six datasets.

Harmonic: Hierarchical State Space Models for Efficient Long-Context Language Modeling

cs.CL · 2026-05-30 · unverdicted · novelty 6.0

Hierarchical SSM architecture Harmonic outperforms Transformers and Mamba on long-context language modeling up to 64K tokens and removes RoPE limits at 1B scale while maintaining O(L) compute.

Rethinking the Role of Temperature in Large Language Model Distillation

cs.LG · 2026-05-29 · unverdicted · novelty 6.0

Including temperature scaling makes forward KL divergence outperform reverse KL in LLM distillation on instruction benchmarks, overturning the τ=1 preference for reverse KL.

De-attribute to Forget for LLM Unlearning

cs.LG · 2026-05-29 · unverdicted · novelty 6.0

DareU reframes LLM unlearning as zeroing data attribution via RL rewards from an LLM classifier approximation, claiming better balance of forget quality and model utility than loss-based baselines.

Strong Teacher Not Needed? On Distillation in LLM Pretraining

cs.LG · 2026-05-22 · unverdicted · novelty 6.0

Even small or undertrained teachers improve larger LLM students via distillation with tuned loss mixing, while stronger teachers can saturate or reverse gains and distillation aids generalization more than in-domain fit.
