CBD is an API-only black-box unlearning method for LLMs that creates controlled behavioral divergence with auxiliary models and uses a Fisher-matrix-derived discriminative basis to balance forgetting target data with retained utility.
hub Canonical reference
TinyLlama: An Open-Source Small Language Model
Canonical reference. 100% of citing Pith papers cite this work as background.
abstract
We present TinyLlama, a compact 1.1B language model pretrained on around 1 trillion tokens for approximately 3 epochs. Building on the architecture and tokenizer of Llama 2, TinyLlama leverages various advances contributed by the open-source community (e.g., FlashAttention and Lit-GPT), achieving better computational efficiency. Despite its relatively small size, TinyLlama demonstrates remarkable performance in a series of downstream tasks. It significantly outperforms existing open-source language models with comparable sizes. Our model checkpoints and code are publicly available on GitHub at https://github.com/jzhang38/TinyLlama.
hub tools
citation-role summary
citation-polarity summary
roles
background 5polarities
background 5representative citing papers
Language-model-guided program synthesis can approximate transformer attention heads with over 75% IoU fidelity on held-out data and allow replacing 25% of heads with only 16% average perplexity increase.
Transformer representations form trajectories showing semantic convergence in middle-to-late layers, higher curvature on reasoning tasks, bifurcation on ambiguous tokens, and a consistent three-phase cosine similarity pattern across GPT-2, TinyLlama, and Qwen2.5.
CollabSim is a new CSCW-grounded simulation framework that enables controlled multi-agent experiments to measure collaborative competence in LLM agents.
Defines representational capacity as the upper bound on distinguishable near-orthogonal directions in transformer latent spaces, derived from embedding similarity distributions and an adjusted Johnson-Lindenstrauss formula dependent on the k/d ratio.
A framework to identify and convert foldable layer normalizations to RMSNorm for exact equivalence and faster inference in deep neural networks.
Bayesian Filtering Transformer reframes attention as precision-weighted kriging and residual connections as Kalman updates, delivering gains on cold-start recommendation and noisy LLM fine-tuning tasks.
PASA is an embedding-space watermarking method for LLM text that uses semantic clusters and synchronized randomness to achieve robustness against paraphrasing while remaining distortion-free.
Strict regex parsing of LLM security log outputs introduces systematic errors that can make functional models appear non-functional, with a 76-point accuracy gap recovered by fuzzy parsing.
Tempus delivers 607 GOPS at 10.677 W using fixed 16 AIE cores on Versal AI Edge, with 211.2x better platform-aware utility than spatial SOTA ARIES and zero URAM/DSP utilization.
BoostTaxo introduces a boosting-style LLM framework for zero-shot taxonomy induction that uses hybrid candidate selection and constraint-aware calibration to achieve superior or comparable performance to prior methods on WordNet, DBLP, and SemEval-Sci benchmarks.
Replication of TCS strategies on 17 LLM instances across three code tasks shows only partial generalization from vision DNN results, with uncertainty features aiding early failure discovery and representation features aiding accuracy estimation.
PRIME is a new evaluation framework that creates calibrated conflicts in LLM prompts and finds conflict type affects model behavior more than scale.
Persistent homology analysis of LLM activations shows most topological reorganization occurs early in fine-tuning, with a transient peak followed by stabilization and distinct trajectories for different alignment objectives.
ARIADNE routes queries to the best adapter via embedding-space centroid proximity, recovering 97.44% of upper-bound performance on 23 NLP tasks and 89.7% selection accuracy on 44 tasks without training or internal access.
RegMix-D fits regression models to proxy loss trajectories to produce dynamic data mixture schedules that outperform static RegMix and DoReMi on 25B-token Pile pretraining with a 1B model.
BLADE converts influence-based bi-level data selection into a Hessian-free penalized objective with a dynamic reference model, proves first-order convergence, and reports better performance than prior methods on LLM training.
A framework using capacity competition and noise reduction under an overlapping-skills assumption explains multi-domain loss behaviors and extrapolates optimal mixtures to large scales from small-scale fits with fewer parameters.
A differentiable NAS framework jointly optimizes LLM architecture and mixed-precision quantization for linear layers, yielding up to 1.4x faster inference or 6% higher accuracy than sequential baselines on reasoning tasks.
MOC formalizes a multi-order evidence stream and Semantic-Topological Merging algorithm that improves task performance while cutting communication costs on six datasets.
Hierarchical SSM architecture Harmonic outperforms Transformers and Mamba on long-context language modeling up to 64K tokens and removes RoPE limits at 1B scale while maintaining O(L) compute.
Including temperature scaling makes forward KL divergence outperform reverse KL in LLM distillation on instruction benchmarks, overturning the τ=1 preference for reverse KL.
DareU reframes LLM unlearning as zeroing data attribution via RL rewards from an LLM classifier approximation, claiming better balance of forget quality and model utility than loss-based baselines.
Even small or undertrained teachers improve larger LLM students via distillation with tuned loss mixing, while stronger teachers can saturate or reverse gains and distillation aids generalization more than in-domain fit.
citing papers explorer
-
CBD: API-Only LLM Black-Box Unlearning through Controlled Behavioral Divergence
CBD is an API-only black-box unlearning method for LLMs that creates controlled behavioral divergence with auxiliary models and uses a Fisher-matrix-derived discriminative basis to balance forgetting target data with retained utility.
-
Explaining Attention with Program Synthesis
Language-model-guided program synthesis can approximate transformer attention heads with over 75% IoU fidelity on held-out data and allow replacing 25% of heads with only 16% average perplexity increase.
-
Trajectory Geometry of Transformer Representations Across Layers
Transformer representations form trajectories showing semantic convergence in middle-to-late layers, higher curvature on reasoning tasks, bifurcation on ambiguous tokens, and a consistent three-phase cosine similarity pattern across GPT-2, TinyLlama, and Qwen2.5.
-
CollabSim: A CSCW-Grounded Methodology for Investigating Collaborative Competence of LLM Agents through Controlled Multi-Agent Experiments
CollabSim is a new CSCW-grounded simulation framework that enables controlled multi-agent experiments to measure collaborative competence in LLM agents.
-
Representational Capacity: Geometric Limits on Feature Representation in Transformer Language Models
Defines representational capacity as the upper bound on distinguishable near-orthogonal directions in transformer latent spaces, derived from embedding similarity distributions and an adjusted Johnson-Lindenstrauss formula dependent on the k/d ratio.
-
Precision Tracked Transformer via Kalman Filtering, Kriging and Process Noise
Bayesian Filtering Transformer reframes attention as precision-weighted kriging and residual connections as Kalman updates, delivering gains on cold-start recommendation and noisy LLM fine-tuning tasks.
-
PASA: A Principled Embedding-Space Watermarking Approach for LLM-Generated Text under Semantic-Invariant Attacks
PASA is an embedding-space watermarking method for LLM text that uses semantic clusters and synchronized randomness to achieve robustness against paraphrasing while remaining distortion-free.
-
Tempus: A Temporally Scalable Resource-Invariant GEMM Streaming Framework for Versal AI Edge
Tempus delivers 607 GOPS at 10.677 W using fixed 16 AIE cores on Versal AI Edge, with 211.2x better platform-aware utility than spatial SOTA ARIES and zero URAM/DSP utilization.
-
BoostTaxo: Zero-Shot Taxonomy Induction via Boosting-Style Agentic Reasoning and Constraint-Aware Calibration
BoostTaxo introduces a boosting-style LLM framework for zero-shot taxonomy induction that uses hybrid candidate selection and constraint-aware calibration to achieve superior or comparable performance to prior methods on WordNet, DBLP, and SemEval-Sci benchmarks.
-
Test Case Selection for Deep Neural Networks: A Replication Study on LLMs for Code
Replication of TCS strategies on 17 LLM instances across three code tasks shows only partial generalization from vision DNN results, with uncertainty features aiding early failure discovery and representation features aiding accuracy estimation.
-
PRIME: Evaluating Prompt Resolution Under Incompatible Instructions in LLMs
PRIME is a new evaluation framework that creates calibrated conflicts in LLM prompts and finds conflict type affects model behavior more than scale.
-
Tracking Representation Dynamics in Large Language Models with Persistent Homology
Persistent homology analysis of LLM activations shows most topological reorganization occurs early in fine-tuning, with a transient peak followed by stabilization and distinct trajectories for different alignment objectives.
-
ARIADNE: Agnostic Routing for Inference-time Adapter DyNamic sElection
ARIADNE routes queries to the best adapter via embedding-space centroid proximity, recovering 97.44% of upper-bound performance on 23 NLP tasks and 89.7% selection accuracy on 44 tasks without training or internal access.
-
RegMix-D: Dynamic Data Mixing via Proxy Training Trajectories
RegMix-D fits regression models to proxy loss trajectories to produce dynamic data mixture schedules that outperform static RegMix and DoReMi on 25B-token Pile pretraining with a 1B model.
-
BLADE: Scalable Bi-level Adaptive Data Selection for LLM Training
BLADE converts influence-based bi-level data selection into a Hessian-free penalized objective with a dynamic reference model, proves first-order convergence, and reports better performance than prior methods on LLM training.
-
Explaining Data Mixing Scaling Laws
A framework using capacity competition and noise reduction under an overlapping-skills assumption explains multi-domain loss behaviors and extrapolates optimal mixtures to large scales from small-scale fits with fewer parameters.
-
LLM Compression with Jointly Optimizing Architectural and Quantization choices
A differentiable NAS framework jointly optimizes LLM architecture and mixed-precision quantization for linear layers, yielding up to 1.4x faster inference or 6% higher accuracy than sequential baselines on reasoning tasks.
-
MOC: Multi-Order Communication in LLM-based Multi-Agent Systems
MOC formalizes a multi-order evidence stream and Semantic-Topological Merging algorithm that improves task performance while cutting communication costs on six datasets.
-
Harmonic: Hierarchical State Space Models for Efficient Long-Context Language Modeling
Hierarchical SSM architecture Harmonic outperforms Transformers and Mamba on long-context language modeling up to 64K tokens and removes RoPE limits at 1B scale while maintaining O(L) compute.
-
Rethinking the Role of Temperature in Large Language Model Distillation
Including temperature scaling makes forward KL divergence outperform reverse KL in LLM distillation on instruction benchmarks, overturning the τ=1 preference for reverse KL.
-
De-attribute to Forget for LLM Unlearning
DareU reframes LLM unlearning as zeroing data attribution via RL rewards from an LLM classifier approximation, claiming better balance of forget quality and model utility than loss-based baselines.
-
Strong Teacher Not Needed? On Distillation in LLM Pretraining
Even small or undertrained teachers improve larger LLM students via distillation with tuned loss mixing, while stronger teachers can saturate or reverse gains and distillation aids generalization more than in-domain fit.
-
Beyond Temperature: Hyperfitting as a Late-Stage Geometric Expansion
Hyperfitting improves LLM generation via context-dependent rank reordering from geometric expansion in the terminal transformer block, distinct from temperature scaling, and enables efficient Late-Stage LoRA fine-tuning.
-
SNLP: Layer-Parallel Inference via Structured Newton Corrections
SNLP achieves up to 2.58x wall-clock speedup on 0.5B Transformers via architecture-specific Newton corrections (IDN/HCN) that enable layer-parallel inference while preserving perplexity in milder settings.
-
From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents
A dataset-agnostic framework converts text tool-calling benchmarks to paired audio evaluations via TTS, speaker variation and noise, then evaluates seven omni-modal models showing model- and task-dependent performance with small text-to-voice gaps.
-
SpeakerLLM: A Speaker-Specialized Audio-LLM for Speaker Understanding and Verification Reasoning
SpeakerLLM unifies speaker profiling, recording-condition understanding, and structured verification reasoning in an audio-LLM via a hierarchical tokenizer and decision traces.
-
Common-agency Games for Multi-Objective Test-Time Alignment
CAGE uses common-agency games and an EPEC algorithm to compute equilibrium policies that balance multiple conflicting objectives for test-time LLM alignment.
-
UniPool: A Globally Shared Expert Pool for Mixture-of-Experts
A shared global expert pool in MoE improves validation loss over per-layer experts and allows sublinear expert-parameter growth with depth.
-
Sentinel-VLA: A Metacognitive VLA Model with Active Status Monitoring for Dynamic Reasoning and Error Recovery
Sentinel-VLA adds metacognitive status monitoring to VLA models for on-demand reasoning and error recovery, reporting over 30% higher real-world task success than prior SOTA.
-
Beyond Fine-Tuning: In-Context Learning and Chain-of-Thought for Reasoned Distractor Generation
LLMs prompted with few-shot examples and rationales generate better reasoned distractors for MCQs than fine-tuned contrastive models across six benchmarks.
-
StoSignSGD: Unbiased Structural Stochasticity Fixes SignSGD for Training Large Language Models
StoSignSGD resolves SignSGD divergence on non-smooth objectives via structural stochasticity, matching optimal convex rates and improving non-convex bounds while delivering 1.44-2.14x speedups in FP8 LLM pretraining.
-
EdgeCIM: A Hardware-Software Co-Design for CIM-Based Acceleration of Small Language Models
A CIM-based hardware-software co-design in 65nm achieves up to 7.3x higher throughput and 49.59x better energy efficiency than NVIDIA Orin Nano for LLaMA3.2-1B, averaging 336 tokens/s and 173 tokens/J under INT4 across multiple SLMs.
-
A Physical Agentic Loop for Language-Guided Grasping with Execution-State Monitoring
A physical agentic loop with execution-state monitoring improves robustness of language-guided grasping over open-loop execution by converting noisy telemetry into discrete outcome events that trigger retries or user escalation.
-
Don't Ignore the Tail: Decoupling top-K Probabilities for Efficient Language Model Distillation
A modified divergence decouples top-K teacher probabilities from the distribution tail during distillation, yielding competitive performance on decoder models with standard compute.
-
Scaling Laws Meet Model Architecture: Toward Inference-Efficient LLMs
A conditional scaling law fitted on over 200 models from 80M to 3B parameters identifies architectures that deliver up to 2.1% higher accuracy and 42% higher inference throughput than LLaMA-3.2 under the same training budget.
-
Speculative Verification: Exploiting Information Gain to Refine Speculative Decoding
Speculative Verification adds a companion model that estimates draft-target alignment via information gain to dynamically set verification length, delivering up to 2x speedup over standard speculative decoding across tested models and batch sizes.
-
InvisibleInk: High-Utility and Low-Cost Text Generation with Differential Privacy
InvisibleInk achieves high-utility differentially private long-form LLM text generation at 4-8x the cost of non-private generation by isolating and clipping sensitive logits and sampling from a small superset of top-k private tokens without privacy cost.
-
EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty
EAGLE resolves feature-level uncertainty in speculative sampling via one-step token advancement, delivering 2.7x-3.5x speedup on LLaMA2-Chat 70B and doubled throughput across multiple model families and tasks.
-
Activation- and Influence-Aware Ranks (AIR): Function-Preserving SVD Compression for LLMs
AIR augments activation-aware SVD compression of LLMs with an influence metric and a closed-form ALS update, claiming >18% perplexity improvement at 60% parameter retention and 90% less calibration data than SVD-LLM(W).
-
The Fine-Tuning Trap: Evaluating Negative Transfer and the Role of PEFT in Sub-1B Mathematical Reasoning
Full fine-tuning causes negative transfer and performance collapse in sub-300M SLMs on math tasks, establishing PEFT as a stability requirement.
-
Spike-Aware C++ INT8 Inference for Sparse Spiking Language Models on Commodity CPUs
A spike-aware C++ INT8 runtime for sparse spiking LMs delivers 22.63 tokens/s single-thread on Ryzen 7, beating several Q8_0 dense models in llama.cpp while cutting weights from 3.49 GB to 1.06 GB, at the cost of higher perplexity.
-
MOSAIC: Efficient Mixture-of-Agent Scheduling via Adaptive Aggregation and Inference Concurrency
MOSAIC uses an Integer Linear Program scheduler for expert placement and prompt assignment plus adaptive aggregation to achieve 1.7-2.3x end-to-end speedup on 4-GPU MoA workloads while keeping accuracy within 0.1pp.
-
Gravity-Aware Hierarchical Routing for Lightweight SensorLLM on Human Activity Recognition
Introduces a lightweight gravity-aware routing head that improves macro-F1 on static classes in compressed SensorLLM for human activity recognition on the MHealth dataset.
-
Representation Collapse in Sequential Post-Training of Large Language Models
Sequential post-training of LLMs induces representation collapse that correlates with reduced plasticity, weaker generalization, and poorer calibration, with lightweight interventions tested to mitigate it.
-
Spectral structural distortion reveals redundant neurons in neural networks
A graph-spectral importance score based on layer-wise structural distortion between pre- and post-activation neuron graphs identifies removable neurons for iterative pruning without intermediate updates, followed by recovery fine-tuning.
-
DP-LAC: Lightweight Adaptive Clipping for Differentially Private Federated Fine-tuning of Language Models
DP-LAC provides a new adaptive clipping technique for DP-SGD in federated LLM fine-tuning that improves accuracy by 6.6% on average without consuming additional privacy budget or requiring new hyperparameters.
-
HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization
HeadQ applies score-space logit corrections for keys and attention-weighted surrogates for values to KV-cache quantization, removing 84-94% of excess perplexity in 2-bit key experiments across six models.
-
TabEmb: Joint Semantic-Structure Embedding for Table Annotation
TabEmb decouples LLM-based semantic column embeddings from graph-based structural modeling to produce joint representations that improve table annotation tasks.
-
Rethinking Data Curation in LLM Training: Online Reweighting Offers Better Generalization than Offline Methods
ADAPT is an online reweighting framework for LLM training that outperforms offline data selection and mixing methods in cross-benchmark generalization under equal compute.
-
Acceptance Dynamics Across Cognitive Domains in Speculative Decoding
Empirical measurements across four NLP domains show task type is a stronger predictor of speculative decoding acceptance than tree depth, with chat uniquely achieving expected accepted length over 1 token per step.