Pith: machine review for the scientific record

arxiv: 2302.13971 · v1 · submitted 2023-02-27 · 💻 cs.CL

Recognition: unknown

LLaMA: Open and Efficient Foundation Language Models

Authors on Pith: no claims yet
Classification: 💻 cs.CL
Keywords: models · datasets · foundation · language · llama · train · available · benchmarks
Original abstract

We introduce LLaMA, a collection of foundation language models ranging from 7B to 65B parameters. We train our models on trillions of tokens, and show that it is possible to train state-of-the-art models using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets. In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B. We release all our models to the research community.
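Since the weights are released to the research community, here is a minimal sketch of loading a LLaMA checkpoint for generation with the Hugging Face transformers library. The repository id is an assumption (the original weights were distributed on request, and community conversions circulate under various names), not something this page specifies.

```python
# Hedged sketch: load a converted LLaMA checkpoint and generate a few tokens.
# The repo id below is an assumed community conversion, not an official release.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "huggyllama/llama-7b"  # assumed checkpoint name

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

inputs = tokenizer("Foundation language models are", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```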

This paper has not been read by Pith yet.

Discussion (0)

Sign in with ORCID, Apple, or X to comment; anyone can read Pith papers without signing in.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work, sorted by Pith novelty score.

  1. Privacy Auditing with Zero (0) Training Run

    cs.CR 2026-05 unverdicted novelty 8.0

    Zero-Run auditing supplies valid lower bounds on differential privacy parameters from fixed member and non-member datasets by modeling and correcting distribution-shift confounding via causal-inference techniques.

  2. Effective Context in Transformers: An Analysis of Fragmentation and Tokenization

    cs.LG 2026-05 unverdicted novelty 8.0

    Fragmentation strictly raises optimal finite-context log-loss on Markov sources while tokenization can make a short token window equivalent to a longer source window under reliability and compression conditions.

  3. Grid Games: The Power of Multiple Grids for Quantizing Large Language Models

    cs.LG 2026-05 accept novelty 8.0

    Allowing each quantization group to select among multiple 4-bit grids improves accuracy over single-grid FP4 for both post-training and pre-training of LLMs.

  4. Steering Without Breaking: Mechanistically Informed Interventions for Discrete Diffusion Language Models

    cs.LG 2026-05 unverdicted novelty 8.0

    Adaptive scheduling of interventions in discrete diffusion language models, timed to attribute-specific commitment schedules discovered with sparse autoencoders, delivers precise multi-attribute steering up to 93% str...

  5. When and Why SignSGD Outperforms SGD: A Theoretical Study Based on $\ell_1$-norm Lower Bounds

    cs.LG 2026-05 unverdicted novelty 8.0

SignSGD provably beats SGD by a factor of d under sparse noise via matched ℓ1-norm upper and lower bounds, with an equivalent result for Muon on matrices, and this predicts faster GPT-2 pretraining (a minimal update-rule sketch appears after this list).

  6. Backdoor Attacks on Decentralised Post-Training

    cs.CR 2026-03 conditional novelty 8.0

    An adversary controlling an intermediate pipeline stage in decentralized LLM post-training can inject a backdoor that reduces alignment from 80% to 6%, with the backdoor persisting in 60% of cases even after subsequen...

  7. AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents

    cs.CR 2024-06 unverdicted novelty 8.0

    AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.

  8. Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    cs.LG 2023-12 unverdicted novelty 8.0

    Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.

  9. MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

    cs.CL 2023-11 unverdicted novelty 8.0

    MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.

  10. Tree of Thoughts: Deliberate Problem Solving with Large Language Models

    cs.CL 2023-05 accept novelty 8.0

Tree of Thoughts enables language models to solve complex planning tasks by generating, evaluating, and searching over coherent intermediate thoughts in a tree, raising Game of 24 success from 4% to 74% with GPT-4 (a minimal search-loop sketch appears after this list).

  11. Instruction Tuning with GPT-4

    cs.CL 2023-04 unverdicted novelty 8.0

    GPT-4-generated instruction data produces superior zero-shot performance in finetuned LLaMA models versus prior state-of-the-art data.

  12. MetaBackdoor: Exploiting Positional Encoding as a Backdoor Attack Surface in LLMs

    cs.CR 2026-05 unverdicted novelty 7.0

    MetaBackdoor shows that LLMs can be backdoored using positional triggers like sequence length, enabling stealthy activation on clean inputs to leak system prompts or trigger malicious behavior.

  13. Mitigating Mask Prior Drift and Positional Attention Collapse in Large Diffusion Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 7.0

    Mask prior drift and positional attention collapse cause failures in LDVLMs for long generations, fixed by training-free Mask Prior Suppression and Monotonic RoPE Scaling.

  14. Enjoy Your Layer Normalization with the Computational Efficiency of RMSNorm

    cs.LG 2026-05 conditional novelty 7.0

    A framework to identify and convert foldable layer normalizations to RMSNorm for exact equivalence and faster inference in deep neural networks.

  15. SurF: A Generative Model for Multivariate Irregular Time Series Forecasting

    cs.LG 2026-05 unverdicted novelty 7.0

    SurF applies the Time Rescaling Theorem as a learnable bijection to create a single generative model for forecasting irregular multivariate event streams that outperforms or matches baselines on six benchmarks.

  16. EvoGround: Self-Evolving Video Agents for Video Temporal Grounding

    cs.CV 2026-05 unverdicted novelty 7.0

    A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.

  17. BlockVLA: Accelerating Autoregressive VLA via Block Diffusion Finetuning

    cs.RO 2026-05 unverdicted novelty 7.0

    BlockVLA accelerates autoregressive VLA models by 3.3x using block diffusion finetuning, with faster training convergence and better early performance on long-horizon robotic tasks.

  18. IndicMedDialog: A Parallel Multi-Turn Medical Dialogue Dataset for Accessible Healthcare in Indic Languages

    cs.CL 2026-05 unverdicted novelty 7.0

    A parallel multi-turn medical dialogue dataset spanning English and nine Indic languages is created from synthetic consultations to enable personalized AI healthcare interactions.

  19. CLIP Tricks You: Training-free Token Pruning for Efficient Pixel Grounding in Large Vision-Language Models

    cs.CV 2026-05 conditional novelty 7.0

    LiteLVLM prunes visual tokens for pixel grounding by reversing CLIP visual-text similarity to retain referent region tokens, outperforming prior methods by over 5% with 22% speedup and 2.3x memory reduction without an...

  20. The Expressivity Boundary of Probabilistic Circuits: A Comparison with Large Language Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Probabilistic circuits have an output bottleneck with convex probability combinations and a context bottleneck limited to fixed vtree-aligned partitions, making them less expressive than transformers for language data...

  21. MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving

    cs.RO 2026-05 unverdicted novelty 7.0

    MindVLA-U1 introduces a unified streaming VLA with shared backbone, framewise memory, and language-guided action diffusion that surpasses human drivers on WOD-E2E planning metrics.

  22. Towards Automated Air Traffic Safety Assessment Around Non-Towered Airports Using Large Language Models

    cs.AI 2026-05 unverdicted novelty 7.0

    Large language models achieve macro F1 scores above 0.85 on binary nominal-versus-danger classification from CTAF radio transcripts and METAR weather data using a new synthetic dataset with a 12-category hazard taxonomy.

  23. GKnow: Measuring the Entanglement of Gender Bias and Factual Gender

    cs.CL 2026-05 unverdicted novelty 7.0

    Gender bias and factual gender knowledge are severely entangled in language model circuits and neurons, making neuron ablation an unreliable method for debiasing.

  24. From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation

    cs.RO 2026-05 unverdicted novelty 7.0

    MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.

  25. Efficient and Adaptive Human Activity Recognition via LLM Backbones

    cs.LG 2026-05 unverdicted novelty 7.0

    Pretrained LLMs adapted via convolutional projections and LoRA act as efficient frozen backbones for sensor-based human activity recognition, delivering strong data efficiency and cross-dataset transfer.

  26. DistractMIA: Black-Box Membership Inference on Vision-Language Models via Semantic Distraction

    cs.CV 2026-05 unverdicted novelty 7.0

    DistractMIA performs output-only black-box membership inference on vision-language models by inserting semantic distractors and measuring shifts in generated text responses.

  27. StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning

    cs.SE 2026-05 unverdicted novelty 7.0

    StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.

  28. Variance-aware Reward Modeling with Anchor Guidance

    stat.ML 2026-05 unverdicted novelty 7.0

    Anchor-guided variance-aware reward modeling uses two response-level anchors to resolve non-identifiability in Gaussian models of pluralistic preferences, yielding provable identification, a joint training objective, ...

  29. Learning Less Is More: Premature Upper-Layer Attention Specialization Hurts Language Model Pretraining

    cs.CL 2026-05 unverdicted novelty 7.0

    Temporarily reducing the learning rate on upper-layer query and key projections during early GPT pretraining prevents premature attention specialization and improves model performance.

  30. V-ABS: Action-Observer Driven Beam Search for Dynamic Visual Reasoning

    cs.CV 2026-05 unverdicted novelty 7.0

    V-ABS is an action-observer beam search method with entropy-based adaptive weighting and an 80k-sample SFT dataset that delivers 19.7% average gains on visual reasoning tasks for MLLMs.

  31. OZ-TAL: Online Zero-Shot Temporal Action Localization

    cs.CV 2026-05 unverdicted novelty 7.0

Defines the OZ-TAL task and presents a training-free VLM-based method that outperforms prior approaches for online and offline zero-shot temporal action localization on THUMOS14 and ActivityNet-1.3.

  32. The Wittgensteinian Representation Hypothesis: Is Language the Attractor of Multimodal Convergence?

    cs.AI 2026-05 unverdicted novelty 7.0

    Language representations serve as the asymptotic attractor for convergence in independently trained multimodal neural networks due to feature density asymmetry.

  33. BoostAPR: Boosting Automated Program Repair via Execution-Grounded Reinforcement Learning with Dual Reward Models

    cs.AI 2026-05 unverdicted novelty 7.0

    BoostAPR improves automated program repair by using execution-grounded RL with a sequence-level assessor and line-level credit allocator, reaching 40.7% on SWE-bench Verified and strong cross-language results.

  34. VORT: Adaptive Power-Law Memory for NLP Transformers

    cs.LG 2026-05 unverdicted novelty 7.0

    VORT assigns learnable fractional orders to tokens and approximates their power-law retention kernels via sum-of-exponentials for efficient long-range dependency modeling in transformers.

  35. Transformers Can Implement Preconditioned Richardson Iteration for In-Context Gaussian Kernel Regression

    cs.LG 2026-05 unverdicted novelty 7.0

    Standard softmax-attention transformers can approximate the Gaussian kernel ridge regression predictor by implementing preconditioned Richardson iteration during their forward pass.

  36. Scalable Distributed Stochastic Optimization via Bidirectional Compression: Beyond Pessimistic Limits

    math.OC 2026-05 unverdicted novelty 7.0

    Inkheart SGD and M4 use bidirectional compression to achieve time complexities in distributed SGD that improve with worker count n and surpass prior lower bounds under a necessary structural assumption.

  37. Chain-based Distillation for Effective Initialization of Variable-Sized Small Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    Chain-based Distillation constructs a sequence of anchor models to enable efficient initialization of variable-sized SLMs through interpolation, with bridge distillation for cross-architecture transfer, yielding bette...

  38. GPO-V: Jailbreak Diffusion Vision Language Model by Global Probability Optimization

    cs.CV 2026-05 unverdicted novelty 7.0

    GPO-V is a visual jailbreak framework that bypasses safety guardrails in diffusion VLMs by globally manipulating generative probabilities during denoising.

  39. GPO-V: Jailbreak Diffusion Vision Language Model by Global Probability Optimization

    cs.CV 2026-05 unverdicted novelty 7.0

    GPO-V jailbreaks dVLMs by globally optimizing probabilities in the denoising process to bypass refusal patterns, achieving stealthy and transferable attacks.

  40. Understanding Performance Collapse in Layer-Pruned Large Language Models via Decision Representation Transitions

    cs.CL 2026-05 unverdicted novelty 7.0

    Performance collapse in layer-pruned LLMs stems from disrupting the Silent Phase of decision-making, which blocks the transition to correct predictions, while the later Decisive Phase is robust to pruning.

  41. Beyond LoRA vs. Full Fine-Tuning: Gradient-Guided Optimizer Routing for LLM Adaptation

    cs.CL 2026-05 unverdicted novelty 7.0

    MoLF routes updates between full fine-tuning and LoRA at the optimizer level to match or exceed the better of either static method, with an efficient LoRA-only variant outperforming prior adaptive approaches.

  42. Rollback-Free Stable Brick Structures Generation

    cs.LG 2026-05 unverdicted novelty 7.0

    Reinforcement learning internalizes physical stability rules for brick structures, enabling the first rollback-free generation with orders-of-magnitude faster inference.

  43. MIST: Multimodal Interactive Speech-based Tool-calling Conversational Assistants for Smart Homes

    cs.CL 2026-05 unverdicted novelty 7.0

    MIST is a new synthetic speech-based tool-calling dataset for IoT devices that exposes performance gaps between open- and closed-weight multimodal LLMs.

  44. Muon with Nesterov Momentum: Heavy-Tailed Noise and (Randomized) Inexact Polar Decomposition

    math.OC 2026-05 unverdicted novelty 7.0

    Muon with Nesterov momentum and inexact polar decomposition achieves optimal convergence rates of O(ε^(-(3α-2)/(α-1))) under heavy-tailed noise for ε-stationary points in non-convex settings.

  45. Distributional Process Reward Models: Calibrated Prediction of Future Rewards via Conditional Optimal Transport

    cs.LG 2026-05 unverdicted novelty 7.0

    Conditional optimal transport calibrates PRMs by learning monotonic conditional quantile functions over success probabilities conditioned on hidden states, yielding improved calibration and downstream Best-of-N perfor...

  46. VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing

    cs.CL 2026-05 unverdicted novelty 7.0

    VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conve...

  47. Transformers Efficiently Perform In-Context Logistic Regression via Normalized Gradient Descent

    cs.LG 2026-05 conditional novelty 7.0

    Multi-layer transformers can implement in-context logistic regression by performing normalized gradient descent steps layer by layer, obtained via supervised training of a single attention layer followed by recurrent ...

  48. When Graph Language Models Go Beyond Memorization

    cs.LG 2026-05 conditional novelty 7.0

    Large-scale graph language models acquire structural regularities beyond memorization, with subgraph rank correlations persisting after bootstrap and novel-subset controls, especially for high-frequency patterns.

  49. Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost

    cs.AI 2026-05 conditional novelty 7.0

    Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.

  50. CrossCult-KIBench: A Benchmark for Cross-Cultural Knowledge Insertion in MLLMs

    cs.AI 2026-05 unverdicted novelty 7.0

    CrossCult-KIBench is a new benchmark for evaluating cross-cultural knowledge insertion in MLLMs, paired with the MCKI baseline method, showing current approaches fail to balance adaptation and preservation.

  51. CrossCult-KIBench: A Benchmark for Cross-Cultural Knowledge Insertion in MLLMs

    cs.AI 2026-05 unverdicted novelty 7.0

    CrossCult-KIBench provides 9,800 test cases for cross-cultural knowledge insertion in MLLMs and shows that existing methods cannot reliably adapt to one culture while preserving behavior in others.

  52. OmicsLM: A Multimodal Large Language Model for Multi-Sample Omics Reasoning

    q-bio.GN 2026-05 unverdicted novelty 7.0

    OmicsLM integrates continuous omics embeddings into LLMs for multi-sample biological reasoning, matching specialized models on profile tasks while outperforming them and general LLMs on language-guided QA over real ex...

  53. PragLocker: Protecting Agent Intellectual Property in Untrusted Deployments via Non-Portable Prompts

    cs.CR 2026-05 unverdicted novelty 7.0

    PragLocker protects agent prompts as IP by building non-portable obfuscated versions that function only on the intended LLM through code-symbol semantic anchoring followed by target-model feedback noise injection.

  54. CITE: Anytime-Valid Statistical Inference in LLM Self-Consistency

    stat.ML 2026-05 unverdicted novelty 7.0

    CITE certifies that a prespecified answer is the unique mode of an LLM response distribution with anytime-valid error control under arbitrary data-driven stopping and without prior knowledge of the answer set.

  55. Adaptive Selection of LoRA Components in Privacy-Preserving Federated Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    AS-LoRA adaptively chooses which LoRA factor to update per layer and round using a curvature-aware second-order score, eliminating reconstruction error floors and improving performance in DP federated learning.

  56. Privacy Without Losing Place: A Paradigm for Private Retrieval in Spatial RAGs

    cs.CR 2026-05 unverdicted novelty 7.0

    PAS encodes locations via relative anchors and bins to deliver roughly 370-400m adversarial error in spatial RAG while retaining over half the baseline retrieval performance and keeping generation quality robust.

  57. Delta-Based Neural Architecture Search: LLM Fine-Tuning via Code Diffs

    cs.LG 2026-05 unverdicted novelty 7.0

    Fine-tuned 7B LLMs generating unified diffs for neural architecture refinement achieve 66-75% valid rates and 64-66% mean first-epoch accuracy, outperforming full-generation baselines by large margins while cutting ou...

  58. Sparse Tokens Suffice: Jailbreaking Audio Language Models via Token-Aware Gradient Optimization

    cs.CR 2026-05 conditional novelty 7.0

    Sparse selection of high-gradient-energy audio tokens suffices for effective jailbreaking of audio language models with minimal drop in attack success rate.

  59. NoisyCausal: A Benchmark for Evaluating Causal Reasoning Under Structured Noise

    cs.CL 2026-05 unverdicted novelty 7.0

The NoisyCausal benchmark tests LLMs on causal reasoning under structured noise, and a modular LLM-plus-causal-graph framework outperforms baselines while generalizing to Cladder.

  60. Echo-LoRA: Parameter-Efficient Fine-Tuning via Cross-Layer Representation Injection

    cs.LG 2026-05 unverdicted novelty 7.0

    Echo-LoRA raises average performance on eight commonsense reasoning benchmarks by 3.0 to 5.7 points over standard LoRA by using a training-only cross-layer echo representation that is discarded after training.
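For readers skimming citation #5, here is a minimal sketch of the SignSGD update it analyzes, next to vanilla SGD for contrast; the names and structure are illustrative, not taken from the cited paper.

```python
# Hedged sketch: SignSGD vs. vanilla SGD parameter updates (citation #5).
import torch

def signsgd_step(params, lr=1e-3):
    """SignSGD: w <- w - lr * sign(grad); the step ignores gradient magnitude."""
    with torch.no_grad():
        for p in params:
            if p.grad is not None:
                p.add_(torch.sign(p.grad), alpha=-lr)

def sgd_step(params, lr=1e-3):
    """Vanilla SGD for comparison: w <- w - lr * grad."""
    with torch.no_grad():
        for p in params:
            if p.grad is not None:
                p.add_(p.grad, alpha=-lr)
```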
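And for citation #10, a minimal breadth-first sketch of the Tree of Thoughts search loop: propose candidate thoughts, score them, and keep the best few at each depth. Here `propose` and `score` stand in for LLM calls and are hypothetical placeholders, not the authors' implementation.

```python
# Hedged sketch: breadth-first Tree of Thoughts search (citation #10).
from typing import Callable, List

def tree_of_thoughts(
    root: str,
    propose: Callable[[str], List[str]],  # thought -> candidate next thoughts
    score: Callable[[str], float],        # thought -> heuristic value
    depth: int = 3,
    beam: int = 5,
) -> str:
    frontier = [root]
    for _ in range(depth):
        # Expand every frontier thought, then keep only the top `beam` candidates.
        candidates = [t for thought in frontier for t in propose(thought)]
        if not candidates:
            break
        frontier = sorted(candidates, key=score, reverse=True)[:beam]
    # Return the highest-scoring thought reached within the depth budget.
    return max(frontier, key=score)
```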