Zero-Run auditing supplies valid lower bounds on differential privacy parameters from fixed member and non-member datasets by modeling and correcting distribution-shift confounding via causal-inference techniques.
mega hub Canonical reference
LLaMA: Open and Efficient Foundation Language Models
Canonical reference. 82% of citing Pith papers cite this work as background.
abstract
We introduce LLaMA, a collection of foundation language models ranging from 7B to 65B parameters. We train our models on trillions of tokens, and show that it is possible to train state-of-the-art models using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets. In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B. We release all our models to the research community.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract We introduce LLaMA, a collection of foundation language models ranging from 7B to 65B parameters. We train our models on trillions of tokens, and show that it is possible to train state-of-the-art models using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets. In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B. We release all our models to the research community.
mega hub controls
Recognition alignment
counterfactual ablation
co-cited works
representative citing papers
Fragmentation strictly raises optimal finite-context log-loss on Markov sources while tokenization can make a short token window equivalent to a longer source window under reliability and compression conditions.
Allowing each quantization group to select among multiple 4-bit grids improves accuracy over single-grid FP4 for both post-training and pre-training of LLMs.
Adaptive scheduling of interventions in discrete diffusion language models, timed to attribute-specific commitment schedules discovered with sparse autoencoders, delivers precise multi-attribute steering up to 93% strength while preserving generation quality.
SignSGD provably beats SGD by a factor of d under sparse noise via matched ℓ1-norm upper and lower bounds, with an equivalent result for Muon on matrices, and this predicts faster GPT-2 pretraining.
An adversary controlling an intermediate pipeline stage in decentralized LLM post-training can inject a backdoor that reduces alignment from 80% to 6%, with the backdoor persisting in 60% of cases even after subsequent safety training.
First study of 1,899 MCP servers finds eight distinct vulnerabilities (only three traditional), 7.2% with general issues, 5.5% with tool poisoning, and 66% with code smells, urging MCP-specific security practices.
BEAVER is the first text-to-SQL benchmark from private enterprise data warehouses, revealing SOTA agentic frameworks achieve only 10.8% accuracy on complex real-world queries.
MME-RealWorld is the largest manually annotated high-resolution benchmark for MLLMs, where even the best models achieve less than 60% accuracy on challenging real-world tasks.
AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.
AgentClinic is a multimodal agent benchmark demonstrating that LLM diagnostic accuracy on MedQA drops to below one-tenth in sequential clinical simulations, with Claude-3.5 leading and large tool-use differences across models.
ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.
BLaIR is a new benchmark and 570M-review dataset showing that LLM performance rankings on recommendation tasks have little correlation with rankings on general embedding benchmarks like MTEB.
Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.
MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.
Tree of Thoughts enables language models to solve complex planning tasks by generating, evaluating, and searching over coherent intermediate thoughts in a tree, raising Game of 24 success from 4% to 74% with GPT-4.
API-Bank is a new benchmark and training dataset for tool-augmented LLMs that shows fine-tuned models can approach GPT-3.5 tool-use effectiveness.
GPT-4-generated instruction data produces superior zero-shot performance in finetuned LLaMA models versus prior state-of-the-art data.
A new sensitivity-labeled test collection is released from Enron emails with crowdsourced queries, relevance judgments, and LLM extensions for evaluating sensitivity-aware search.
SPARE reformulates visual token pruning as column subset selection to minimize reconstruction error and uses anti-relevance for context-aware selection in VLMs.
APEX4 co-designs pure INT4 GEMM kernels with ρ-aware granularity adaptation to deliver up to 2.09× end-to-end speedup on GPUs with low ρ while keeping LLaMA-2-70B perplexity within 0.63 of FP16.
Orli is an autoregressive image-to-sequence model that jointly detects text lines and determines their reading order on historical documents via chord-frame baselines, trained on 196k pages across ten scripts.
Defines cost-aware RAG with evidence cost tiers and shows static selectors are brittle while agentic LLM-based selection is promising but model-dependent.
RWGBench is a citation-centric benchmark for related work generation built from 40k CS papers and a 100-paper test set, with multi-dimensional metrics that better match human expert judgment than standard similarity scores.
citing papers explorer
-
A Communication-Theoretic Framework for LLM Agents: Cost-Aware Adaptive Reliability
LLM reliability techniques are unified as communication channel operators, with a new cost-aware router achieving superior quality-cost tradeoffs on hard tasks.
-
MegaScale-Omni: A Hyper-Scale, Workload-Resilient System for MultiModal LLM Training in Production
MegaScale-Omni delivers 1.27x-7.57x higher throughput for dynamic multimodal LLM training by decoupling encoder and LLM parallelism, using unified colocation, and applying adaptive workload balancing.
-
PRISM: Fast Online LLM Serving via Scheduling-Memory Co-design
PRISM reduces P99 TTFT by 23.3-37.1% and raises exact-prefix KV-cache hit rates by 5.9-12.2 points versus the strongest baseline on 4B and 13B models by jointly optimizing scheduling and memory.
-
Different Prompts, Different Ranks: Prompt-aware Dynamic Rank Selection for SVD-based LLM Compression
PARSE trains a prompt-aware linear router on dense-model outputs to select dynamic SVD ranks, improving accuracy up to 10% at 0.6 compression ratio on LLaMA-7B while delivering 2.5x prefill and 2.4x decode speedups.
-
Beyond Pairs: Your Language Model is Secretly Optimizing a Preference Graph
GraphDPO generalizes pairwise DPO to a graph-structured Plackett-Luce objective over DAGs induced by rollout rankings, enforcing transitivity with linear complexity and recovering DPO as a special case.
-
Bayesian Fine-tuning in Projected Subspaces
Bayesian fine-tuning of large models can be done efficiently by projecting uncertainties into low-dimensional subspaces, yielding improved calibration and generalization while keeping computational costs low.
-
LARAG: Link-Aware Retrieval Strategy for RAG Systems in Hyperlinked Technical Documentation
LARAG improves RAG answer quality on hyperlinked technical documentation by using author-defined links for retrieval, achieving higher BERTScore while using fewer chunks and tokens than standard embedding-based RAG.
-
ForgeVLA: Federated Vision-Language-Action Learning without Language Annotations
ForgeVLA enables federated VLA model training from unlabeled vision-action pairs by recovering language via embodied classifiers and using contrastive planning plus adaptive aggregation to avoid feature collapse.
-
Common-agency Games for Multi-Objective Test-Time Alignment
CAGE uses common-agency games and an EPEC algorithm to compute equilibrium policies that balance multiple conflicting objectives for test-time LLM alignment.
-
CASCADE: Context-Aware Relaxation for Speculative Image Decoding
CASCADE formalizes semantic interchangeability and convergence in target model representations to enable context-aware acceptance relaxation in tree-based speculative decoding, delivering up to 3.6x speedup on text-to-image models without quality loss.
-
An Embarrassingly Simple Graph Heuristic Reveals Shortcut-Solvable Benchmarks for Sequential Recommendation
A simple graph heuristic without training or sequence encoders matches or outperforms trained generative recommenders on 10 of 14 sequential recommendation benchmarks by exploiting local transition and feature shortcuts.
-
Beyond LoRA vs. Full Fine-Tuning: Gradient-Guided Optimizer Routing for LLM Adaptation
MoLF routes updates between full fine-tuning and LoRA at the optimizer level to match or exceed the better of the two static methods on SQL, medical QA, and counterfactual tasks while an efficient variant outperforms prior adaptive LoRA by up to 20%.
-
Response Time Enhances Alignment with Heterogeneous Preferences
Response times modeled as drift-diffusion processes enable consistent estimation of population-average preferences from heterogeneous anonymous binary choices.
-
Can LLMs Predict Polymer Physics Just by Reading Synthesis and Processing Prose?
PolyLM fine-tunes a 9B-parameter LLM on 185k papers to predict polymer properties from text alone, achieving median R² of 0.74 on 68k held-out samples.
-
Distributional Process Reward Models: Calibrated Prediction of Future Rewards via Conditional Optimal Transport
Conditional optimal transport is used to turn raw PRM outputs into monotonic quantile functions that improve calibration and downstream Best-of-N performance on MATH-500 and AIME.
-
UniPool: A Globally Shared Expert Pool for Mixture-of-Experts
A shared global expert pool in MoE improves validation loss over per-layer experts and allows sublinear expert-parameter growth with depth.
-
Continuous Latent Diffusion Language Model
Cola DLM proposes a hierarchical latent diffusion model that learns a text-to-latent mapping, fits a global semantic prior in continuous space with a block-causal DiT, and performs conditional decoding, establishing latent prior modeling as an alternative to token-level autoregressive language model
-
Pro-KLShampoo: Projected KL-Shampoo with Whitening Recovered by Orthogonalization
Pro-KLShampoo projects KL-Shampoo preconditioners to a spike-and-flat parametric form on an r-dimensional subspace and recovers the full algebraic preconditioner via orthogonalization, outperforming KL-Shampoo on GPT-2 and LLaMA pre-training scales.
-
UniPrefill: Universal Long-Context Prefill Acceleration via Block-wise Dynamic Sparsification
UniPrefill accelerates LLM prefill via block-wise dynamic sparsification, achieving up to 2.1x TTFT speedup while supporting hybrid architectures and native vLLM continuous batching.
-
Federation of Experts: Communication Efficient Distributed Inference for Large Language Models
FoE restructures MoE blocks into per-KV-head clusters with sum-based synchronization, removing all-to-all communication in single-node settings and limiting it to intra-node in multi-node settings for up to 5.2x faster inference with comparable quality.
-
More Aligned, Less Diverse? Analyzing the Grammar and Lexicon of Two Generations of LLMs
Newer LLMs exhibit reduced syntactic and lexical diversity in English news text generation compared to older models, as measured by HPSG grammar and diversity metrics from ecology and information theory, while human-authored text shows little change.
-
PragLocker: Protecting Agent Intellectual Property in Untrusted Deployments via Non-Portable Prompts
PragLocker generates function-preserving but non-portable prompts for LLM agents via code-symbol semantic anchoring followed by target-model feedback noise injection.
-
Plug-and-Play Label Map Diffusion for Universal Goal-Oriented Navigation
PLMD applies a denoising diffusion model to predict labels for unknown map regions, allowing goal localization in unexplored environments by substituting completed labels into existing navigation pipelines.
-
Revealing Modular Gradient Noise Imbalance in LLMs: Calibrating Adam via Signal-to-Noise Ratio
MoLS scales Adam updates using module-level SNR estimates to correct gradient noise imbalance and improve LLM training convergence and generalization.
-
You Snooze, You Lose: Automatic Safety Alignment Restoration through Neural Weight Translation
NeWTral is a non-linear weight translation framework using MoE routing that reduces average attack success rate from 70% to 13% on unsafe domain adapters across Llama, Mistral, Qwen, and Gemma models up to 72B while retaining 90% knowledge fidelity.
-
On the (In-)Security of the Shuffling Defense in the Transformer Secure Inference
An attack aligns differently shuffled intermediate activations from secure Transformer inference queries to recover model weights with low error using roughly one dollar of queries.
-
Tree-based Credit Assignment for Multi-Agent Memory System
TreeMem assigns credit to agents in multi-agent memory systems by expanding outputs into a tree and using Monte Carlo averaging of final rewards to optimize each agent's policy.
-
OSAQ: Outlier Self-Absorption for Accurate Low-bit LLM Quantization
OSAQ suppresses weight outliers in LLMs via a closed-form additive transformation from the Hessian's stable null space, improving 2-bit quantization perplexity by over 40% versus vanilla GPTQ with no inference overhead.
-
Budget-aware Auto Optimizer Configurator
BAOC samples gradient streams to compute per-block risk metrics for cheap optimizer configs then solves a constrained optimization to minimize total risk under memory and time budgets while preserving training quality.
-
A Queueing-Theoretic Framework for Stability Analysis of LLM Inference with KV Cache Memory Constraints
A queueing model derives stability conditions for LLM inference services under combined compute and KV cache memory limits, with experimental validation showing typical deviations under 10%.
-
FINER-SQL: Boosting Small Language Models for Text-to-SQL
FINER-SQL boosts 3B-parameter small language models to 67.73% and 85% execution accuracy on BIRD and Spider benchmarks via dense memory and atomic rewards in group relative policy optimization, matching larger LLMs at lower latency.
-
Revisiting JBShield: Breaking and Rebuilding Representation-Level Jailbreak Defenses
JBShield is vulnerable to adaptive JB-GCG attacks (up to 53% ASR) because jailbreak representations occupy a distinct region in refusal-direction space; the new RTV defense using Mahalanobis detection on multi-layer fingerprints reaches 0.99 AUROC and limits adaptive ASR to 7%.
-
Gradient-Gated DPO: Stabilizing Preference Optimization in Language Models
Gate-DPO attenuates gradients on low-probability rejected responses to reduce probability collapse and improve chosen-response likelihood during preference optimization.
-
Anon: Extrapolating Adaptivity Beyond SGD and Adam
Anon optimizer uses tunable adaptivity and incremental delay update to achieve convergence guarantees and outperform existing methods on image classification, diffusion, and language modeling tasks.
-
How Compliant Are GitHub Actions Workflows? A Checklist-Based Study with LLM-Assisted Auditing
GitHub Actions workflows achieve only 28% overall compliance with best practices, with LLMs enabling an 81% reduction in verification effort via hybrid adjudication but still requiring expert oversight for security judgments.
-
RefusalGuard: Geometry-Preserving Fine-Tuning for Safety in LLMs
RefusalGuard constrains updates in hidden representation space to preserve safety-relevant geometric structure during fine-tuning, maintaining low attack success rates on safety benchmarks while preserving task performance.
-
Chart-FR1: Visual Focus-Driven Fine-Grained Reasoning on Dense Charts
Chart-FR1 uses Focus-CoT for linking reasoning to visual cues and Focus-GRPO reinforcement learning with efficiency rewards to outperform prior MLLMs on dense chart reasoning tasks.
-
Compared to What? Baselines and Metrics for Counterfactual Prompting
Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistical comparison.
-
Is Textual Similarity Invariant under Machine Translation? Evidence Based on the Political Manifesto Corpus
Machine translation preserves embedding similarity structure for ten languages but distorts it for four in the Manifesto Corpus, via a new non-inferiority testing framework.
-
Diversity in Large Language Models under Supervised Fine-Tuning
TOFU loss mitigates the narrowing of generative diversity in LLMs after supervised fine-tuning by addressing neglect of low-frequency patterns and forgetting of prior knowledge.
-
Iterative Definition Refinement for Zero-Shot Classification via LLM-Based Semantic Prototype Optimization
Iterative LLM-based refinement of category definitions improves zero-shot classification performance across 13 embedding models on a new 10-category web URL benchmark.
-
METASYMBO: Multi-Agent Language-Guided Metamaterial Discovery via Symbolic Latent Evolution
MetaSymbO proposes a three-agent framework with symbolic latent evolution that improves structural validity and language alignment for metamaterial design from free-form text intents.
-
Decoupling the Benefits of Subword Tokenization for Language Model Training via Byte-level Simulation
Byte-level simulations show subword tokenization improves LLM training mainly via increased throughput and boundary priors.
-
Multiple Consistent 2D-3D Mappings for Robust Zero-Shot 3D Visual Grounding
MCM-VG achieves state-of-the-art zero-shot 3D visual grounding on ScanRefer and Nr3D by creating consistent 2D-3D mappings across semantic, geometric, and viewpoint dimensions using LLMs and VLMs.
-
TimeMM: Time-as-Operator Spectral Filtering for Dynamic Multimodal Recommendation
TimeMM proposes a time-as-operator spectral filtering framework with adaptive mixing and modality routing to model non-stationary multimodal user preferences in recommendation systems.
-
Structural Generalization on SLOG without Hand-Written Rules
A neural cellular automaton learns compositional rules from data alone to achieve structural generalization on the SLOG semantic parsing benchmark, reaching 67.3% accuracy and fully succeeding on 11 of 17 categories.
-
A Survey on LLM-based Conversational User Simulation
A survey that introduces a taxonomy for LLM-based conversational user simulation, analyzes core techniques and evaluation methods, and identifies open challenges in the field.
-
ViPO: Visual Preference Optimization at Scale
Poly-DPO improves robustness to noisy preference data in visual models, and the new ViPO dataset enables superior performance, with the method reducing to standard DPO on high-quality data.
-
BaLoRA: Bayesian Low-Rank Adaptation of Large Scale Models
BaLoRA is a Bayesian LoRA variant with input-adaptive noise that improves accuracy over standard LoRA and supplies well-calibrated uncertainty estimates on language, vision, and scientific prediction tasks.
-
X2SAM: Any Segmentation in Images and Videos
X2SAM unifies any-segmentation across images and videos in one MLLM by adding a Mask Memory module for temporal consistency and joint training on mixed datasets.