LLaMA: Open and Efficient Foundation Language Models

2023 · cs.CL · arXiv 2302.13971

Canonical reference. 82% of citing Pith papers cite this work as background.

1089 Pith papers citing it

Background 82% of classified citations

abstract

We introduce LLaMA, a collection of foundation language models ranging from 7B to 65B parameters. We train our models on trillions of tokens, and show that it is possible to train state-of-the-art models using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets. In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B. We release all our models to the research community.

citation-role summary

background 206 · method 19 · baseline 8 · other 6 · dataset 1 · extension 1

citation-polarity summary

background 198 · use method 20 · unclear 13 · baseline 7 · extend 1 · support 1 · use dataset 1
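
The page does not say how the headline "82% background" figure is computed. A minimal Python sketch, assuming it is simply the background count divided by the sum of all listed polarity counts (the dictionary and the `share_percent` helper are illustrative names, not part of Pith):

```python
# Counts copied from the citation-polarity summary above.
polarity_counts = {
    "background": 198,
    "use method": 20,
    "unclear": 13,
    "baseline": 7,
    "extend": 1,
    "support": 1,
    "use dataset": 1,
}


def share_percent(counts, label):
    """Return the percentage of all classified citations carrying `label`."""
    total = sum(counts.values())
    return 100.0 * counts[label] / total


# Assumption: "classified citations" means every citation listed in the summary.
total_classified = sum(polarity_counts.values())
background_share = share_percent(polarity_counts, "background")

print(total_classified, round(background_share, 1))
```

Under that assumption the polarity counts reproduce the quoted 82%, whereas the same calculation on the role counts (206 of 241) gives about 85%, which suggests the headline figure is based on polarity rather than role.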

representative citing papers

Privacy Auditing with Zero (0) Training Run

cs.CR · 2026-05-14 · unverdicted · novelty 8.0

Zero-Run auditing supplies valid lower bounds on differential privacy parameters from fixed member and non-member datasets by modeling and correcting distribution-shift confounding via causal-inference techniques.

Effective Context in Transformers: An Analysis of Fragmentation and Tokenization

cs.LG · 2026-05-13 · unverdicted · novelty 8.0

Fragmentation strictly raises optimal finite-context log-loss on Markov sources while tokenization can make a short token window equivalent to a longer source window under reliability and compression conditions.

Grid Games: The Power of Multiple Grids for Quantizing Large Language Models

cs.LG · 2026-05-12 · accept · novelty 8.0

Allowing each quantization group to select among multiple 4-bit grids improves accuracy over single-grid FP4 for both post-training and pre-training of LLMs.

Steering Without Breaking: Mechanistically Informed Interventions for Discrete Diffusion Language Models

cs.LG · 2026-05-08 · unverdicted · novelty 8.0

Adaptive scheduling of interventions in discrete diffusion language models, timed to attribute-specific commitment schedules discovered with sparse autoencoders, delivers precise multi-attribute steering up to 93% strength while preserving generation quality.

When and Why SignSGD Outperforms SGD: A Theoretical Study Based on $\ell_1$-norm Lower Bounds

cs.LG · 2026-05-07 · unverdicted · novelty 8.0

SignSGD provably beats SGD by a factor of d under sparse noise via matched ℓ1-norm upper and lower bounds, with an equivalent result for Muon on matrices, and this predicts faster GPT-2 pretraining.

Backdoor Attacks on Decentralised Post-Training

cs.CR · 2026-03-31 · conditional · novelty 8.0 · 2 refs

An adversary controlling an intermediate pipeline stage in decentralized LLM post-training can inject a backdoor that reduces alignment from 80% to 6%, with the backdoor persisting in 60% of cases even after subsequent safety training.

Model Context Protocol (MCP) at First Glance: Studying the Security and Maintainability of MCP Servers

cs.SE · 2025-06-16 · conditional · novelty 8.0

First study of 1,899 MCP servers finds eight distinct vulnerabilities (only three traditional), 7.2% with general issues, 5.5% with tool poisoning, and 66% with code smells, urging MCP-specific security practices.

BEAVER: An Enterprise Benchmark for Text-to-SQL

cs.CL · 2024-09-03 · unverdicted · novelty 8.0

BEAVER is the first text-to-SQL benchmark from private enterprise data warehouses, revealing SOTA agentic frameworks achieve only 10.8% accuracy on complex real-world queries.

MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?

cs.CV · 2024-08-23 · conditional · novelty 8.0

MME-RealWorld is the largest manually annotated high-resolution benchmark for MLLMs, where even the best models achieve less than 60% accuracy on challenging real-world tasks.

AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents

cs.CR · 2024-06-19 · unverdicted · novelty 8.0

AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.

AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments

cs.HC · 2024-05-13 · conditional · novelty 8.0

AgentClinic is a multimodal agent benchmark demonstrating that LLM diagnostic accuracy on MedQA drops to below one-tenth in sequential clinical simulations, with Claude-3.5 leading and large tool-use differences across models.

ORPO: Monolithic Preference Optimization without Reference Model

cs.CL · 2024-03-12 · conditional · novelty 8.0

ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.

Bridging Language and Items for Retrieval and Recommendation: Benchmarking LLMs as Semantic Encoders

cs.IR · 2024-03-06 · unverdicted · novelty 8.0

BLaIR is a new benchmark and 570M-review dataset showing that LLM performance rankings on recommendation tasks have little correlation with rankings on general embedding benchmarks like MTEB.

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

cs.LG · 2023-12-01 · unverdicted · novelty 8.0

Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

cs.CL · 2023-11-27 · unverdicted · novelty 8.0

MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.

Tree of Thoughts: Deliberate Problem Solving with Large Language Models

cs.CL · 2023-05-17 · accept · novelty 8.0

Tree of Thoughts enables language models to solve complex planning tasks by generating, evaluating, and searching over coherent intermediate thoughts in a tree, raising Game of 24 success from 4% to 74% with GPT-4.

API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs

cs.CL · 2023-04-14 · conditional · novelty 8.0

API-Bank is a new benchmark and training dataset for tool-augmented LLMs that shows fine-tuned models can approach GPT-3.5 tool-use effectiveness.

Instruction Tuning with GPT-4

cs.CL · 2023-04-06 · unverdicted · novelty 8.0

GPT-4-generated instruction data produces superior zero-shot performance in finetuned LLaMA models versus prior state-of-the-art data.

A Sensitivity-Aware Test Collection for Search Among Personal Information

cs.IR · 2026-06-25 · accept · novelty 7.0

A new sensitivity-labeled test collection is released from Enron emails with crowdsourced queries, relevance judgments, and LLM extensions for evaluating sensitivity-aware search.

PatternGSL: A Structured Specification Language for Template-Free and Simulation-Ready 3D Garments

cs.CV · 2026-06-23 · unverdicted · novelty 7.0

PatternGSL is a new template-free specification language for complete sewing patterns that enables direct single-image prediction of simulation-ready garments via a vision-language model, supported by a new 300K paired dataset.

Moving Beyond Diversity: Visual Token Pruning as Subspace Reconstruction for Efficient VLMs

cs.CV · 2026-06-17 · unverdicted · novelty 7.0

SPARE reformulates visual token pruning as column subset selection to minimize reconstruction error and uses anti-relevance for context-aware selection in VLMs.

APEX4: Efficient Pure W4A4 LLM Inference via Intra-SM Compute Rebalancing

cs.DC · 2026-06-07 · conditional · novelty 7.0

APEX4 co-designs pure INT4 GEMM kernels with ρ-aware granularity adaptation to deliver up to 2.09× end-to-end speedup on GPUs with low ρ while keeping LLaMA-2-70B perplexity within 0.63 of FP16.

End-to-End Text Line Detection and Ordering

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

Orli is an autoregressive image-to-sequence model that jointly detects text lines and determines their reading order on historical documents via chord-frame baselines, trained on 196k pages across ten scripts.

When Knowledge Is Not Free: Cost-Aware Evidence Selection in Retrieval-Augmented Generation

cs.CL · 2026-06-01 · unverdicted · novelty 7.0

Defines cost-aware RAG with evidence cost tiers and shows static selectors are brittle while agentic LLM-based selection is promising but model-dependent.

citing papers explorer

Showing 50 of 1089 citing papers.

TempCompass: Do Video LLMs Really Understand Videos?

cs.CV · 2024-03-01 · unverdicted · none · ref 122 · internal anchor

TempCompass benchmark reveals that state-of-the-art Video LLMs have poor ability to perceive temporal aspects such as speed, direction, and ordering in videos.

VADv2: End-to-End Vectorized Autonomous Driving via Probabilistic Planning

cs.CV · 2024-02-20 · unverdicted · none · ref 47 · internal anchor

VADv2 introduces a probabilistic planning model that discretizes the high-dimensional action space into tokens, interacts them with scene tokens to predict action distributions, and reports SOTA closed-loop results on CARLA Town05 and Bench2Drive.

DoRA: Weight-Decomposed Low-Rank Adaptation

cs.CL · 2024-02-14 · accept · none · ref 112 · internal anchor

DoRA improves LoRA by decomposing weights into magnitude and direction and updating only direction with low-rank matrices, closing much of the gap to full fine-tuning.

Crystallizing Schemas with Teleoscope: Thematic Curation of Large Text Corpora on Reddit

cs.HC · 2024-02-09 · unverdicted · none · ref 64 · internal anchor

Teleoscope enables thematic curation of large Reddit corpora via interactive refinement, with three deployments indicating benefits in serendipitous keyword discovery, search saturation confidence, and collaborative curation discussions.

A Roadmap to Pluralistic Alignment

cs.AI · 2024-02-07 · unverdicted · none · ref 177 · internal anchor

The paper formalizes three types of pluralistic AI models and three benchmark classes, arguing that current alignment techniques may reduce rather than increase distributional pluralism.

KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

cs.CL · 2024-02-05 · conditional · none · ref 18 · internal anchor

KIVI applies asymmetric 2-bit quantization to KV cache with per-channel keys and per-token values, reducing memory 2.6x and boosting throughput up to 3.47x with near-identical quality on Llama, Falcon, and Mistral.

MoE-LLaVA: Mixture of Experts for Large Vision-Language Models

cs.CV · 2024-01-29 · conditional · none · ref 35 · internal anchor

MoE-LLaVA applies mixture-of-experts sparsity to LVLMs via MoE-Tuning, delivering LLaVA-1.5-7B level visual understanding and better hallucination resistance with only ~3B active parameters.

Corrective Retrieval Augmented Generation

cs.CL · 2024-01-29 · unverdicted · none · ref 34 · internal anchor

CRAG improves RAG robustness via a retrieval quality evaluator that triggers web augmentation and a decompose-recompose filter to focus on relevant information, yielding better results on short- and long-form generation tasks.

SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents

cs.HC · 2024-01-17 · unverdicted · none · ref 97 · internal anchor

SeeClick improves visual GUI agents via GUI grounding pre-training on automatically curated data and introduces the ScreenSpot benchmark, with results indicating that stronger grounding boosts downstream task performance.

Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation

cs.RO · 2023-12-20 · conditional · none · ref 146 · internal anchor

A GPT-style model pre-trained on large video datasets achieves 94.9% success on CALVIN multi-task manipulation and 85.4% zero-shot generalization, outperforming prior baselines.

Gemini: A Family of Highly Capable Multimodal Models

cs.CL · 2023-12-19 · conditional · none · ref 106 · internal anchor

Gemini Ultra reaches human-expert performance on MMLU for the first time and sets new state-of-the-art results on 30 of 32 benchmarks, including all 20 multimodal ones tested.

Gated Linear Attention Transformers with Hardware-Efficient Training

cs.LG · 2023-12-11 · unverdicted · none · ref 93 · internal anchor

Gated linear attention Transformers achieve competitive language modeling results with linear-time inference, superior length generalization, and higher training throughput than Mamba.

ASVD: Activation-aware Singular Value Decomposition for Compressing Large Language Models

cs.CL · 2023-12-10 · unverdicted · none · ref 21 · internal anchor

ASVD compresses LLMs by 10-30% and KV caches by 50% via activation-aware SVD that absorbs outliers into transformed weights and calibrates per-layer sensitivity.

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

cs.CV · 2023-11-28 · accept · none · ref 72 · internal anchor

MVBench is a benchmark of 20 temporal video understanding tasks built by transforming static tasks into dynamic ones, with VideoChat2 outperforming prior MLLMs by over 15%.

The Falcon Series of Open Language Models

cs.CL · 2023-11-28 · conditional · none · ref 111 · internal anchor

Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

cs.CV · 2023-11-16 · unverdicted · none · ref 68 · internal anchor

Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.

Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

eess.AS · 2023-11-14 · unverdicted · none · ref 34 · internal anchor

Qwen-Audio trains a unified model on diverse audio and tasks with hierarchical tags to enable strong zero-shot performance on audio understanding benchmarks and multi-turn audio chat.

Vision-Language Foundation Models as Effective Robot Imitators

cs.RO · 2023-11-02 · conditional · none · ref 20 · internal anchor

RoboFlamingo adapts open-source vision-language models for robot manipulation tasks via single-step comprehension plus an explicit policy head, outperforming prior methods on benchmarks with only light fine-tuning.

SALMONN: Towards Generic Hearing Abilities for Large Language Models

cs.SD · 2023-10-20 · unverdicted · none · ref 48 · internal anchor

SALMONN integrates speech and audio encoders with a text-based LLM to process general audio inputs, achieve competitive results on trained tasks, and exhibit emergent cross-modal abilities.

BitNet: Scaling 1-bit Transformers for Large Language Models

cs.CL · 2023-10-17 · unverdicted · none · ref 17 · internal anchor

BitNet shows that 1-bit Transformers can match the performance of 8-bit and FP16 models on language modeling with much smaller memory footprint and energy use, while following a similar scaling law.

Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs

cs.CL · 2023-10-03 · conditional · none · ref 97 · internal anchor

FastGen adaptively compresses LLM KV caches via lightweight attention profiling: evicting long-range contexts on local heads, non-special tokens on special-token heads, and retaining full caches on broad-attention heads, yielding substantial memory savings with negligible quality loss.

GPT-Driver: Learning to Drive with GPT

cs.CV · 2023-10-02 · conditional · none · ref 15 · internal anchor

GPT-3.5 is turned into an autonomous-vehicle motion planner by representing driving scenes and trajectories as language tokens and applying a prompting-reasoning-finetuning pipeline, with results shown on nuScenes.

Analyzing and Mitigating Object Hallucination in Large Vision-Language Models

cs.LG · 2023-10-01 · conditional · none · ref 17 · internal anchor

LURE reduces object hallucination in LVLMs by 23% via post-hoc revision informed by co-occurrence, uncertainty, and text position analysis.

Efficient Streaming Language Models with Attention Sinks

cs.CL · 2023-09-29 · accept · none · ref 51 · internal anchor

StreamingLLM lets finite-window LLMs generalize to infinite-length sequences by retaining initial-token KV states as attention sinks, enabling stable streaming inference up to 4M tokens.

ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving

cs.CL · 2023-09-29 · conditional · none · ref 43 · internal anchor

ToRA trains language models on interactive tool-use trajectories with imitation learning and output shaping to integrate reasoning and external tools, yielding 13-19% gains on math datasets and new highs like 44.6% on MATH for a 7B model.

Aligning Large Multimodal Models with Factually Augmented RLHF

cs.CV · 2023-09-25 · conditional · none · ref 31 · internal anchor

Factually Augmented RLHF aligns large multimodal models to reduce hallucinations, reaching 94% of GPT-4 on LLaVA-Bench and 60% improvement on the new MMHAL-BENCH.

MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models

cs.CL · 2023-09-21 · conditional · none · ref 69 · internal anchor

Bootstrapping math questions via rewriting creates MetaMathQA; fine-tuning LLaMA-2 on it yields 66.4% on GSM8K for 7B and 82.3% for 70B, beating prior same-size models by large margins.

Chain-of-Verification Reduces Hallucination in Large Language Models

cs.CL · 2023-09-20 · unverdicted · none · ref 183 · internal anchor

Chain-of-Verification reduces hallucinations in large language models by drafting responses, planning independent verification questions, answering them separately, and generating a final verified output.

Language Modeling Is Compression

cs.LG · 2023-09-19 · accept · none · ref 15 · internal anchor

Large language models serve as strong general-purpose lossless compressors for text, images, and audio, outperforming domain-specific methods and revealing insights into scaling, tokenization, and in-context learning.

MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning

cs.CL · 2023-09-11 · conditional · none · ref 44 · internal anchor

MAmmoTH models trained via hybrid CoT-PoT instruction tuning on MathInstruct outperform prior open-source LLMs by 16-32% average accuracy on nine math datasets, reaching 33% and 44% on MATH for 7B and 34B scales.

Textbooks Are All You Need II: phi-1.5 technical report

cs.CL · 2023-09-11 · unverdicted · none · ref 22 · internal anchor

phi-1.5 is a 1.3B parameter model trained on synthetic textbook data that matches the reasoning performance of models five times larger on natural language, math, and basic coding tasks.

DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models

cs.CL · 2023-09-07 · conditional · none · ref 55 · internal anchor

DoLa reduces hallucinations in LLMs by contrasting logits from later versus earlier layers during decoding, improving truthfulness on TruthfulQA by 12-17 absolute points without fine-tuning or retrieval.

Baseline Defenses for Adversarial Attacks Against Aligned Language Models

cs.LG · 2023-09-01 · conditional · none · ref 55 · internal anchor

Baseline defenses including perplexity-based detection, input preprocessing, and adversarial training offer partial robustness to text adversarial attacks on LLMs, with challenges arising from weak discrete optimizers.

YaRN: Efficient Context Window Extension of Large Language Models

cs.CL · 2023-08-31 · unverdicted · none · ref 14 · internal anchor

YaRN extends the context window of RoPE-based LLMs like LLaMA more efficiently than prior methods, using 10x fewer tokens and 2.5x fewer steps while surpassing state-of-the-art performance and enabling extrapolation beyond fine-tuning lengths.

A Survey on Large Language Model based Autonomous Agents

cs.AI · 2023-08-22 · accept · none · ref 9 · internal anchor

A survey of LLM-based autonomous agents that proposes a unified framework for their construction and reviews applications in social science, natural science, and engineering along with evaluation methods and future directions.

"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models

cs.CR · 2023-08-07 · unverdicted · none · ref 81 · internal anchor

Real-world jailbreak prompts collected from the wild achieve up to 0.95 attack success rates against major LLMs including GPT-4, with some persisting for over 240 days.

S$^3$: Social-network Simulation System with Large Language Model-Empowered Agents

cs.SI · 2023-07-27 · unverdicted · none · ref 39 · internal anchor

S³ uses LLM agents to simulate social networks by modeling emotion, attitude, and interaction, producing emergent propagation phenomena with promising accuracy on real data.

MMBench: Is Your Multi-modal Model an All-around Player?

cs.CV · 2023-07-12 · accept · none · ref 46 · internal anchor

MMBench is a new bilingual benchmark that uses curated questions, CircularEval, and LLM-assisted answer conversion to provide objective, fine-grained evaluation of vision-language models.

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

cs.CV · 2023-07-04 · conditional · none · ref 50 · internal anchor

SDXL improves upon prior Stable Diffusion versions through a larger UNet backbone, dual text encoders, novel conditioning, and a refinement model, producing higher-fidelity images competitive with black-box state-of-the-art generators.

Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

cs.CV · 2023-06-26 · accept · none · ref 19 · internal anchor

A new dataset of 400k visual instructions including negative examples at three semantic levels reduces hallucinations in models like MiniGPT-4 when used for fine-tuning while improving benchmark performance.

H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models

cs.LG · 2023-06-24 · unverdicted · none · ref 40 · internal anchor

H2O evicts non-heavy-hitter tokens from the KV cache using a dynamic submodular policy, retaining recent and frequent-co-occurrence tokens to reduce memory while preserving accuracy.

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

cs.CV · 2023-06-23 · unverdicted · none · ref 43 · internal anchor

MME is a manually annotated benchmark evaluating MLLMs on perception and cognition across 14 subtasks to avoid data leakage and support fair model comparisons.

MiniLLM: On-Policy Distillation of Large Language Models

cs.CL · 2023-06-14 · conditional · none · ref 22 · internal anchor

MiniLLM distills large language models into smaller ones via reverse KL divergence and on-policy optimization, yielding higher-quality responses with lower exposure bias than standard KD baselines.

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

cs.CL · 2023-06-09 · accept · none · ref 39 · internal anchor

GPT-4 as an LLM judge achieves over 80% agreement with human preferences on MT-Bench and Chatbot Arena, matching human agreement levels and providing a scalable evaluation method.

Orca: Progressive Learning from Complex Explanation Traces of GPT-4

cs.CL · 2023-06-05 · conditional · none · ref 10 · internal anchor

A 13B model called Orca learns detailed reasoning from GPT-4 explanation traces and reaches parity with ChatGPT on Big-Bench Hard while outperforming other 13B models.

The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only

cs.CL · 2023-06-01 · unverdicted · none · ref 39 · internal anchor

Properly filtered web data from CommonCrawl alone trains LLMs that significantly outperform models trained on The Pile, with 600 billion tokens and 1.3B/7.5B parameter models released.

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

cs.CL · 2023-06-01 · conditional · none · ref 30 · internal anchor

AWQ quantizes LLM weights to low bits by scaling salient channels based on activation statistics, outperforming prior methods on language, coding, math, and multi-modal benchmarks.

LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day

cs.CV · 2023-06-01 · unverdicted · none · ref 40 · internal anchor

LLaVA-Med is created via curriculum fine-tuning on PubMed figure-caption pairs and GPT-4 self-instructed data, achieving competitive or better results than prior supervised models on three biomedical VQA benchmarks.

Scaling Data-Constrained Language Models

cs.CL · 2023-05-25 · conditional · none · ref 118 · internal anchor

Repeating training data up to 4 epochs yields negligible loss increase versus unique data for fixed compute, and a new scaling law accounts for the decaying value of repeated tokens and excess parameters.

The False Promise of Imitating Proprietary LLMs

cs.CL · 2023-05-25 · conditional · none · ref 253 · internal anchor

Finetuning open LMs on ChatGPT outputs creates models that mimic style and fool human raters but fail to close the performance gap to proprietary systems on tasks not well-represented in the imitation data.
