Mistral 7B
Pith reviewed 2026-05-24 06:09 UTC · model grok-4.3
The pith
A 7-billion-parameter model outperforms Llama 2 13B on every benchmark and Llama 1 34B on reasoning, mathematics, and code generation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Mistral 7B v0.1 is a 7-billion-parameter model that uses grouped-query attention and sliding window attention to outperform Llama 2 13B across all evaluated benchmarks and Llama 1 34B in reasoning, mathematics, and code generation. The instruction-tuned Mistral 7B Instruct version surpasses Llama 2 13B Chat on both human and automated benchmarks. The models are released under the Apache 2.0 license.
What carries the argument
Grouped-query attention paired with sliding window attention, which together reduce inference cost while supporting long sequences.
If this is right
- Models of 7B parameters can exceed 13B-parameter models on common tasks when attention mechanisms are adjusted.
- Inference speed improves without sacrificing sequence length handling.
- Instruction tuning applied to the base model produces a chat version stronger than the corresponding Llama 2 variant.
- Open release under Apache 2.0 allows direct use and further fine-tuning by others.
Where Pith is reading between the lines
- Efficiency-focused designs may shift emphasis from raw scale toward architecture in future model development.
- Smaller models become more practical for on-device or low-resource deployment.
- Continued benchmark saturation could prompt creation of harder, more leakage-resistant evaluation sets.
Load-bearing premise
The chosen benchmarks and test sets measure genuine downstream usefulness and contain no overlap with the training data.
What would settle it
A new benchmark suite free of training-data overlap on which Mistral 7B scores below Llama 2 13B, or direct evidence that the original test sets leaked into training.
Figures
read the original abstract
We introduce Mistral 7B v0.1, a 7-billion-parameter language model engineered for superior performance and efficiency. Mistral 7B outperforms Llama 2 13B across all evaluated benchmarks, and Llama 1 34B in reasoning, mathematics, and code generation. Our model leverages grouped-query attention (GQA) for faster inference, coupled with sliding window attention (SWA) to effectively handle sequences of arbitrary length with a reduced inference cost. We also provide a model fine-tuned to follow instructions, Mistral 7B -- Instruct, that surpasses the Llama 2 13B -- Chat model both on human and automated benchmarks. Our models are released under the Apache 2.0 license.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Mistral 7B, a 7B-parameter language model that outperforms Llama 2 13B across all evaluated benchmarks and Llama 1 34B on reasoning, mathematics, and code generation. It employs grouped-query attention (GQA) and sliding window attention (SWA) for inference efficiency and long-sequence handling. An instruction-tuned variant (Mistral 7B -- Instruct) is also presented and shown to surpass Llama 2 13B -- Chat on human and automated benchmarks. The models are released under Apache 2.0.
Significance. If the empirical results hold, the work demonstrates that targeted architectural choices can enable smaller models to match or exceed larger ones on standard tasks, with direct implications for efficient deployment. The open release of weights supports independent verification and extension, which strengthens the contribution.
major comments (1)
- [Evaluation] Evaluation section: the headline claim that Mistral 7B outperforms Llama 2 13B on every reported benchmark rests entirely on the numerical scores, yet the manuscript provides no description of n-gram decontamination, membership-inference checks, or confirmation that the test sets (MMLU, HumanEval, GSM8K, etc.) were held out from training data. This directly affects the reliability of the reported margins.
minor comments (2)
- No training corpus composition, token count, or hyperparameter details are supplied, limiting reproducibility and contextualization of the efficiency claims.
- Benchmark tables lack error bars or multiple-run statistics, making it impossible to assess whether the observed differences are statistically meaningful.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the evaluation section. We address the concern below and will incorporate clarifications in the revised manuscript.
read point-by-point responses
-
Referee: [Evaluation] Evaluation section: the headline claim that Mistral 7B outperforms Llama 2 13B on every reported benchmark rests entirely on the numerical scores, yet the manuscript provides no description of n-gram decontamination, membership-inference checks, or confirmation that the test sets (MMLU, HumanEval, GSM8K, etc.) were held out from training data. This directly affects the reliability of the reported margins.
Authors: We agree that the manuscript lacks explicit details on these points, which is a valid concern for transparency. In the revised version we will add a dedicated paragraph in the Evaluation section describing our internal data curation process, including n-gram overlap checks performed to reduce contamination with the listed benchmarks and confirmation that the test sets were excluded from training. We did not run membership-inference attacks, as they are not standard practice in the majority of contemporaneous LLM papers and would require substantial additional compute; we will note this limitation explicitly. revision: yes
Circularity Check
No circularity: empirical model release with benchmark comparisons
full rationale
The manuscript introduces Mistral 7B, describes architectural choices (GQA, SWA), and reports empirical benchmark results against prior models. No derivations, equations, fitted parameters presented as predictions, or load-bearing self-citations appear. The central claim is a direct empirical comparison of released weights; it is self-contained against external benchmarks and contains no steps that reduce to inputs by construction.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 60 Pith papers
-
RTI-Bench: A Structured Dataset for Indian Right-to-Information Decision Analysis
RTI-Bench is the first publicly released structured dataset of CIC administrative decisions with outcome labels, exemption citations, IRAC reasoning, and timelines, built from 1,218 corpus cases and 298 PDFs, achievin...
-
Privacy Auditing with Zero (0) Training Run
Zero-Run auditing supplies valid lower bounds on differential privacy parameters from fixed member and non-member datasets by modeling and correcting distribution-shift confounding via causal-inference techniques.
-
Crafting Reversible SFT Behaviors in Large Language Models
LCDD creates sparse carriers for SFT behaviors that SFT-Eraser can reverse, with ablations showing the sparse structure enables causal control.
-
HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization
HeadQ removes 84-94% of excess perplexity from 2-bit key quantization by storing low-rank residuals in a calibration-learned query basis for score-space correction and using A²-weighted distortion for values.
-
DurableUn: Quantization-Induced Recovery Attacks in Machine Unlearning
INT4 quantization recovers up to 22 times more forgotten training data in unlearned LLMs, and the proposed DURABLEUN-SAF method is the first to maintain forgetting across BF16, INT8, and INT4 precisions.
-
Architecture Determines Observability of Transformers
Certain transformer architectures lose internal linear signals for decision quality during training, making observability an architecture-dependent property rather than a universal one.
-
Backdoor Attacks on Decentralised Post-Training
An adversary controlling an intermediate pipeline stage in decentralized LLM post-training can inject a backdoor that reduces alignment from 80% to 6%, with the backdoor persisting in 60% of cases even after subsequen...
-
CacheTrap: Unveiling a Stealthier Gray-Box Trojan against LLMs
CacheTrap achieves 100% targeted attack success on five open-source LLMs by using an efficient search to locate and flip a single bit in the KV cache as a transient trigger, while preserving normal accuracy without th...
-
MediQAl: A French Medical Question Answering Dataset for Knowledge and Reasoning Evaluation
MediQAl is a new French medical QA benchmark with 32k exam-sourced questions in three formats and cognitive labels, evaluated on 14 LLMs to reveal gaps between factual recall and reasoning performance.
-
Large Language Diffusion Models
LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
-
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?
MME-RealWorld is the largest manually annotated high-resolution benchmark for MLLMs, where even the best models achieve less than 60% accuracy on challenging real-world tasks.
-
LiveBench: A Challenging, Contamination-Limited LLM Benchmark
LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.
-
ORPO: Monolithic Preference Optimization without Reference Model
ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.
-
Evaluating Very Long-Term Conversational Memory of LLM Agents
Creates LoCoMo benchmark dataset for very long-term LLM conversational memory and shows current models struggle with lengthy dialogues and long-range temporal dynamics.
-
Convergence Without Understanding: When Language Models Agree on Representations but Disagree on Reasoning
Representational convergence across 16 LLMs on 800 reasoning problems is stronger for failed tasks and pre-decision stages but shows minimal causal influence on predictions, pointing to shared processing constraints o...
-
Layer-wise Token Compression for Efficient Document Reranking
Layer-wise Token Compression applies adaptive token pooling at middle transformer layers for cross-encoder rerankers, preserving MS MARCO ranking quality while raising QPS up to 25% on passages and 116% on documents, ...
-
Layer-wise Token Compression for Efficient Document Reranking
Layer-wise Token Compression applies adaptive pooling at middle transformer layers to increase QPS by up to 116% on document ranking with little or no loss in quality.
-
What Do Biomedical NER and Entity Linking Benchmarks Measure? A Corpus-Centric Diagnostic Framework
A corpus-centric framework diagnoses scale, structure, overlap, metadata, and terminology properties across nine biomedical NER/EL corpora, showing substantial differences that common statistics fail to capture.
-
Trust No Tool: Evaluating and Defending LLM Agents under Untrusted Tool Feedback
Presents TRUST-Bench benchmark for hidden-trigger tool compromises in LLM agents and VISTA-Guard framework for trajectory-aware risk scoring of final actions under untrusted feedback.
-
Conflict-Free Replicated Data Types for Neural Network Model Merging: A Two-Layer Architecture Enabling CRDT-Compliant Model Merging Across 26 Strategies
A two-layer CRDT architecture wraps any of 26 neural network merge strategies to deliver strong eventual consistency in distributed model merging.
-
GQA-{\mu}P: The maximal parameterization update for grouped query attention
Derives μP scalings for GQA via promoted spectral-norm definition of feature learning and a modified norm preserving scaling laws for non-full-rank matrices, with experiments showing learning-rate transfer.
-
From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents
A dataset-agnostic framework converts text tool-calling benchmarks to paired audio versions via TTS and noise, showing model-dependent performance with small text-to-voice gaps of 1.8-4.8 points on Confetti and When2Call.
-
EndPrompt: Efficient Long-Context Extension via Terminal Anchoring
EndPrompt induces reliable long-context generalization in LLaMA models from sparse positional supervision via a two-segment short-sequence construction with terminal anchoring.
-
GHGbench: A Unified Multi-Entity, Multi-Task Benchmark for Carbon Emission Prediction
GHGbench is a new multi-entity benchmark for company- and building-level carbon emission prediction that shows building tasks are harder, out-of-distribution gaps dominate, and multimodal data aids generalization.
-
Inducing Artificial Uncertainty in Language Models
Inducing artificial uncertainty on trivial tasks allows training probes that achieve higher calibration on hard data than standard approaches while retaining performance on easy data.
-
TokAlign++: Advancing Vocabulary Adaptation via Better Token Alignment
TokAlign++ learns token alignments between LLM vocabularies from monolingual representations to enable faster adaptation, better text compression, and effective token-level distillation across 15 languages with minimal steps.
-
Uncovering Symmetry Transfer in Large Language Models via Layer-Peeled Optimization
Symmetries in next-token prediction targets induce corresponding geometric symmetries such as circulant matrices and equiangular tight frames in the optimal weights and embeddings of a layer-peeled LLM surrogate model.
-
Towards Automated Air Traffic Safety Assessment Around Non-Towered Airports Using Large Language Models
Large language models achieve macro F1 scores above 0.85 on binary nominal-versus-danger classification from CTAF radio transcripts and METAR weather data using a new synthetic dataset with a 12-category hazard taxonomy.
-
BadSKP: Backdoor Attacks on Knowledge Graph-Enhanced LLMs with Soft Prompts
BadSKP poisons graph node embeddings to steer soft prompts in KG-enhanced LLMs, achieving high attack success rates where text-channel backdoors fail due to semantic anchoring.
-
SoK: Unlearnability and Unlearning for Model Dememorization
The first integrated taxonomy, empirical study of interplay and shallow dememorization, plus a theoretical guarantee on dememorization depth for certified unlearning.
-
Deep Minds and Shallow Probes
Symmetry under affine reparameterizations of hidden coordinates selects a unique hierarchy of shallow coordinate-stable probes and a probe-visible quotient for cross-model transfer.
-
SLIM: Sparse Latent Steering for Interpretable and Property-Directed LLM-Based Molecular Editing
SLIM decomposes LLM hidden states via sparse autoencoders with learnable gates to enable precise, interpretable steering of molecular properties, yielding up to 42.4-point gains on the MolEditRL benchmark.
-
Cross-Family Universality of Behavioral Axes via Anchor-Projected Representations
Behavioral directions from one LLM family transfer to others via projection into a shared anchor coordinate space, yielding 0.83 ten-way detection accuracy and steering effects up to 0.46% on held-out models.
-
Entropy-informed Decoding: Adaptive Information-Driven Branching
EDEN adaptively sets branching factor proportional to next-token entropy, achieving better accuracy per expansion than fixed beam search while providing a proof that monotone entropy-based branching outperforms any fi...
-
Positional LSH: Binary Block Matrix Approximation for Attention with Linear Biases
ALiBi bias is the expectation of positional LSH-induced block masks, yielding spectral and max-norm approximation bounds that reduce long-context biased attention to randomized short-context unbiased attention.
-
Theoretical Limits of Language Model Alignment
The maximum reward gain under KL-regularized LM alignment is a Jeffreys divergence term, estimable as covariance from base samples, with best-of-N approaching the theoretical limit.
-
VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing
VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conve...
-
The First Token Knows: Single-Decode Confidence for Hallucination Detection
First-token normalized entropy (phi_first) from one greedy decode reaches mean AUROC 0.820 for hallucination detection, matching or exceeding semantic self-consistency (0.793) and surface self-consistency (0.791) acro...
-
Delta-Based Neural Architecture Search: LLM Fine-Tuning via Code Diffs
Fine-tuned 7B LLMs generating unified diffs for neural architecture refinement achieve 66-75% valid rates and 64-66% mean first-epoch accuracy, outperforming full-generation baselines by large margins while cutting ou...
-
Leveraging Pretrained Language Models as Energy Functions for Glauber Dynamics Text Diffusion
Pretrained language models are used as energy functions for Glauber dynamics in discrete text diffusion, improving generation quality over prior diffusion LMs and matching autoregressive models on benchmarks and reaso...
-
FinSTaR: Towards Financial Reasoning with Time Series Reasoning Models
FinSTaR reaches 78.9% accuracy on a new financial time series reasoning benchmark by applying Compute-in-CoT for deterministic assessments and Scenario-Aware CoT for stochastic predictions.
-
The Right Answer, the Wrong Direction: Why Transformers Fail at Counting and How to Fix It
Transformers encode counts correctly internally but fail to read them out due to misalignment with digit output directions, fixable by updating 37k output parameters or small LoRA on attention.
-
The Right Answer, the Wrong Direction: Why Transformers Fail at Counting and How to Fix It
Transformers store count information internally but cannot read it out as digits due to near-orthogonal alignment with output-head rows; updating digit rows or applying LoRA to attention layers improves constrained an...
-
How Language Models Process Negation
LLMs implement both attention-based suppression and constructive representations for negation, with construction dominant, despite poor accuracy from late-layer attention shortcuts.
-
DurableUn: Quantization-Induced Recovery Attacks in Machine Unlearning
INT4 quantization recovers forgotten data in unlearned LLMs up to 22x, exposing a trilemma with no existing method solving forgetting, utility, and robustness together; a new sharpness-aware method achieves cross-prec...
-
Embedding-based In-Context Prompt Training for Enhancing LLMs as Text Encoders
EPIC trains LLMs to treat continuous embeddings as in-context prompts, yielding state-of-the-art text embedding performance on MTEB with or without prompts at inference and lower compute.
-
A Multi-View Media Profiling Suite: Resources, Evaluation, and Analysis
Presents MBFC-2025 dataset and multi-view embeddings with fusion methods for media bias and factuality, reporting SOTA results on ACL-2020 and new benchmarks on MBFC-2025.
-
ReLay: Personalized LLM-Generated Plain-Language Summaries for Better Understanding, but at What Cost?
Personalized LLM-generated plain language summaries improve lay readers' comprehension and quality ratings but increase risks of reinforcing biases and introducing hallucinations compared to static expert summaries.
-
Attention Is Where You Attack
ARA jailbreaks safety-aligned LLMs like LLaMA-3 and Mistral by redirecting attention in safety-heavy heads with as few as 5 tokens, achieving 30-36% attack success while ablating the same heads barely affects refusals.
-
One Pass, Any Order: Position-Invariant Listwise Reranking for LLM-Based Recommendation
InvariRank achieves permutation-invariant listwise reranking for LLM-based recommendations via a structured attention mask that blocks cross-candidate interactions and shared positional framing under RoPE, enabling st...
-
Rethinking KV Cache Eviction via a Unified Information-Theoretic Objective
KV cache eviction is unified under an information capacity maximization principle derived from a linear-Gaussian attention surrogate, with CapKV proposed as a leverage-score based implementation that outperforms prior...
-
XGRAG: A Graph-Native Framework for Explaining KG-based Retrieval-Augmented Generation
XGRAG uses graph perturbations to quantify component contributions in GraphRAG and achieves 14.81% better explanation quality than text-based baselines on QA datasets, with correlations to graph centrality.
-
Coverage-Based Calibration for Post-Training Quantization via Weighted Set Cover over Outlier Channels
COVERCAL selects PTQ calibration samples via weighted set cover over outlier channels, with a stylized clipping model showing missed coverage upper-bounds surrogate loss, yielding gains over random and other baselines...
-
Can an MLP Absorb Its Own Skip Connection?
Skip-connected MLPs and residual-free MLPs of equal width represent generically disjoint function classes for common activations, with explicit impossibility proofs and a non-generic absorption condition for ReLU and GELU.
-
Supernodes and Halos: Loss-Critical Hubs in LLM Feed-Forward Layers
In LLM feed-forward networks, the top 1% of channels per layer carry a median 58.7% of loss sensitivity, forming supernodes whose protection enables effective 50% sparsity pruning with much lower perplexity than baselines.
-
Fine-tuning vs. In-context Learning in Large Language Models: A Formal Language Learning Perspective
Fine-tuning shows higher proficiency than in-context learning on in-distribution generalization in formal languages, with equal out-of-distribution performance and diverging inductive biases at high proficiency.
-
Fine-tuning vs. In-context Learning in Large Language Models: A Formal Language Learning Perspective
A controlled formal language task reveals fine-tuning outperforms in-context learning on in-distribution generalization but equals it on out-of-distribution, with ICL showing greater sensitivity to model size and toke...
-
Evaluating Temporal Consistency in Multi-Turn Language Models
Language models frequently violate temporal scope stability in multi-turn dialogues by drifting toward present-day assumptions even when they possess the correct facts.
-
Provably Secure Steganography Based on List Decoding
List decoding enables a provably secure steganography scheme with higher embedding capacity for LLMs via candidate sets and suffix matching.
-
Serialisation Strategy Matters: How FHIR Data Format Affects LLM Medication Reconciliation
Clinical narrative format beats raw JSON for LLMs up to 8B parameters on medication reconciliation but raw JSON wins at 70B scale, with omissions as the main error type.
Reference graph
Works this paper leans on
-
[1]
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[2]
Program Synthesis with Large Language Models
Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[3]
Longformer: The Long-Document Transformer
Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2004
-
[4]
Piqa: Reasoning about phys- ical commonsense in natural language
Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. Piqa: Reasoning about phys- ical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, 2020
work page 2020
-
[5]
Evaluating Large Language Models Trained on Code
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[6]
Generating Long Sequences with Sparse Transformers
Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1904
-
[7]
QuAC : Question Answering in Context
Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. Quac: Question answering in context. arXiv preprint arXiv:1808.07036, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[8]
BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions
Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1905
-
[9]
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[10]
Training Verifiers to Solve Math Word Problems
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[11]
Fu, Stefano Ermon, Atri Rudra, and Christopher Ré
Tri Dao, Daniel Y . Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, 2022
work page 2022
-
[12]
Measuring Massive Multitask Language Understanding
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2009
-
[13]
Measuring Mathematical Problem Solving With the MATH Dataset
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[14]
An empirical analysis of compute-optimal large language model training
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Thomas Hennigan, Eric Noland, Katherine Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karén Simonyan, Erich Elsen, Oriol Vinyals, Jack Rae, and Laurent S...
work page 2022
-
[15]
TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension
Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[16]
Natural questions: a benchmark for question answering research
Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466, 2019. 8
work page 2019
-
[17]
Gonzalez, Hao Zhang, and Ion Stoica
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large lan- guage model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023
work page 2023
-
[18]
xformers: A modular and hackable transformer modelling library
Benjamin Lefaudeux, Francisco Massa, Diana Liskovich, Wenhan Xiong, Vittorio Caggiano, Sean Naren, Min Xu, Jieru Hu, Marta Tintore, Susan Zhang, Patrick Labatut, and Daniel Haziza. xformers: A modular and hackable transformer modelling library. https://github.com/ facebookresearch/xformers, 2022
work page 2022
-
[19]
Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering
Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. arXiv preprint arXiv:1809.02789, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[20]
Code Llama: Open Foundation Models for Code
Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[21]
Winogrande: An adversarial winograd schema challenge at scale
Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106, 2021
work page 2021
-
[22]
SocialIQA: Commonsense Reasoning about Social Interactions
Maarten Sap, Hannah Rashkin, Derek Chen, Ronan LeBras, and Yejin Choi. Socialiqa: Com- monsense reasoning about social interactions. arXiv preprint arXiv:1904.09728, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1904
-
[23]
Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them
Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, , and Jason Wei. Challenging big-bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[24]
CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge
Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. Commonsenseqa: A ques- tion answering challenge targeting commonsense knowledge. arXiv preprint arXiv:1811.00937, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[25]
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timo- thée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[26]
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[27]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017
work page 2017
-
[28]
HellaSwag: Can a Machine Really Finish Your Sentence?
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1905
-
[29]
AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models
Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan. Agieval: A human-centric benchmark for evaluating foundation models. arXiv preprint arXiv:2304.06364, 2023. 9
work page internal anchor Pith review Pith/arXiv arXiv 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.