arXiv: 2310.06825 · v1 · submitted 2023-10-10 · 💻 cs.CL · cs.AI · cs.LG

Mistral 7B

Pith reviewed 2026-05-24 06:09 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords Mistral 7B · language model · grouped-query attention · sliding window attention · benchmarks · Llama comparison · instruction tuning · model release

The pith

A 7-billion-parameter model outperforms Llama 2 13B on every evaluated benchmark and Llama 1 34B on reasoning, mathematics, and code generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Mistral 7B, a 7B-parameter language model that beats substantially larger models on standard evaluations. It attributes the gains to two attention changes: grouped-query attention for quicker inference and sliding window attention for long sequences at lower cost. An instruction-tuned variant also exceeds Llama 2 13B Chat on both human and automatic checks. The models are released under Apache 2.0. The central point is that targeted architectural choices can deliver higher performance than simply increasing parameter count.

Core claim

Mistral 7B v0.1 is a 7-billion-parameter model that uses grouped-query attention and sliding window attention to outperform Llama 2 13B across all evaluated benchmarks and Llama 1 34B in reasoning, mathematics, and code generation. The instruction-tuned Mistral 7B Instruct version surpasses Llama 2 13B Chat on both human and automated benchmarks. The models are released under the Apache 2.0 license.

What carries the argument

Grouped-query attention paired with sliding window attention, which together reduce inference cost while supporting long sequences.
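
As a rough illustration of the sliding window half of this mechanism, the sketch below builds the attention mask in plain Python. The function name `sliding_window_mask` and the toy sizes are illustrative and do not come from the paper. The code also assumes that the window of W positions counts the current token itself; the released implementation may draw that boundary differently.

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """Return mask[i][j], True when query position i may attend to key position j.

    Assumption: the window includes the current token, so position i sees
    positions i - window + 1 through i (causal, never the future).
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


# Toy sizes chosen for readability only.
mask = sliding_window_mask(seq_len=6, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Each row allows at most `window` positions, so the attention work per token is bounded by W rather than growing with the sequence length.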

If this is right

  • Models of 7B parameters can exceed 13B-parameter models on common tasks when attention mechanisms are adjusted.
  • Inference speed improves without sacrificing sequence length handling.
  • Instruction tuning applied to the base model produces a chat version stronger than the corresponding Llama 2 variant.
  • Open release under Apache 2.0 allows direct use and further fine-tuning by others.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Efficiency-focused designs may shift emphasis from raw scale toward architecture in future model development.
  • Smaller models become more practical for on-device or low-resource deployment.
  • Continued benchmark saturation could prompt creation of harder, more leakage-resistant evaluation sets.

Load-bearing premise

The chosen benchmarks and test sets measure genuine downstream usefulness and contain no overlap with the training data.

What would settle it

A new benchmark suite free of training-data overlap on which Mistral 7B scores below Llama 2 13B, or direct evidence that the original test sets leaked into training.

Figures

Figures reproduced from arXiv: 2310.06825 by Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed.

Figure 1: Sliding Window Attention. The number of operations in vanilla attention is quadratic in the sequence length, and the memory increases linearly with the number of tokens. At inference time, this incurs higher latency and smaller throughput due to reduced cache availability. To alleviate this issue, we use sliding window attention: each token can attend to at most W tokens from the previous layer (here, W = …
Figure 2: Rolling buffer cache. The cache has a fixed size of W = 4. Keys and values for position i are stored in position i mod W of the cache. When the position i is larger than W, past values in the cache are overwritten. The hidden state corresponding to the latest generated tokens are colored in orange. Pre-fill and Chunking. When generating a sequence, we need to predict tokens one-by-one, as each token is con…
Figure 3: Pre-fill and chunking. During pre-fill of the cache, long sequences are chunked to limit memory usage. We process a sequence in three chunks, “The cat sat on”, “the mat and saw”, “the dog go to”. The figure shows what happens for the third chunk (“the dog go to”): it attends itself using a causal mask (rightmost block), attends the cache using a sliding window (center block), and does not attend to past to…
Figure 4: Performance of Mistral 7B and different Llama models on a wide range of benchmarks
Figure 5: Results on MMLU, commonsense reasoning, world knowledge and reading comprehension for
Figure 6: Human evaluation of Mistral 7B – Instruct vs Llama 2 13B – Chat Example.
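
The rolling buffer described in the Figure 2 caption can be sketched in a few lines of Python. This is only an illustration of the "position i goes to slot i mod W" rule: the class name `RollingBufferCache` is hypothetical, and plain strings stand in for the key and value tensors a real cache would hold.

```python
class RollingBufferCache:
    """Fixed-size cache in which position i is stored in slot i mod W."""

    def __init__(self, window: int) -> None:
        if window < 1:
            raise ValueError("window must be at least 1")
        self.window = window
        self.slots = [None] * window  # each slot holds (position, entry) or None
        self.next_position = 0

    def append(self, entry) -> None:
        # Once more than W entries have been added, this overwrites the oldest.
        slot = self.next_position % self.window
        self.slots[slot] = (self.next_position, entry)
        self.next_position += 1

    def contents(self):
        """Return the cached (position, entry) pairs in position order."""
        filled = [pair for pair in self.slots if pair is not None]
        return sorted(filled, key=lambda pair: pair[0])


cache = RollingBufferCache(window=4)  # W = 4, as in the caption
for token in ["The", "cat", "sat", "on", "the", "mat"]:
    cache.append(token)

print(cache.slots)       # physical layout after wrap-around
print(cache.contents())  # logical order: the last four positions
```

After six tokens the first two entries have been overwritten, so memory use stays constant at W entries regardless of how long generation runs.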
Original abstract

We introduce Mistral 7B v0.1, a 7-billion-parameter language model engineered for superior performance and efficiency. Mistral 7B outperforms Llama 2 13B across all evaluated benchmarks, and Llama 1 34B in reasoning, mathematics, and code generation. Our model leverages grouped-query attention (GQA) for faster inference, coupled with sliding window attention (SWA) to effectively handle sequences of arbitrary length with a reduced inference cost. We also provide a model fine-tuned to follow instructions, Mistral 7B – Instruct, that surpasses the Llama 2 13B – Chat model both on human and automated benchmarks. Our models are released under the Apache 2.0 license.
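
Grouped-query attention speeds up inference by letting several query heads share one key/value head, which shrinks the key/value cache. The following is a minimal sketch of that head mapping. The function names are hypothetical and the head counts are toy values chosen for illustration, not figures taken from this page.

```python
def kv_head_for_query_head(query_head: int, n_query_heads: int, n_kv_heads: int) -> int:
    """Map a query head index to the key/value head its group shares."""
    if n_query_heads % n_kv_heads != 0:
        raise ValueError("n_query_heads must be a multiple of n_kv_heads")
    if not 0 <= query_head < n_query_heads:
        raise ValueError("query_head out of range")
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size


def kv_cache_ratio(n_query_heads: int, n_kv_heads: int) -> float:
    """Key/value cache size relative to one key/value head per query head."""
    return n_kv_heads / n_query_heads


# Toy configuration: 8 query heads sharing 2 key/value heads.
mapping = [kv_head_for_query_head(h, 8, 2) for h in range(8)]
print(mapping)
print(kv_cache_ratio(8, 2))
```

With 8 query heads and 2 key/value heads, each group of four query heads reads the same cached keys and values, so the cache is a quarter of the size it would be with one key/value head per query head.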

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces Mistral 7B, a 7B-parameter language model that outperforms Llama 2 13B across all evaluated benchmarks and Llama 1 34B on reasoning, mathematics, and code generation. It employs grouped-query attention (GQA) and sliding window attention (SWA) for inference efficiency and long-sequence handling. An instruction-tuned variant (Mistral 7B – Instruct) is also presented and shown to surpass Llama 2 13B – Chat on human and automated benchmarks. The models are released under Apache 2.0.

Significance. If the empirical results hold, the work demonstrates that targeted architectural choices can enable smaller models to match or exceed larger ones on standard tasks, with direct implications for efficient deployment. The open release of weights supports independent verification and extension, which strengthens the contribution.

major comments (1)
  1. [Evaluation] Evaluation section: the headline claim that Mistral 7B outperforms Llama 2 13B on every reported benchmark rests entirely on the numerical scores, yet the manuscript provides no description of n-gram decontamination, membership-inference checks, or confirmation that the test sets (MMLU, HumanEval, GSM8K, etc.) were held out from training data. This directly affects the reliability of the reported margins.
minor comments (2)
  1. No training corpus composition, token count, or hyperparameter details are supplied, limiting reproducibility and contextualization of the efficiency claims.
  2. Benchmark tables lack error bars or multiple-run statistics, making it impossible to assess whether the observed differences are statistically meaningful.
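
The second minor comment asks for error bars. One common way to obtain them from per-question results is a paired bootstrap over test items, sketched below. The function name, the resample count, and the toy outcome vectors are all assumptions made for illustration; none of the numbers come from the paper. The interval reflects only sampling noise in the test set, not variation across training runs.

```python
import random


def paired_bootstrap_ci(correct_a, correct_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for accuracy(A) - accuracy(B).

    correct_a and correct_b are 0/1 outcomes for the same test items,
    in the same order, so each resample keeps the pairing intact.
    """
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("need two non-empty outcome lists of equal length")
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = []
    for _ in range(n_resamples):
        indices = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_a[i] - correct_b[i] for i in indices) / n)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


# Hypothetical outcomes on 100 shared items: A is right on 70, B on 60.
correct_a = [1] * 70 + [0] * 30
correct_b = [1] * 60 + [0] * 40
low, high = paired_bootstrap_ci(correct_a, correct_b)
print(f"95% interval for the accuracy gap: [{low:.3f}, {high:.3f}]")
```

If the resulting interval excludes zero, the gap is unlikely to be an artifact of which questions happened to be in the test set.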

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the evaluation section. We address the concern below and will incorporate clarifications in the revised manuscript.

  1. Referee: [Evaluation] Evaluation section: the headline claim that Mistral 7B outperforms Llama 2 13B on every reported benchmark rests entirely on the numerical scores, yet the manuscript provides no description of n-gram decontamination, membership-inference checks, or confirmation that the test sets (MMLU, HumanEval, GSM8K, etc.) were held out from training data. This directly affects the reliability of the reported margins.

    Authors: We agree that the manuscript lacks explicit details on these points, which is a valid concern for transparency. In the revised version we will add a dedicated paragraph in the Evaluation section describing our internal data curation process, including n-gram overlap checks performed to reduce contamination with the listed benchmarks and confirmation that the test sets were excluded from training. We did not run membership-inference attacks, as they are not standard practice in the majority of contemporaneous LLM papers and would require substantial additional compute; we will note this limitation explicitly.

    revision: yes
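
For readers unfamiliar with the n-gram overlap checks mentioned in the response, here is a minimal sketch of the general idea. It is not the authors' procedure, which the paper does not describe. The whitespace tokenization, the function names, and the toy strings are assumptions made for illustration.

```python
def ngrams(text: str, n: int) -> set:
    """Set of lowercase whitespace-token n-grams; empty if the text is shorter than n."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contaminated_fraction(test_examples, training_docs, n=8):
    """Flag test examples that share at least one n-gram with any training document."""
    if not test_examples:
        raise ValueError("test_examples must not be empty")
    train_ngrams = set()
    for doc in training_docs:
        train_ngrams |= ngrams(doc, n)
    flagged = [example for example in test_examples if ngrams(example, n) & train_ngrams]
    return len(flagged) / len(test_examples), flagged


# Toy data; n is set to 3 only so that these short strings can overlap.
training_docs = ["the quick brown fox jumps over the lazy dog"]
test_examples = ["A quick brown fox appears", "what is two plus two"]
fraction, flagged = contaminated_fraction(test_examples, training_docs, n=3)
print(fraction, flagged)
```

Exact n-gram matching of this kind misses paraphrased or reformatted copies of test items, which is one reason the referee also raises membership-inference checks.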

Circularity Check

0 steps flagged

No circularity: empirical model release with benchmark comparisons

The manuscript introduces Mistral 7B, describes architectural choices (GQA, SWA), and reports empirical benchmark results against prior models. No derivations, equations, fitted parameters presented as predictions, or load-bearing self-citations appear. The central claim is a direct empirical comparison of released weights; it is self-contained against external benchmarks and contains no steps that reduce to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an empirical model release rather than a derivation. No free parameters are fitted inside a mathematical claim; the only implicit assumptions are standard supervised language-model training and the validity of the chosen benchmarks.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. RTI-Bench: A Structured Dataset for Indian Right-to-Information Decision Analysis

    cs.CL 2026-05 accept novelty 8.0

    RTI-Bench is the first publicly released structured dataset of CIC administrative decisions with outcome labels, exemption citations, IRAC reasoning, and timelines, built from 1,218 corpus cases and 298 PDFs, achievin...

  2. Privacy Auditing with Zero (0) Training Run

    cs.CR 2026-05 unverdicted novelty 8.0

    Zero-Run auditing supplies valid lower bounds on differential privacy parameters from fixed member and non-member datasets by modeling and correcting distribution-shift confounding via causal-inference techniques.

  3. Crafting Reversible SFT Behaviors in Large Language Models

    cs.LG 2026-05 unverdicted novelty 8.0

    LCDD creates sparse carriers for SFT behaviors that SFT-Eraser can reverse, with ablations showing the sparse structure enables causal control.

  4. HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization

    cs.LG 2026-05 conditional novelty 8.0

    HeadQ removes 84-94% of excess perplexity from 2-bit key quantization by storing low-rank residuals in a calibration-learned query basis for score-space correction and using A²-weighted distortion for values.

  5. DurableUn: Quantization-Induced Recovery Attacks in Machine Unlearning

    cs.LG 2026-05 conditional novelty 8.0

    INT4 quantization recovers up to 22 times more forgotten training data in unlearned LLMs, and the proposed DURABLEUN-SAF method is the first to maintain forgetting across BF16, INT8, and INT4 precisions.

  6. Architecture Determines Observability of Transformers

    cs.LG 2026-04 unverdicted novelty 8.0

    Certain transformer architectures lose internal linear signals for decision quality during training, making observability an architecture-dependent property rather than a universal one.

  7. Backdoor Attacks on Decentralised Post-Training

    cs.CR 2026-03 conditional novelty 8.0

    An adversary controlling an intermediate pipeline stage in decentralized LLM post-training can inject a backdoor that reduces alignment from 80% to 6%, with the backdoor persisting in 60% of cases even after subsequen...

  8. CacheTrap: Unveiling a Stealthier Gray-Box Trojan against LLMs

    cs.CR 2025-11 conditional novelty 8.0

    CacheTrap achieves 100% targeted attack success on five open-source LLMs by using an efficient search to locate and flip a single bit in the KV cache as a transient trigger, while preserving normal accuracy without th...

  9. MediQAl: A French Medical Question Answering Dataset for Knowledge and Reasoning Evaluation

    cs.CL 2025-07 accept novelty 8.0

    MediQAl is a new French medical QA benchmark with 32k exam-sourced questions in three formats and cognitive labels, evaluated on 14 LLMs to reveal gaps between factual recall and reasoning performance.

  10. Large Language Diffusion Models

    cs.CL 2025-02 unverdicted novelty 8.0

    LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

  11. MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?

    cs.CV 2024-08 conditional novelty 8.0

    MME-RealWorld is the largest manually annotated high-resolution benchmark for MLLMs, where even the best models achieve less than 60% accuracy on challenging real-world tasks.

  12. LiveBench: A Challenging, Contamination-Limited LLM Benchmark

    cs.CL 2024-06 unverdicted novelty 8.0

    LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.

  13. ORPO: Monolithic Preference Optimization without Reference Model

    cs.CL 2024-03 conditional novelty 8.0

    ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.

  14. Evaluating Very Long-Term Conversational Memory of LLM Agents

    cs.CL 2024-02 unverdicted novelty 8.0

    Creates LoCoMo benchmark dataset for very long-term LLM conversational memory and shows current models struggle with lengthy dialogues and long-range temporal dynamics.

  15. Convergence Without Understanding: When Language Models Agree on Representations but Disagree on Reasoning

    cs.CL 2026-05 unverdicted novelty 7.0

    Representational convergence across 16 LLMs on 800 reasoning problems is stronger for failed tasks and pre-decision stages but shows minimal causal influence on predictions, pointing to shared processing constraints o...

  16. Layer-wise Token Compression for Efficient Document Reranking

    cs.IR 2026-05 unverdicted novelty 7.0

    Layer-wise Token Compression applies adaptive token pooling at middle transformer layers for cross-encoder rerankers, preserving MS MARCO ranking quality while raising QPS up to 25% on passages and 116% on documents, ...

  17. Layer-wise Token Compression for Efficient Document Reranking

    cs.IR 2026-05 conditional novelty 7.0

    Layer-wise Token Compression applies adaptive pooling at middle transformer layers to increase QPS by up to 116% on document ranking with little or no loss in quality.

  18. What Do Biomedical NER and Entity Linking Benchmarks Measure? A Corpus-Centric Diagnostic Framework

    cs.CL 2026-05 accept novelty 7.0

    A corpus-centric framework diagnoses scale, structure, overlap, metadata, and terminology properties across nine biomedical NER/EL corpora, showing substantial differences that common statistics fail to capture.

  19. Trust No Tool: Evaluating and Defending LLM Agents under Untrusted Tool Feedback

    cs.CR 2026-05 unverdicted novelty 7.0

    Presents TRUST-Bench benchmark for hidden-trigger tool compromises in LLM agents and VISTA-Guard framework for trajectory-aware risk scoring of final actions under untrusted feedback.

  20. Conflict-Free Replicated Data Types for Neural Network Model Merging: A Two-Layer Architecture Enabling CRDT-Compliant Model Merging Across 26 Strategies

    cs.DC 2026-05 unverdicted novelty 7.0

    A two-layer CRDT architecture wraps any of 26 neural network merge strategies to deliver strong eventual consistency in distributed model merging.

  21. GQA-μP: The maximal parameterization update for grouped query attention

    cs.LG 2026-05 unverdicted novelty 7.0

    Derives μP scalings for GQA via promoted spectral-norm definition of feature learning and a modified norm preserving scaling laws for non-full-rank matrices, with experiments showing learning-rate transfer.

  22. From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents

    cs.CL 2026-05 unverdicted novelty 7.0

    A dataset-agnostic framework converts text tool-calling benchmarks to paired audio versions via TTS and noise, showing model-dependent performance with small text-to-voice gaps of 1.8-4.8 points on Confetti and When2Call.

  23. EndPrompt: Efficient Long-Context Extension via Terminal Anchoring

    cs.CL 2026-05 conditional novelty 7.0

    EndPrompt induces reliable long-context generalization in LLaMA models from sparse positional supervision via a two-segment short-sequence construction with terminal anchoring.

  24. GHGbench: A Unified Multi-Entity, Multi-Task Benchmark for Carbon Emission Prediction

    cs.LG 2026-05 unverdicted novelty 7.0

    GHGbench is a new multi-entity benchmark for company- and building-level carbon emission prediction that shows building tasks are harder, out-of-distribution gaps dominate, and multimodal data aids generalization.

  25. Inducing Artificial Uncertainty in Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    Inducing artificial uncertainty on trivial tasks allows training probes that achieve higher calibration on hard data than standard approaches while retaining performance on easy data.

  26. TokAlign++: Advancing Vocabulary Adaptation via Better Token Alignment

    cs.CL 2026-05 unverdicted novelty 7.0

    TokAlign++ learns token alignments between LLM vocabularies from monolingual representations to enable faster adaptation, better text compression, and effective token-level distillation across 15 languages with minimal steps.

  27. Uncovering Symmetry Transfer in Large Language Models via Layer-Peeled Optimization

    math.OC 2026-05 conditional novelty 7.0

    Symmetries in next-token prediction targets induce corresponding geometric symmetries such as circulant matrices and equiangular tight frames in the optimal weights and embeddings of a layer-peeled LLM surrogate model.

  28. Towards Automated Air Traffic Safety Assessment Around Non-Towered Airports Using Large Language Models

    cs.AI 2026-05 unverdicted novelty 7.0

    Large language models achieve macro F1 scores above 0.85 on binary nominal-versus-danger classification from CTAF radio transcripts and METAR weather data using a new synthetic dataset with a 12-category hazard taxonomy.

  29. BadSKP: Backdoor Attacks on Knowledge Graph-Enhanced LLMs with Soft Prompts

    cs.AI 2026-05 conditional novelty 7.0

    BadSKP poisons graph node embeddings to steer soft prompts in KG-enhanced LLMs, achieving high attack success rates where text-channel backdoors fail due to semantic anchoring.

  30. SoK: Unlearnability and Unlearning for Model Dememorization

    cs.LG 2026-05 conditional novelty 7.0

    The first integrated taxonomy, empirical study of interplay and shallow dememorization, plus a theoretical guarantee on dememorization depth for certified unlearning.

  31. Deep Minds and Shallow Probes

    cs.LG 2026-05 unverdicted novelty 7.0

    Symmetry under affine reparameterizations of hidden coordinates selects a unique hierarchy of shallow coordinate-stable probes and a probe-visible quotient for cross-model transfer.

  32. SLIM: Sparse Latent Steering for Interpretable and Property-Directed LLM-Based Molecular Editing

    cs.LG 2026-05 unverdicted novelty 7.0

    SLIM decomposes LLM hidden states via sparse autoencoders with learnable gates to enable precise, interpretable steering of molecular properties, yielding up to 42.4-point gains on the MolEditRL benchmark.

  33. Cross-Family Universality of Behavioral Axes via Anchor-Projected Representations

    cs.AI 2026-05 unverdicted novelty 7.0

    Behavioral directions from one LLM family transfer to others via projection into a shared anchor coordinate space, yielding 0.83 ten-way detection accuracy and steering effects up to 0.46% on held-out models.

  34. Entropy-informed Decoding: Adaptive Information-Driven Branching

    cs.LG 2026-05 unverdicted novelty 7.0

    EDEN adaptively sets branching factor proportional to next-token entropy, achieving better accuracy per expansion than fixed beam search while providing a proof that monotone entropy-based branching outperforms any fi...

  35. Positional LSH: Binary Block Matrix Approximation for Attention with Linear Biases

    cs.LG 2026-05 unverdicted novelty 7.0

    ALiBi bias is the expectation of positional LSH-induced block masks, yielding spectral and max-norm approximation bounds that reduce long-context biased attention to randomized short-context unbiased attention.

  36. Theoretical Limits of Language Model Alignment

    cs.LG 2026-05 unverdicted novelty 7.0

    The maximum reward gain under KL-regularized LM alignment is a Jeffreys divergence term, estimable as covariance from base samples, with best-of-N approaching the theoretical limit.

  37. VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing

    cs.CL 2026-05 unverdicted novelty 7.0

    VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conve...

  38. The First Token Knows: Single-Decode Confidence for Hallucination Detection

    cs.CL 2026-05 unverdicted novelty 7.0

    First-token normalized entropy (phi_first) from one greedy decode reaches mean AUROC 0.820 for hallucination detection, matching or exceeding semantic self-consistency (0.793) and surface self-consistency (0.791) acro...

  39. Delta-Based Neural Architecture Search: LLM Fine-Tuning via Code Diffs

    cs.LG 2026-05 unverdicted novelty 7.0

    Fine-tuned 7B LLMs generating unified diffs for neural architecture refinement achieve 66-75% valid rates and 64-66% mean first-epoch accuracy, outperforming full-generation baselines by large margins while cutting ou...

  40. Leveraging Pretrained Language Models as Energy Functions for Glauber Dynamics Text Diffusion

    cs.LG 2026-05 unverdicted novelty 7.0

    Pretrained language models are used as energy functions for Glauber dynamics in discrete text diffusion, improving generation quality over prior diffusion LMs and matching autoregressive models on benchmarks and reaso...

  41. FinSTaR: Towards Financial Reasoning with Time Series Reasoning Models

    cs.AI 2026-05 conditional novelty 7.0

    FinSTaR reaches 78.9% accuracy on a new financial time series reasoning benchmark by applying Compute-in-CoT for deterministic assessments and Scenario-Aware CoT for stochastic predictions.

  42. The Right Answer, the Wrong Direction: Why Transformers Fail at Counting and How to Fix It

    cs.LG 2026-05 unverdicted novelty 7.0

    Transformers encode counts correctly internally but fail to read them out due to misalignment with digit output directions, fixable by updating 37k output parameters or small LoRA on attention.

  43. The Right Answer, the Wrong Direction: Why Transformers Fail at Counting and How to Fix It

    cs.LG 2026-05 accept novelty 7.0

    Transformers store count information internally but cannot read it out as digits due to near-orthogonal alignment with output-head rows; updating digit rows or applying LoRA to attention layers improves constrained an...

  44. How Language Models Process Negation

    cs.CL 2026-05 unverdicted novelty 7.0

    LLMs implement both attention-based suppression and constructive representations for negation, with construction dominant, despite poor accuracy from late-layer attention shortcuts.

  45. DurableUn: Quantization-Induced Recovery Attacks in Machine Unlearning

    cs.LG 2026-05 unverdicted novelty 7.0

    INT4 quantization recovers forgotten data in unlearned LLMs up to 22x, exposing a trilemma with no existing method solving forgetting, utility, and robustness together; a new sharpness-aware method achieves cross-prec...

  46. Embedding-based In-Context Prompt Training for Enhancing LLMs as Text Encoders

    cs.CL 2026-05 unverdicted novelty 7.0

    EPIC trains LLMs to treat continuous embeddings as in-context prompts, yielding state-of-the-art text embedding performance on MTEB with or without prompts at inference and lower compute.

  47. A Multi-View Media Profiling Suite: Resources, Evaluation, and Analysis

    cs.CL 2026-05 unverdicted novelty 7.0

    Presents MBFC-2025 dataset and multi-view embeddings with fusion methods for media bias and factuality, reporting SOTA results on ACL-2020 and new benchmarks on MBFC-2025.

  48. ReLay: Personalized LLM-Generated Plain-Language Summaries for Better Understanding, but at What Cost?

    cs.CL 2026-05 unverdicted novelty 7.0

    Personalized LLM-generated plain language summaries improve lay readers' comprehension and quality ratings but increase risks of reinforcing biases and introducing hallucinations compared to static expert summaries.

  49. Attention Is Where You Attack

    cs.CR 2026-04 unverdicted novelty 7.0

    ARA jailbreaks safety-aligned LLMs like LLaMA-3 and Mistral by redirecting attention in safety-heavy heads with as few as 5 tokens, achieving 30-36% attack success while ablating the same heads barely affects refusals.

  50. One Pass, Any Order: Position-Invariant Listwise Reranking for LLM-Based Recommendation

    cs.IR 2026-04 conditional novelty 7.0

    InvariRank achieves permutation-invariant listwise reranking for LLM-based recommendations via a structured attention mask that blocks cross-candidate interactions and shared positional framing under RoPE, enabling st...

  51. Rethinking KV Cache Eviction via a Unified Information-Theoretic Objective

    cs.LG 2026-04 unverdicted novelty 7.0

    KV cache eviction is unified under an information capacity maximization principle derived from a linear-Gaussian attention surrogate, with CapKV proposed as a leverage-score based implementation that outperforms prior...

  52. XGRAG: A Graph-Native Framework for Explaining KG-based Retrieval-Augmented Generation

    cs.AI 2026-04 unverdicted novelty 7.0

    XGRAG uses graph perturbations to quantify component contributions in GraphRAG and achieves 14.81% better explanation quality than text-based baselines on QA datasets, with correlations to graph centrality.

  53. Coverage-Based Calibration for Post-Training Quantization via Weighted Set Cover over Outlier Channels

    cs.LG 2026-04 conditional novelty 7.0

    COVERCAL selects PTQ calibration samples via weighted set cover over outlier channels, with a stylized clipping model showing missed coverage upper-bounds surrogate loss, yielding gains over random and other baselines...

  54. Can an MLP Absorb Its Own Skip Connection?

    cs.LG 2026-04 accept novelty 7.0

    Skip-connected MLPs and residual-free MLPs of equal width represent generically disjoint function classes for common activations, with explicit impossibility proofs and a non-generic absorption condition for ReLU and GELU.

  55. Supernodes and Halos: Loss-Critical Hubs in LLM Feed-Forward Layers

    cs.LG 2026-04 unverdicted novelty 7.0

    In LLM feed-forward networks, the top 1% of channels per layer carry a median 58.7% of loss sensitivity, forming supernodes whose protection enables effective 50% sparsity pruning with much lower perplexity than baselines.

  56. Fine-tuning vs. In-context Learning in Large Language Models: A Formal Language Learning Perspective

    cs.CL 2026-04 unverdicted novelty 7.0

    Fine-tuning shows higher proficiency than in-context learning on in-distribution generalization in formal languages, with equal out-of-distribution performance and diverging inductive biases at high proficiency.

  57. Fine-tuning vs. In-context Learning in Large Language Models: A Formal Language Learning Perspective

    cs.CL 2026-04 conditional novelty 7.0

    A controlled formal language task reveals fine-tuning outperforms in-context learning on in-distribution generalization but equals it on out-of-distribution, with ICL showing greater sensitivity to model size and toke...

  58. Evaluating Temporal Consistency in Multi-Turn Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    Language models frequently violate temporal scope stability in multi-turn dialogues by drifting toward present-day assumptions even when they possess the correct facts.

  59. Provably Secure Steganography Based on List Decoding

    cs.CR 2026-04 conditional novelty 7.0

    List decoding enables a provably secure steganography scheme with higher embedding capacity for LLMs via candidate sets and suffix matching.

  60. Serialisation Strategy Matters: How FHIR Data Format Affects LLM Medication Reconciliation

    cs.CL 2026-04 conditional novelty 7.0

    Clinical narrative format beats raw JSON for LLMs up to 8B parameters on medication reconciliation but raw JSON wins at 70B scale, with omissions as the main error type.

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · cited by 456 Pith papers · 21 internal anchors

  1. [1]

    GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

    Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245, 2023

  2. [2]

    Program Synthesis with Large Language Models

    Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021

  3. [3]

    Longformer: The Long-Document Transformer

    Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020

  4. [4]

    Piqa: Reasoning about phys- ical commonsense in natural language

    Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. Piqa: Reasoning about phys- ical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, 2020

  5. [5]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021

  6. [6]

    Generating Long Sequences with Sparse Transformers

    Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019

  7. [7]

    QuAC : Question Answering in Context

    Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. Quac: Question answering in context. arXiv preprint arXiv:1808.07036, 2018

  8. [8]

    BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

    Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044, 2019

  9. [9]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018

  10. [10]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

  11. [11]

    FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

    Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, 2022

  12. [12]

    Measuring Massive Multitask Language Understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020

  13. [13]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021

  14. [14]

    An empirical analysis of compute-optimal large language model training

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Thomas Hennigan, Eric Noland, Katherine Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karén Simonyan, Erich Elsen, Oriol Vinyals, Jack Rae, and Laurent S...

  15. [15]

    TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

    Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551, 2017

  16. [16]

    Natural questions: a benchmark for question answering research

    Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466, 2019

  17. [17]

    Efficient Memory Management for Large Language Model Serving with PagedAttention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

  18. [18]

    xformers: A modular and hackable transformer modelling library

    Benjamin Lefaudeux, Francisco Massa, Diana Liskovich, Wenhan Xiong, Vittorio Caggiano, Sean Naren, Min Xu, Jieru Hu, Marta Tintore, Susan Zhang, Patrick Labatut, and Daniel Haziza. xformers: A modular and hackable transformer modelling library. https://github.com/facebookresearch/xformers, 2022

  19. [19]

    Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering

    Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. arXiv preprint arXiv:1809.02789, 2018

  20. [20]

    Code Llama: Open Foundation Models for Code

    Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023

  21. [21]

    Winogrande: An adversarial winograd schema challenge at scale

    Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106, 2021

  22. [22]

    SocialIQA: Commonsense Reasoning about Social Interactions

    Maarten Sap, Hannah Rashkin, Derek Chen, Ronan LeBras, and Yejin Choi. Socialiqa: Commonsense reasoning about social interactions. arXiv preprint arXiv:1904.09728, 2019

  23. [23]

    Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

    Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, and Jason Wei. Challenging big-bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022

  24. [24]

    CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge

    Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. Commonsenseqa: A question answering challenge targeting commonsense knowledge. arXiv preprint arXiv:1811.00937, 2018

  25. [25]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

  26. [26]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

  27. [27]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

  28. [28]

    HellaSwag: Can a Machine Really Finish Your Sentence?

    Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019

  29. [29]

    AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models

    Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan. Agieval: A human-centric benchmark for evaluating foundation models. arXiv preprint arXiv:2304.06364, 2023