pith. machine review for the scientific record.

arxiv: 2408.00118 · v3 · submitted 2024-07-31 · 💻 cs.CL · cs.AI

Recognition: no theorem link

Gemma 2: Improving Open Language Models at a Practical Size

Gemma Team: Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari
Alexandre Ramé Johan Ferret Peter Liu Pouya Tafti Abe Friesen Michelle Casbon Sabela Ramos Ravin Kumar Charline Le Lan Sammy Jerome Anton Tsitsulin Nino Vieillard Piotr Stanczyk Sertan Girgin Nikola Momchev Matt Hoffman Shantanu Thakoor Jean-Bastien Grill Behnam Neyshabur Olivier Bachem Alanna Walton Aliaksei Severyn Alicia Parrish Aliya Ahmad Allen Hutchison Alvin Abdagic Amanda Carl Amy Shen Andy Brock Andy Coenen Anthony Laforge Antonia Paterson Ben Bastian Bilal Piot Bo Wu Brandon Royal Charlie Chen Chintu Kumar Chris Perry Chris Welty Christopher A. Choquette-Choo Danila Sinopalnikov David Weinberger Dimple Vijaykumar Dominika Rogozińska Dustin Herbison Elisa Bandy Emma Wang Eric Noland Erica Moreira Evan Senter Evgenii Eltyshev Francesco Visin Gabriel Rasskin Gary Wei Glenn Cameron Gus Martins Hadi Hashemi Hanna Klimczak-Plucińska Harleen Batra Harsh Dhand Ivan Nardini Jacinda Mein Jack Zhou James Svensson Jeff Stanway Jetha Chan Jin Peng Zhou Joana Carrasqueira Joana Iljazi Jocelyn Becker Joe Fernandez Joost van Amersfoort Josh Gordon Josh Lipschultz Josh Newlan Ju-yeong Ji Kareem Mohamed Kartikeya Badola Kat Black Katie Millican Keelin McDonell Kelvin Nguyen Kiranbir Sodhia Kish Greene Lars Lowe Sjoesund Lauren Usui Laurent Sifre Lena Heuermann Leticia Lago Lilly McNealus Livio Baldini Soares Logan Kilpatrick Lucas Dixon Luciano Martins Machel Reid Manvinder Singh Mark Iverson Martin Görner Mat Velloso Mateo Wirth Matt Davidow Matt Miller Matthew Rahtz Matthew Watson Meg Risdal Mehran Kazemi Michael Moynihan Ming Zhang Minsuk Kahng Minwoo Park Mofi Rahman Mohit Khatwani Natalie Dao Nenshad Bardoliwalla Nesh Devanathan Neta Dumai Nilay Chauhan Oscar Wahltinez Pankil Botarda Parker Barnes Paul Barham Paul Michel Pengchong Jin Petko Georgiev Phil Culliton Pradeep Kuppala Ramona Comanescu Ramona Merhej Reena Jana Reza Ardeshir Rokni Rishabh Agarwal Ryan Mullins Samaneh Saadat Sara Mc Carthy Sarah Cogan Sarah Perrin Sébastien M. R. Arnold Sebastian Krause Shengyang Dai Shruti Garg Shruti Sheth Sue Ronstrom Susan Chan Timothy Jordan Ting Yu Tom Eccles Tom Hennigan Tomas Kocisky Tulsee Doshi Vihan Jain Vikas Yadav Vilobh Meshram Vishal Dharmadhikari Warren Barkley Wei Wei Wenming Ye Woohyun Han Woosuk Kwon Xiang Xu Zhe Shen Zhitao Gong Zichuan Wei Victor Cotruta Phoebe Kirk Anand Rao Minh Giang Ludovic Peran Tris Warkentin Eli Collins Joelle Barral Zoubin Ghahramani Raia Hadsell D. Sculley Jeanine Banks Anca Dragan Slav Petrov Oriol Vinyals Jeff Dean Demis Hassabis Koray Kavukcuoglu Clement Farabet Elena Buchatskaya Sebastian Borgeaud Noah Fiedel Armand Joulin Kathleen Kenealy Robert Dadashi Alek Andreev
Authors on Pith: no claims yet

Pith reviewed 2026-05-10 12:06 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords open language models · Gemma 2 · knowledge distillation · local-global attention · group-query attention · transformer architecture · model scaling · performance benchmarks

The pith

Gemma 2 models achieve leading performance at their sizes through interleaved local-global attention, group-query attention, and knowledge distillation for the smaller variants.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Gemma 2 as an updated family of open language models with 2 billion to 27 billion parameters. It incorporates interleaving of local and global attention layers together with group-query attention in the Transformer backbone, while training the 2B and 9B versions via knowledge distillation rather than standard next-token prediction. These changes produce models that lead their size class on benchmarks and remain competitive with models two to three times larger. A reader would care because the work demonstrates concrete ways to extract more capability from models that fit on everyday hardware and can be released openly.
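The distillation recipe for the smaller variants can be sketched in a few lines. This is a minimal illustration of the general technique (Hinton et al., 2015), not the paper's implementation: the function names and toy logits below are invented for the example, and Gemma 2 applies the loss over a full vocabulary with a large teacher model.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distillation_loss(student_logits, teacher_logits):
    """KL(teacher || student) for one token position.

    Hard-label next-token prediction uses a one-hot target; distillation
    replaces it with the teacher's full distribution, so the student gets
    a gradient signal on every vocabulary entry, not just the gold token.
    """
    p = softmax(teacher_logits)   # teacher distribution (soft target)
    q = softmax(student_logits)   # student distribution
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy 4-token vocabulary: the loss is zero when the student matches the
# teacher exactly, and positive otherwise.
teacher = [2.0, 0.5, -1.0, 0.1]
matched = distillation_loss(teacher, teacher)
off = distillation_loss([0.0, 0.0, 0.0, 0.0], teacher)
print(matched, off > 0)
```

In a real training loop this loss is averaged over sequence positions and minimized with respect to the student logits only.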

Core claim

The authors establish that applying interleaved local-global attention and group-query attention across the model family, plus knowledge distillation for the 2B and 9B models, yields the best performance at each size and makes the models competitive alternatives to systems two to three times larger.

What carries the argument

The central mechanisms are the interleaving of local and global attention patterns within the Transformer layers combined with group-query attention, along with knowledge distillation applied specifically to the 2 billion and 9 billion parameter models.
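The interleaving can be pictured as attention masks that alternate across the layer stack: local layers restrict each position to a sliding window, global layers keep full causal attention. A minimal numpy sketch, using toy window and depth values rather than Gemma 2's actual configuration:

```python
import numpy as np

def causal_mask(seq_len):
    # Global attention: each position attends to all earlier positions.
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def sliding_window_mask(seq_len, window):
    # Local attention: each position attends only to the last `window`
    # positions (itself included), still causally.
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

# Interleaving: alternate local and global masks down the layer stack.
# Window size 4 and depth 6 are toy values for illustration only.
masks = [sliding_window_mask(8, 4) if layer % 2 == 0 else causal_mask(8)
         for layer in range(6)]

print(masks[0][7].sum(), masks[1][7].sum())  # local row sees 4 tokens, global row sees 8
```

The local layers bound per-layer attention cost and KV-cache growth by the window size, while the interleaved global layers preserve long-range information flow.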

If this is right

  • Open models at practical sizes can now substitute for much larger ones in many applications.
  • Hardware with modest memory can host capable language models without major quality loss.
  • Releasing the full range from 2B to 27B parameters widens access to high-performing open systems.
  • The same set of changes can be tested on future model scales to check if the efficiency pattern holds.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach may encourage other developers to prioritize attention-pattern changes over simply adding parameters when resources are constrained.
  • Wider adoption could shift industry focus toward measuring performance per parameter rather than raw scale alone.
  • If the gains replicate across different training runs, they would support using these modifications as a standard baseline for new open models.

Load-bearing premise

The reported gains in performance come from the listed architectural modifications and the switch to distillation rather than from unreported differences in training data volume, compute budget, or evaluation setup.

What would settle it

A controlled experiment that trains identical model sizes with the same data and compute but removes the local-global interleaving and group-query attention would show whether the performance edge disappears on the same benchmarks.
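Such an ablation could be organized as a run grid in which each variant toggles exactly one modification off while data and compute stay fixed. A hypothetical sketch; none of these configuration names come from the paper:

```python
from itertools import product

# Hypothetical ablation grid: every variant would train on identical data
# and compute; only the architectural/training switches differ.
switches = {
    "local_global_interleaving": [True, False],
    "group_query_attention": [True, False],
    "knowledge_distillation": [True, False],
}

runs = [dict(zip(switches, combo)) for combo in product(*switches.values())]
baseline = {name: True for name in switches}          # the released recipe
# Single-switch ablations: exactly one modification removed at a time.
ablations = [r for r in runs if sum(r.values()) == len(switches) - 1]

print(len(runs), len(ablations))  # 8 total variants, 3 single-switch ablations
```

Comparing each single-switch ablation against the baseline on the same benchmarks is what would attribute (or fail to attribute) the gains to the listed techniques.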

read the original abstract

In this work, we introduce Gemma 2, a new addition to the Gemma family of lightweight, state-of-the-art open models, ranging in scale from 2 billion to 27 billion parameters. In this new version, we apply several known technical modifications to the Transformer architecture, such as interleaving local-global attentions (Beltagy et al., 2020a) and group-query attention (Ainslie et al., 2023). We also train the 2B and 9B models with knowledge distillation (Hinton et al., 2015) instead of next token prediction. The resulting models deliver the best performance for their size, and even offer competitive alternatives to models that are 2-3 times bigger. We release all our models to the community.
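The group-query attention cited in the abstract (Ainslie et al., 2023) lets several query heads share one key/value head, which shrinks the KV cache during inference. A minimal numpy sketch with invented head counts; causal masking is omitted for brevity:

```python
import numpy as np

def group_query_attention(q, k, v, num_kv_heads):
    """Toy group-query attention: num_q_heads query heads share
    num_kv_heads key/value heads (num_q_heads must be divisible)."""
    num_q_heads, seq_len, head_dim = q.shape
    group = num_q_heads // num_kv_heads
    out = np.empty_like(q)
    for h in range(num_q_heads):
        kv = h // group                      # which shared KV head to use
        scores = q[h] @ k[kv].T / np.sqrt(head_dim)
        scores = scores - scores.max(axis=-1, keepdims=True)
        w = np.exp(scores)
        w = w / w.sum(axis=-1, keepdims=True)
        out[h] = w @ v[kv]
    return out

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 5, 16))   # 8 query heads
k = rng.normal(size=(2, 5, 16))   # only 2 KV heads -> 4x smaller KV cache
v = rng.normal(size=(2, 5, 16))
print(group_query_attention(q, k, v, num_kv_heads=2).shape)  # (8, 5, 16)
```

The output keeps one slice per query head, so model quality degrades little while the cached keys and values shrink by the grouping factor.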

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces the Gemma 2 family of open language models (2B, 9B, and 27B parameters). It applies known Transformer modifications including interleaving local-global attention and group-query attention, and trains the 2B and 9B models via knowledge distillation rather than next-token prediction. The central claim is that the resulting models achieve the best performance for their size and remain competitive with models 2-3 times larger; all models are released openly.

Significance. If the benchmark results are robust, the work supplies practically useful open models that advance the performance frontier at smaller scales, with the public release of weights enabling reproducibility and downstream research. This is a concrete contribution to accessible LLM development.

major comments (2)
  1. [Sections 2–3] The architectural changes (interleaved local-global attention, group-query attention) and switch to knowledge distillation for the 2B/9B models are described at a high level, yet no ablation experiments are reported that hold data mixture, token count, and compute fixed while removing one modification at a time. This leaves the attribution of reported benchmark gains to the listed techniques unsecured, as the central performance claim could be driven by undisclosed differences in pretraining data or scale.
  2. [Results section] Training data is characterized only qualitatively (web, code, math) with no token counts, source proportions, or direct comparison to the Gemma 1 mixture. Without these details or controlled ablations, it is impossible to isolate the contribution of the architectural and distillation choices from data effects, which routinely produce benchmark deltas of the reported magnitude.
minor comments (1)
  1. Ensure all benchmark tables include the exact evaluation protocols, number of runs, and any variance measures so that comparisons to 2–3× larger models can be reproduced.

Simulated Author's Rebuttal

2 responses · 1 unresolved

Thank you for your review and the constructive feedback on our Gemma 2 manuscript. We address the major comments point by point below, clarifying the scope of our contributions while noting where revisions can strengthen the presentation.

read point-by-point responses
  1. Referee: [Sections 2–3] The architectural changes (interleaved local-global attention, group-query attention) and switch to knowledge distillation for the 2B/9B models are described at a high level, yet no ablation experiments are reported that hold data mixture, token count, and compute fixed while removing one modification at a time. This leaves the attribution of reported benchmark gains to the listed techniques unsecured, as the central performance claim could be driven by undisclosed differences in pretraining data or scale.

    Authors: We agree that the absence of component-wise ablations with fixed data, tokens, and compute makes it difficult to isolate the contribution of each individual change. The manuscript presents the Gemma 2 models as a practical integration of established techniques (interleaved local-global attention, group-query attention, and knowledge distillation for the smaller variants), with the central contribution being the resulting performance at these scales and the public release of the weights. We did not perform the requested ablations, as they fall outside the primary goal of delivering and evaluating the final models. In revision we will add explicit language in Sections 2–3 stating that performance gains reflect the combined system and that controlled ablations remain an avenue for future work. revision: partial

  2. Referee: [Results section] Training data is characterized only qualitatively (web, code, math) with no token counts, source proportions, or direct comparison to the Gemma 1 mixture. Without these details or controlled ablations, it is impossible to isolate the contribution of the architectural and distillation choices from data effects, which routinely produce benchmark deltas of the reported magnitude.

    Authors: We acknowledge that qualitative descriptions alone leave open the possibility that data differences contribute to the observed gains. Gemma 2 uses an updated mixture that retains the core web, code, and math sources from Gemma 1 while increasing the proportion of high-quality mathematical and code data. Exact token counts and source proportions cannot be released for proprietary and competitive reasons. In the revised manuscript we will expand the data description in the Results section to include a qualitative comparison with the Gemma 1 mixture and to note that the architectural and distillation choices were applied on top of this updated data regime. revision: partial

standing simulated objections (not resolved)
  • Exact token counts, source proportions, and quantitative comparison tables for the pretraining data mixture, which cannot be disclosed due to proprietary constraints.

Circularity Check

0 steps flagged

No derivation chain present; empirical model release

full rationale

The paper introduces Gemma 2 models by describing the application of established techniques (interleaved local-global attention, group-query attention, and knowledge distillation) and reports benchmark performance. No equations, predictions, or first-principles derivations are claimed or present in the provided text. All cited methods are external (Beltagy et al., Ainslie et al., Hinton et al.), and results are measured against independent benchmarks with models released openly. The work contains no self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations that reduce the central claim to its own inputs.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard Transformer assumptions plus the effectiveness of the cited modifications; no new entities are postulated and the free parameters are the usual training hyperparameters and data choices that are not enumerated in the abstract.

free parameters (2)
  • model scale choices
    Selection of 2B, 9B, and 27B parameter counts as practical sizes
  • training hyperparameters
    Learning rates, batch sizes, and distillation temperatures not specified in abstract
axioms (2)
  • domain assumption Standard Transformer attention and feed-forward blocks remain effective when modified with local-global interleaving and group-query attention
    Invoked by citing Beltagy et al. and Ainslie et al. without re-derivation
  • domain assumption Knowledge distillation improves smaller models over next-token prediction alone
    Cited from Hinton et al. and applied to 2B/9B variants

pith-pipeline@v0.9.0 · 6321 in / 1432 out tokens · 29269 ms · 2026-05-10T12:06:11.309894+00:00 · methodology

discussion (0)


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Masked Generative Transformer Is What You Need for Image Editing

    cs.CV 2026-05 unverdicted novelty 8.0

    EditMGT applies masked generative transformers with attention consolidation and region-hold sampling to deliver state-of-the-art localized image editing at 6x the speed of diffusion methods.

  2. Acceptance Cards: A Four-Diagnostic Standard for Safe Fine-Tuning Defense Claims

    cs.CR 2026-05 unverdicted novelty 8.0

    Acceptance Cards is a new four-diagnostic standard for safe fine-tuning defense claims that requires statistical reliability, fresh semantic generalization, mechanism alignment, and cross-task transfer; under this pro...

  3. SLAM: Structural Linguistic Activation Marking for Language Models

    cs.CL 2026-05 unverdicted novelty 8.0

    SLAM achieves 100% detection accuracy on Gemma-2 models with only 1-2 points of quality loss by causally steering SAE-identified structural directions while preserving lexical sampling and semantics.

  4. SLAM: Structural Linguistic Activation Marking for Language Models

    cs.CL 2026-05 unverdicted novelty 8.0

    SLAM achieves 100% detection on Gemma-2 models with only 1-2 point quality cost by causally steering SAE-identified residual-stream directions for linguistic structure.

  5. SecGoal: A Benchmark for Security Goal Extraction and Formalization from Protocol Documents

    cs.CR 2026-04 unverdicted novelty 8.0

    The paper presents SecGoal, the first expert-annotated benchmark for security goal extraction from protocol documents, and demonstrates that fine-tuned 7B/9B parameter models achieve over 80% F1 score, outperforming l...

  6. ArgBench: Benchmarking LLMs on Computational Argumentation Tasks

    cs.CL 2026-04 unverdicted novelty 8.0

    ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.

  7. LiveBench: A Challenging, Contamination-Limited LLM Benchmark

    cs.CL 2024-06 unverdicted novelty 8.0

    LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.

  8. Realtime-VLA FLASH: Speculative Inference Framework for Diffusion-based VLAs

    cs.RO 2026-05 unverdicted novelty 7.0

    A new speculative inference system speeds up diffusion VLAs to 19.1 ms average latency (3.04x faster) on LIBERO by replacing most full 58 ms inferences with 7.8 ms draft rounds while preserving task performance.

  9. Uncovering Symmetry Transfer in Large Language Models via Layer-Peeled Optimization

    math.OC 2026-05 conditional novelty 7.0

    Symmetries in next-token prediction targets induce corresponding geometric symmetries such as circulant matrices and equiangular tight frames in the optimal weights and embeddings of a layer-peeled LLM surrogate model.

  10. Towards Automated Air Traffic Safety Assessment Around Non-Towered Airports Using Large Language Models

    cs.AI 2026-05 unverdicted novelty 7.0

    Large language models achieve macro F1 scores above 0.85 on binary nominal-versus-danger classification from CTAF radio transcripts and METAR weather data using a new synthetic dataset with a 12-category hazard taxonomy.

  11. ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 7.0

    ALAM creates algebraically consistent latent action transitions from videos to act as auxiliary generative targets, raising robot policy success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.

  12. Cross-Family Universality of Behavioral Axes via Anchor-Projected Representations

    cs.AI 2026-05 unverdicted novelty 7.0

    Behavioral directions from one LLM family transfer to others via projection into a shared anchor coordinate space, yielding 0.83 ten-way detection accuracy and steering effects up to 0.46% on held-out models.

  13. Accelerating Zeroth-Order Spectral Optimization with Partial Orthogonalization from Power Iteration

    cs.LG 2026-05 unverdicted novelty 7.0

    Partial orthogonalization from power iteration accelerates zeroth-order Muon by 1.5x-4x on LLM fine-tuning tasks while maintaining competitive accuracy.

  14. PLOT: Progressive Localization via Optimal Transport in Neural Causal Abstraction

    cs.LG 2026-05 unverdicted novelty 7.0

    PLOT localizes causal variables in neural networks by fitting optimal transport couplings between abstract and neural intervention effect geometries, enabling fast handles or guided search.

  15. Beyond Factor Aggregation: Gauge-Aware Low-Rank Server Representations for Federated LoRA

    cs.LG 2026-05 unverdicted novelty 7.0

    GLoRA replaces raw factor averaging with gauge-aware aggregation in a consensus subspace estimated from client projectors, enabling consistent low-rank federated LoRA under heterogeneity.

  16. Implicit Representations of Grammaticality in Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    Linear probes on LM hidden states detect grammaticality better than string probabilities, generalize to human benchmarks and other languages, and correlate weakly with likelihood.

  17. FinSTaR: Towards Financial Reasoning with Time Series Reasoning Models

    cs.AI 2026-05 conditional novelty 7.0

    FinSTaR reaches 78.9% accuracy on a new financial time series reasoning benchmark by applying Compute-in-CoT for deterministic assessments and Scenario-Aware CoT for stochastic predictions.

  18. How Language Models Process Negation

    cs.CL 2026-05 unverdicted novelty 7.0

    LLMs implement both attention-based suppression and constructive representations for negation, with construction dominant, despite poor accuracy from late-layer attention shortcuts.

  19. Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring

    cs.SE 2026-05 unverdicted novelty 7.0

    Themis builds a multilingual benchmark and large preference dataset to train code reward models that score outputs on multiple criteria like correctness, efficiency, and style.

  20. Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring

    cs.SE 2026-05 unverdicted novelty 7.0

    Themis introduces the largest open code preference dataset with over 350k pairs and trains multilingual reward models from 600M to 32B parameters that support flexible multi-criteria scoring, with experiments showing ...

  21. E-MIA: Exam-Style Black-Box Membership Inference Attacks against RAG Systems

    cs.CR 2026-05 unverdicted novelty 7.0

    E-MIA converts document details into four types of exam questions and aggregates the RAG's answers into a membership score that separates member and non-member documents better than prior similarity-based or probe-bas...

  22. Auto-FlexSwitch: Efficient Dynamic Model Merging via Learnable Task Vector Compression

    cs.LG 2026-04 unverdicted novelty 7.0

    Auto-FlexSwitch achieves efficient dynamic model merging by decomposing task vectors into sparse masks, signs, and scalars, then making the compression learnable via gating and adaptive bit selection with KNN-based retrieval.

  23. Homogeneous Stellar Parameters from Heterogeneous Spectra with Deep Learning

    astro-ph.GA 2026-04 unverdicted novelty 7.0

    A single end-to-end Transformer model unifies stellar labels from heterogeneous spectroscopic surveys into a self-consistent scale without post-hoc recalibration.

  24. Fine-tuning vs. In-context Learning in Large Language Models: A Formal Language Learning Perspective

    cs.CL 2026-04 unverdicted novelty 7.0

    Fine-tuning shows higher proficiency than in-context learning on in-distribution generalization in formal languages, with equal out-of-distribution performance and diverging inductive biases at high proficiency.

  25. Why are all LLMs Obsessed with Japanese Culture? On the Hidden Cultural and Regional Biases of LLMs

    cs.CL 2026-04 unverdicted novelty 7.0

    LLMs exhibit a clear preference for Japanese culture when answering open cultural questions, with this bias emerging after supervised fine-tuning rather than during pre-training.

  26. How Much Is One Recurrence Worth? Iso-Depth Scaling Laws for Looped Language Models

    cs.LG 2026-04 unverdicted novelty 7.0

    A fitted iso-depth scaling law measures that one recurrence in looped transformers is worth r^0.46 unique blocks in validation loss.

  27. Are LLM Uncertainty and Correctness Encoded by the Same Features? A Functional Dissociation via Sparse Autoencoders

    cs.LG 2026-04 unverdicted novelty 7.0

    Uncertainty and correctness in LLMs are encoded by distinct feature populations, with suppression of confounded features improving accuracy and reducing entropy.

  28. MORPHOGEN: A Multilingual Benchmark for Evaluating Gender-Aware Morphological Generation

    cs.CL 2026-04 unverdicted novelty 7.0

    MORPHOGEN is a new multilingual benchmark for testing LLMs on gender-aware morphological generation via rewriting first-person sentences to the opposite gender in French, Arabic, and Hindi.

  29. LQM: Linguistically Motivated Multidimensional Quality Metrics for Machine Translation

    cs.CL 2026-04 unverdicted novelty 7.0

    LQM introduces a six-level linguistically motivated error taxonomy for MT evaluation and applies it via expert annotation to LLM outputs on a new 3,850-sentence multi-dialect Arabic corpus.

  30. Prune, Interpret, Evaluate: A Cross-Layer Transcoder-Native Framework for Efficient Circuit Discovery via Feature Attribution

    cs.CL 2026-04 unverdicted novelty 7.0

    PIE prunes CLT features first via FAP and FAP-Synergy to match baseline circuit fidelity at lower feature budgets on IOI and Doc-String tasks, reducing interpretation costs.

  31. Conjunctive Prompt Attacks in Multi-Agent LLM Systems

    cs.MA 2026-04 unverdicted novelty 7.0

    Conjunctive prompt attacks split adversarial elements across agents and routing paths in multi-agent LLM systems, evading isolated defenses and succeeding through topology-aware optimization.

  32. Response-Aware User Memory Selection for LLM Personalization

    cs.AI 2026-04 unverdicted novelty 7.0

    RUMS selects LLM user memory via mutual information with model outputs to reduce response uncertainty, outperforming similarity-based methods in human alignment and response quality with up to 95% lower cost.

  33. Ruling Out to Rule In: Contrastive Hypothesis Retrieval for Medical Question Answering

    cs.IR 2026-04 unverdicted novelty 7.0

    CHR improves medical question answering retrieval by explicitly promoting evidence aligned with a correct hypothesis while penalizing content aligned with a plausible incorrect alternative.

  34. MetaSAEs: Joint Training with a Decomposability Penalty Produces More Atomic Sparse Autoencoder Latents

    cs.LG 2026-04 conditional novelty 7.0

    Joint training of a primary SAE with a meta SAE that applies a decomposability penalty on decoder directions produces more atomic latents, shown by 7.5% lower mean absolute phi and 7.6% higher fuzzing scores on GPT-2.

  35. Mitigating Cross-Lingual Cultural Inconsistencies in LLMs via Consensus-Driven Preference Optimisation

    cs.CL 2026-04 unverdicted novelty 7.0

    Multilingual LLMs display cross-lingual cultural inconsistency that a new metric quantifies and a consensus-driven preference optimization method reduces by up to 0.10 points.

  36. WMF-AM: Probing LLM Working Memory via Depth-Parameterized Cumulative State Tracking

    cs.AI 2026-03 unverdicted novelty 7.0

    WMF-AM is a depth-parameterized benchmark that measures LLMs' cumulative state tracking ability without scratchpads, validated on 28 models across arithmetic and non-arithmetic tasks with ablations confirming the construct.

  37. DeEscalWild: A Real-World Benchmark for Automated De-Escalation Training with SLMs

    cs.CL 2026-03 unverdicted novelty 7.0

    DeEscalWild supplies 1,500 high-fidelity de-escalation scenarios that let fine-tuned 3B SLMs outperform general-purpose larger models on realism and dialogue metrics.

  38. The Stepwise Informativeness Assumption: Why are Entropy Dynamics and Reasoning Correlated in LLMs?

    cs.CL 2026-03 unverdicted novelty 7.0

    The Stepwise Informativeness Assumption explains the correlation between LLM entropy dynamics and reasoning correctness by positing that correct traces accumulate answer-relevant information stepwise during generation.

  39. PEEM: Prompt Engineering Evaluation Metrics for Interpretable Joint Evaluation of Prompts and Responses

    cs.CL 2026-03 unverdicted novelty 7.0

    PEEM is a multi-criteria LLM-based evaluator for prompts and responses that aligns with standard accuracy while enabling zero-shot prompt optimization via feedback.

  40. Training Agents Inside of Scalable World Models

    cs.AI 2025-09 conditional novelty 7.0

    Dreamer 4 is the first agent to obtain diamonds in Minecraft from only offline data by reinforcement learning inside a scalable world model that accurately predicts game mechanics.

  41. Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

    cs.LG 2025-02 unverdicted novelty 7.0

    A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.

  42. GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models

    cs.LG 2024-10 accept novelty 7.0

    LLMs display high variance and major accuracy drops on GSM-Symbolic variants of grade-school math problems, indicating they replicate training patterns rather than execute logical reasoning.

  43. Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents

    cs.CR 2024-10 unverdicted novelty 7.0

    ASB is a new benchmark that tests 10 prompt injection attacks, memory poisoning, a novel Plan-of-Thought backdoor attack, and 11 defenses on LLM agents across 13 models, finding attack success rates up to 84.3% and li...

  44. AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents

    cs.AI 2024-05 accept novelty 7.0

    AndroidWorld is a dynamic, reproducible Android benchmark that generates unlimited natural-language tasks for autonomous agents and shows current agents succeed on only 30.6 percent of them.

  45. Self-Pruned Key-Value Attention: Learning When to Write by Predicting Future Utility

    cs.LG 2026-05 unverdicted novelty 6.0

    SP-KV trains a utility predictor jointly with the LLM to dynamically prune low-utility KV cache entries, achieving 3-10x memory reduction during generation with negligible performance loss.

  46. Teacher-Guided Policy Optimization for LLM Distillation

    cs.LG 2026-05 unverdicted novelty 6.0

    TGPO improves on-policy LLM distillation by using teacher predictions conditioned on student rollouts to supply informative guidance when the two distributions diverge.

  47. Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy

    cs.LG 2026-05 unverdicted novelty 6.0

    Pretrained base models exhibit higher yield to peer disagreement than RLHF instruct variants, with the effect localized to mid-layer attention and mitigated by structured dissent rather than prompt defenses.

  48. ATD-Trans: A Geographically Grounded Japanese-English Travelogue Translation Dataset

    cs.CL 2026-05 conditional novelty 6.0

    ATD-Trans is a new geographically annotated Japanese-English travelogue dataset that reveals Japanese-enhanced models perform better on geo-entity translation while domestic Japanese locations remain harder to transla...

  49. Learning with Rare Success but Rich Feedback via Reflection-Enhanced Self-Distillation

    cs.LG 2026-05 unverdicted novelty 6.0

    RESD turns failure trajectories into token-level supervision via retrospective reflections and a persistent global playbook, enabling faster improvement than standard self-distillation or GRPO with only one rollout pe...

  50. Layer-wise Representation Dynamics: An Empirical Investigation Across Embedders and Base LLMs

    cs.LG 2026-05 unverdicted novelty 6.0

    LRD framework with Frenet, NRS, and GFMI metrics shows layer-wise structure in 31 models provides usable signal for model selection and pruning on MTEB tasks.

  51. Domain Restriction via Multi SAE Layer Transitions

    cs.AI 2026-05 unverdicted novelty 6.0

    Multi-layer SAE transitions capture domain-specific signatures that distinguish OOD texts in Gemma-2 models.

  52. From Token to Token Pair: Efficient Prompt Compression for Large Language Models in Clinical Prediction

    cs.CL 2026-05 unverdicted novelty 6.0

    MedTPE compresses EHR token sequences by up to 31% via merging common medical token pairs, reducing LLM inference latency 34-63% while maintaining or improving performance on mortality and phenotyping tasks.

  53. Causal Bias Detection in Generative Artifical Intelligence

    cs.AI 2026-05 unverdicted novelty 6.0

    A causal framework unifies fairness analysis across generative AI and standard ML by deriving decompositions that separate biases along causal pathways and differences between real-world and model mechanisms.

  54. Leveraging RAG for Training-Free Alignment of LLMs

    cs.LG 2026-05 unverdicted novelty 6.0

    RAG-Pref is a training-free RAG-based alignment technique that conditions LLMs on contrastive preference samples during inference, yielding over 3.7x average improvement in agentic attack refusals when combined with o...

  55. Hi-GaTA: Hierarchical Gated Temporal Aggregation Adapter for Surgical Video Report Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    Hi-GaTA is a gated temporal pyramid adapter that aggregates multi-scale video features via text-conditioned cross-attention and gated fusion to enable LLM-based surgical report generation, backed by a new 214-video benchmark.

  56. ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    ALAM introduces algebraic consistency regularization on latent action transitions from videos, raising VLA success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.

  57. Causal Dimensionality of Transformer Representations: Measurement, Scaling, and Layer Structure

    cs.LG 2026-05 unverdicted novelty 6.0

    Causal dimensionality kappa of transformer layers grows sub-linearly with SAE width, remains invariant to model scale, and stays constant across depth while attribution thresholds drop sharply.

  58. SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation

    cs.CL 2026-05 unverdicted novelty 6.0

    SimCT recovers discarded teacher signal in cross-tokenizer on-policy distillation by enlarging supervision to jointly realizable multi-token continuations, yielding consistent gains on math reasoning and code generation.

  59. Don't Lose Focus: Activation Steering via Key-Orthogonal Projections

    cs.CL 2026-05 unverdicted novelty 6.0

    SKOP uses key-orthogonal projections to steer LLM activations while preserving attention patterns on focus tokens, cutting utility degradation by 5-7x and retaining over 95% of standard steering efficacy.

  60. Towards Generation-Efficient Uncertainty Estimation in Large Language Models

    cs.LG 2026-05 unverdicted novelty 6.0

    Uncertainty estimation for LLM hallucinations can be done effectively with partial generations or input-only predictors, reducing the need for full autoregressive sampling.

Reference graph

Works this paper leans on

129 extracted references · 129 canonical work pages · cited by 136 Pith papers · 27 internal anchors

  1. [2]

    Agarwal, R

    R. Agarwal, N. Vieillard, Y. Zhou, P. Stanczyk, S. R. Garea, M. Geist, and O. Bachem. On-policy distillation of language models: Learning from self-generated mistakes. In The Twelfth International Conference on Learning Representations, 2024

  2. [3]

    Llama 3 model card, 2024

    AI@Meta. Llama 3 model card, 2024. URL https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md

  3. [5]

    Almazrouei, E

    E. Almazrouei, H. Alobeidli, A. Alshamsi, A. Cappelli, R. Cojocaru, M. Debbah, Étienne Goffinet, D. Hesslow, J. Launay, Q. Malartic, D. Mazzotta, B. Noune, B. Pannier, and G. Penedo. The falcon series of open language models, 2023

  4. [8]

    Barham, P

    P. Barham, A. Chowdhery, J. Dean, S. Ghemawat, S. Hand, D. Hurt, M. Isard, H. Lim, R. Pang, S. Roy, B. Saeta, P. Schuh, R. Sepassi, L. E. Shafey, C. A. Thekkath, and Y. Wu. Pathways: Asynchronous distributed dataflow for ml, 2022

  5. [15]

    Chiang, W.-L.

    W.-L. Chiang, L. Zheng, Y. Sheng, A. N. Angelopoulos, T. Li, D. Li, H. Zhang, B. Zhu, M. Jordan, J. E. Gonzalez, and I. Stoica. Chatbot arena: An open platform for evaluating llms by human preference, 2024

  6. [18]

    Gemini: A family of highly capable multimodal models, 2023

    Gemini Team . Gemini: A family of highly capable multimodal models, 2023

  7. [19]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, 2024

    Gemini Team . Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, 2024

  8. [20]

    Gemma: Open models based on gemini research and technology, 2024

    Gemma Team . Gemma: Open models based on gemini research and technology, 2024

  9. [21]

    Y. Gu, L. Dong, F. Wei, and M. Huang. Minillm: Knowledge distillation of large language models. In The Twelfth International Conference on Learning Representations, 2024

  10. [26]

    A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed. Mistral 7b, 2023

  11. [27]

    Kahng, M

    M. Kahng, I. Tenney, M. Pushkarna, M. X. Liu, J. Wexler, E. Reif, K. Kallarackal, M. Chang, M. Terry, and L. Dixon. Llm comparator: Visual analytics for side-by-side evaluation of large language models, 2024. URL https://arxiv.org/abs/2402.10524

  12. [28]

    Evaluating language-model agents on realistic autonomous tasks

    M. Kinniment, L. J. K. Sato, H. Du, B. Goodrich, M. Hasin, L. Chan, L. H. Miles, T. R. Lin, H. Wijk, J. Burget, A. Ho, E. Barnes, and P. Christiano. Evaluating language-model agents on realistic autonomous tasks, 2024. URL https://arxiv.org/abs/2312.11671

  13. [32]

    Z. Lin, J. Cui, X. Liao, and X. Wang. Malla: Demystifying real-world large language model integrated malicious services, 2024. URL https://arxiv.org/abs/2401.03315

  14. [34]

    Personal Communication, 2024

    Macknight, Aung, and Gomes. Personal Communication, 2024

  15. [35]

    Towards agile text classifiers for everyone, 2023

    M. Mozes, J. Hoffmann, K. Tomanek, M. Kouate, N. Thain, A. Yuan, T. Bolukbasi, and L. Dixon. Towards agile text classifiers for everyone, 2023. URL https://arxiv.org/abs/2302.06541

  16. [37]

    Phuong, M

    M. Phuong, M. Aitchison, E. Catt, S. Cogan, A. Kaskasoli, V. Krakovna, D. Lindner, M. Rahtz, Y. Assael, S. Hodkinson, H. Howard, T. Lieberum, R. Kumar, M. A. Raad, A. Webson, L. Ho, S. Lin, S. Farquhar, M. Hutter, G. Deletang, A. Ruoss, S. El-Sayed, S. Brown, A. Dragan, R. Shah, A. Dafoe, and T. Shevlane. Evaluating frontier models for dangerous capabilities, 2024

  17. [38]

    Radford, A

    A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language models are unsupervised multitask learners, 2019

  18. [40]

    A. Ramé, J. Ferret, N. Vieillard, R. Dadashi, L. Hussenot, P.-L. Cedoz, P. G. Sessa, S. Girgin, A. Douillard, and O. Bachem. Warp: On the benefits of weight averaged rewarded policies, 2024

  19. [41]

    J. Ren, S. Rajbhandari, R. Y. Aminabadi, O. Ruwase, S. Yang, M. Zhang, D. Li, and Y. He. ZeRO-Offload: Democratizing billion-scale model training. In 2021 USENIX Annual Technical Conference (USENIX ATC 21), pages 551–564, 2021

  20. [42]

    Roberts, A

    A. Roberts, H. W. Chung, G. Mishra, A. Levskaya, J. Bradbury, D. Andor, S. Narang, B. Lester, C. Gaffney, A. Mohiuddin, et al. Scaling up models and data with t5x and seqio. Journal of Machine Learning Research, 24(377):1–8, 2023

  21. [45]

    Model evaluation for extreme risks, 2023

    T. Shevlane, S. Farquhar, B. Garfinkel, M. Phuong, J. Whittlestone, J. Leung, D. Kokotajlo, N. Marchal, M. Anderljung, N. Kolt, L. Ho, D. Siddarth, S. Avin, W. Hawkins, B. Kim, I. Gabriel, V. Bolina, J. Clark, Y. Bengio, P. Christiano, and A. Dafoe. Model evaluation for extreme risks, 2023. URL https://arxiv.org/abs/2305.15324

  22. [47]

    Suzgun, M

    M. Suzgun, N. Scales, N. Schärli, S. Gehrmann, Y. Tay, H. W. Chung, A. Chowdhery, Q. V. Le, E. H. Chi, D. Zhou, and J. Wei. Challenging big-bench tasks and whether chain-of-thought can solve them, 2022

  23. [48]

    Qwen Team. Introducing Qwen1.5, February 2024. URL https://qwenlm.github.io/blog/qwen1.5/

  24. [49]

    Tenney, I

    I. Tenney, J. Wexler, J. Bastings, T. Bolukbasi, A. Coenen, S. Gehrmann, E. Jiang, M. Pushkarna, C. Radebaugh, E. Reif, and A. Yuan. The language interpretability tool: Extensible, interactive visualizations and analysis for nlp models, 2020. URL https://arxiv.org/abs/2008.05122

  25. [50]

    Touvron, H

    H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample. Llama: Open and efficient foundation language models, 2023

  26. [53]

    grok-1, 2024

    xAI. grok-1, 2024. URL https://github.com/xai-org/grok-1

  27. [54]

    Xla: Optimizing compiler for tensorflow, 2019

    XLA. Xla: Optimizing compiler for tensorflow, 2019. URL https://www.tensorflow.org/xla

  28. [56]

    J. Yang, A. Prabhakar, K. Narasimhan, and S. Yao. Intercode: Standardizing and benchmarking interactive coding with execution feedback, 2023. URL https://arxiv.org/abs/2306.14898

  29. [59]

    Neural Combinatorial Optimization with Reinforcement Learning

    I. Bello, H. Pham, Q. V. Le, M. Norouzi, and S. Bengio. Neural combinatorial optimization with reinforcement learning. CoRR, 2016. URL https://arxiv.org/abs/1611.09940

  30. [60]

    Concrete Problems in AI Safety

    D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Mané. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016

  31. [61]

    Quantifying Memorization Across Neural Language Models

    N. Carlini, D. Ippolito, M. Jagielski, K. Lee, F. Tramèr, and C. Zhang. Quantifying memorization across neural language models. arXiv preprint arXiv:2202.07646, 2022

  32. [62]

    Scalable Extraction of Training Data from (Production) Language Models

    M. Nasr, N. Carlini, J. Hayase, M. Jagielski, A. F. Cooper, D. Ippolito, C. A. Choquette-Choo, E. Wallace, F. Tramèr, and K. Lee. Scalable extraction of training data from (production) language models. arXiv preprint arXiv:2311.17035, 2023

  33. [63]

    Extracting Training Data from Large Language Models

    N. Carlini, F. Tramèr, E. Wallace, M. Jagielski, A. Herbert-Voss, K. Lee, A. Roberts, T. Brown, D. Song, Ú. Erlingsson, A. Oprea, and C. Raffel. Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21), 2021

  34. [64]

    Preventing Verbatim Memorization in Language Models Gives a False Sense of Privacy

    Preventing verbatim memorization in language models gives a false sense of privacy. arXiv preprint arXiv:2210.17546, 2022

  35. [65]

    MADLAD-400: A Multilingual and Document-Level Large Audited Dataset

    MADLAD-400: A multilingual and document-level large audited dataset. arXiv preprint arXiv:2309.04662, 2023

  36. [66]

    Defining and Characterizing Reward Gaming

    Defining and characterizing reward gaming. In NeurIPS, 2022

  37. [67]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    Judging LLM-as-a-judge with MT-Bench and Chatbot Arena, 2023

  38. [68]

    Scaling Laws for Reward Model Overoptimization

    L. Gao, J. Schulman, and J. Hilton. Scaling laws for reward model overoptimization, 2022

  39. [69]

    A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks

    D. Hendrycks and K. Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136, 2016

  40. [70]

    GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

    J. Ainslie, J. Lee-Thorp, M. de Jong, Y. Zemlyanskiy, F. Lebrón, and S. Sanghai. GQA: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245, 2023

  41. [72]

    Training Compute-Optimal Large Language Models

    J. Hoffmann, S. Borgeaud, A. Mensch, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022

  42. [73]

    Mastering the Game of Go with Deep Neural Networks and Tree Search

    D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 2016

  43. [74]

    TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings

    N. P. Jouppi et al. TPU v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings. In Proceedings of the 50th Annual International Symposium on Computer Architecture, 2023

  44. [75]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team. Gemini: A family of highly capable multimodal models, 2023

  45. [76]

    Piqa: Reasoning about physical commonsense in natural language

    Y. Bisk, R. Zellers, R. Le Bras, J. Gao, and Y. Choi. PIQA: Reasoning about physical commonsense in natural language. CoRR, 2019. URL https://arxiv.org/abs/1911.11641

  46. [77]

    SocialIQA: Commonsense Reasoning about Social Interactions

    M. Sap, H. Rashkin, D. Chen, R. Le Bras, and Y. Choi. SocialIQA: Commonsense reasoning about social interactions. CoRR, 2019. URL https://arxiv.org/abs/1904.09728

  47. [78]

    BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

    C. Clark, K. Lee, M.-W. Chang, T. Kwiatkowski, M. Collins, and K. Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. CoRR, 2019. URL https://arxiv.org/abs/1905.10044

  48. [79]

    Natural Questions: A Benchmark for Question Answering Research

    T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee, K. Toutanova, L. Jones, M. Kelcey, M.-W. Chang, A. M. Dai, J. Uszkoreit, Q. Le, and S. Petrov. Natural Questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 2019. URL https://aclanthology.org/Q19-1026/

  49. [80]

    Measuring Massive Multitask Language Understanding

    D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding. CoRR, 2020. URL https://arxiv.org/abs/2009.03300

  50. [81]

    Program Synthesis with Large Language Models

    J. Austin, A. Odena, M. I. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. J. Cai, M. Terry, Q. V. Le, and C. Sutton. Program synthesis with large language models. CoRR, 2021. URL https://arxiv.org/abs/2108.07732

  51. [82]

    A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language models are unsupervised multitask learners, 2019

  52. [83]

    GPT-4 Technical Report

    OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  53. [84]

    Evaluating Large Language Models Trained on Code

    M. Chen, J. Tworek, H. Jun, Q. Yuan, H. Ponde de Oliveira Pinto, et al. Evaluating large language models trained on code. CoRR, 2021. URL https://arxiv.org/abs/2107.03374

  54. [85]

    Training Verifiers to Solve Math Word Problems

    K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman. Training verifiers to solve math word problems. CoRR, 2021. URL https://arxiv.org/abs/2110.14168

  55. [86]

    WinoGrande: An Adversarial Winograd Schema Challenge at Scale

    K. Sakaguchi, R. Le Bras, C. Bhagavatula, and Y. Choi. WinoGrande: An adversarial Winograd schema challenge at scale. CoRR, 2019. URL https://arxiv.org/abs/1907.10641

  56. [87]

    D. Paperno, G. Kruszewski, A. Lazaridou, Q. N. Pham, R. Bernardi, S. Pezzelle, M. Baroni, G. Boleda, and R. Fernández. The LAMBADA dataset: Word prediction requiring a broad discourse context. CoRR, 2016. URL https://arxiv.org/abs/1606.06031

  57. [88]

    TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

    M. Joshi, E. Choi, D. S. Weld, and L. Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. CoRR, 2017. URL https://arxiv.org/abs/1705.03551

  58. [89]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    H. Touvron et al. Llama 2: Open foundation and fine-tuned chat models, 2023

  59. [90]

    LLaMA: Open and Efficient Foundation Language Models

    H. Touvron et al. LLaMA: Open and efficient foundation language models, 2023

  60. [91]

    Mistral 7B

    A. Q. Jiang et al. Mistral 7B, 2023

  61. [92]

    The Falcon Series of Open Language Models

    E. Almazrouei et al. The Falcon series of open language models, 2023

  62. [93]

    Textbooks Are All You Need II: phi-1.5 technical report

    Y. Li, S. Bubeck, R. Eldan, A. Del Giorno, S. Gunasekar, and Y. T. Lee. Textbooks are all you need II: phi-1.5 technical report. arXiv preprint arXiv:2309.05463, 2023

  63. [94]

    Distilling the Knowledge in a Neural Network

    G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015

  64. [95]

    Sequence to Sequence Learning with Neural Networks

    I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. CoRR, 2014. URL https://arxiv.org/abs/1409.3215

  65. [96]

    Attention Is All You Need

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. CoRR, 2017. URL https://arxiv.org/abs/1706.03762

  66. [97]

    Deep Learning

    Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 2015

  67. [98]

    Pathways: Asynchronous Distributed Dataflow for ML

    P. Barham et al. Pathways: Asynchronous distributed dataflow for ML, 2022

  68. [99]

    Scaling Up Models and Data with t5x and seqio

    A. Roberts et al. Scaling up models and data with t5x and seqio. Journal of Machine Learning Research, 2023

  69. [100]

    XLA: Optimizing Compiler for TensorFlow

    XLA. XLA: Optimizing compiler for TensorFlow, 2019. URL https://www.tensorflow.org/xla

  70. [101]

    How Our Principles Helped Define AlphaFold’s Release

    DeepMind. How our principles helped define AlphaFold’s release, 2022

  71. [102]

    Large Scale Distributed Deep Networks

    J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, Q. Le, and A. Ng. Large scale distributed deep networks. In Advances in Neural Information Processing Systems, 2012

  72. [103]

    Efficient Estimation of Word Representations in Vector Space

    T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space, 2013

  73. [104]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, 2018. URL https://arxiv.org/abs/1810.04805

  74. [105]

    Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

    C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. CoRR, 2019. URL https://arxiv.org/abs/1910.10683

  75. [106]

    A. Roberts, H. W. Chung, A. Levskaya, G. Mishra, J. Bradbury, D. Andor, S. Narang, B. Lester, C. Gaffney, A. Mohiuddin, et al. Scaling up models and data with t5x and seqio, 2022

  76. [107]

    Fast Transformer Decoding: One Write-Head is All You Need

    N. Shazeer. Fast transformer decoding: One write-head is all you need. CoRR, 2019. URL https://arxiv.org/abs/1911.02150

  77. [108]

    RoFormer: Enhanced Transformer with Rotary Position Embedding

    J. Su, Y. Lu, S. Pan, B. Wen, and Y. Liu. RoFormer: Enhanced transformer with rotary position embedding. CoRR, 2021. URL https://arxiv.org/abs/2104.09864

  78. [109]

    ZeRO-Offload: Democratizing Billion-Scale Model Training

    J. Ren, S. Rajbhandari, R. Y. Aminabadi, O. Ruwase, S. Yang, M. Zhang, D. Li, and Y. He. ZeRO-Offload: Democratizing billion-scale model training. In 2021 USENIX Annual Technical Conference (USENIX ATC 21), pages 551–564, 2021

  79. [110]

    GLU Variants Improve Transformer

    N. Shazeer. GLU variants improve transformer. CoRR, 2020. URL https://arxiv.org/abs/2002.05202

  80. [111]

    Root Mean Square Layer Normalization

    B. Zhang and R. Sennrich. Root mean square layer normalization. CoRR, 2019. URL https://arxiv.org/abs/1910.07467

Showing first 80 references.