Gemma 2: Improving Open Language Models at a Practical Size

Cassidy Hardin, Gemma Team: Morgane Riviere, Pier Giuseppe Sessa, Shreya Pathak, Surya Bhupatiraju · 2024 · cs.CL · arXiv 2408.00118

Mixed citation behavior. Most common role is background (64%).

243 Pith papers citing it

Background 64% of classified citations

abstract

In this work, we introduce Gemma 2, a new addition to the Gemma family of lightweight, state-of-the-art open models, ranging in scale from 2 billion to 27 billion parameters. In this new version, we apply several known technical modifications to the Transformer architecture, such as interleaving local-global attentions (Beltagy et al., 2020a) and group-query attention (Ainslie et al., 2023). We also train the 2B and 9B models with knowledge distillation (Hinton et al., 2015) instead of next token prediction. The resulting models deliver the best performance for their size, and even offer competitive alternatives to models that are 2-3 times bigger. We release all our models to the community.
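The abstract's contrast between knowledge distillation and next token prediction can be illustrated with a minimal sketch. The function names (`log_softmax`, `distillation_loss`, `next_token_loss`) and the toy logits are assumptions made for illustration and are not taken from the paper; the sketch uses the standard per-token formulation of distillation (Hinton et al., 2015), where the student is trained against the teacher's full probability distribution rather than a single observed token.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_total = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_total for x in logits]


def distillation_loss(student_logits, teacher_logits):
    """Cross-entropy of the student against the teacher's soft distribution.

    Hypothetical helper: the target is the teacher's probability for every
    vocabulary entry, not a single one-hot token.
    """
    teacher_probs = [math.exp(lp) for lp in log_softmax(teacher_logits)]
    student_logp = log_softmax(student_logits)
    return -sum(t * s for t, s in zip(teacher_probs, student_logp))


def next_token_loss(student_logits, target_index):
    """Ordinary next-token loss: negative log-probability of the observed token."""
    return -log_softmax(student_logits)[target_index]


# Toy three-token vocabulary (illustrative values only).
student = [0.5, 1.0, 2.0]
teacher = [0.1, 1.5, 2.5]
soft_loss = distillation_loss(student, teacher)
hard_loss = next_token_loss(student, target_index=2)
```

When the teacher puts nearly all of its probability on one token, the distillation loss reduces to the ordinary next-token loss for that token, so the two objectives differ only in how much of the teacher's distribution over the remaining tokens is passed on to the student.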

citation-role summary

background 23 · method 6 · baseline 2 · dataset 1 · other 1

citation-polarity summary

background 21 · use method 6 · unclear 3 · baseline 2 · use dataset 1

authors

Cassidy Hardin, Gemma Team: Morgane Riviere, Léonard Hussenot, Pier Giuseppe Sessa, Shreya Pathak, Surya Bhupatiraju

representative citing papers

Masked Generative Transformer Is What You Need for Image Editing

cs.CV · 2026-05-11 · unverdicted · novelty 8.0

EditMGT applies masked generative transformers with attention consolidation and region-hold sampling to deliver state-of-the-art localized image editing at 6x the speed of diffusion methods.

Acceptance Cards: A Four-Diagnostic Standard for Safe Fine-Tuning Defense Claims

cs.CR · 2026-05-11 · unverdicted · novelty 8.0

Acceptance Cards is a new four-diagnostic standard for safe fine-tuning defense claims that requires statistical reliability, fresh semantic generalization, mechanism alignment, and cross-task transfer; under this protocol SafeLoRA fails the full-card pass on Gemma-2-2B-it.

SLAM: Structural Linguistic Activation Marking for Language Models

cs.CL · 2026-05-06 · unverdicted · novelty 8.0 · 2 refs

SLAM achieves 100% detection on Gemma-2 models with only 1-2 point quality cost by causally steering SAE-identified residual-stream directions for linguistic structure.

ArgBench: Benchmarking LLMs on Computational Argumentation Tasks

cs.CL · 2026-04-19 · unverdicted · novelty 8.0

ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.

LiveBench: A Challenging, Contamination-Limited LLM Benchmark

cs.CL · 2024-06-27 · unverdicted · novelty 8.0

LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.

NLL-Guided Full-Attention Layer Selection for Training-Free Sliding-Window Adaptation

cs.CL · 2026-06-26 · unverdicted · novelty 7.0

NLL-guided layer selection identifies 1/4 of layers for full attention in hybrid models, matching periodic 1/2-FA baseline accuracy on LongMemEval with Qwen3-4B while halving the full-attention compute budget.

Drop-Then-Recovery: How Redundant Are Vision-Language-Action Models?

cs.RO · 2026-06-26 · accept · novelty 7.0

VLA language backbones show high redundancy on manipulation benchmarks, with half the LLM blocks removable and even two blocks sufficient to recover baseline performance after fine-tuning, unlike vision and action pathways.

Applicability Condition Extraction for Therapeutic Drug-Disease Relations

cs.AI · 2026-06-12 · unverdicted · novelty 7.0

Introduces applicability condition extraction for therapeutic drug-disease relations, creates first annotated dataset of 1,119 pairs, and proposes enhanced LoRA method outperforming baselines.

Every Act Has Its Price: Compressed Moral Composition in Frontier LLMs

cs.CL · 2026-05-29 · unverdicted · novelty 7.0

Moral Trolley Arena shows frontier LLMs produce composite moral preferences that are compressed rather than additive functions of calibrated component act strengths across Moral Foundations Theory.

Wait! There's a Way Out: A Decision Mechanism for Forecasting Conversational Derailment

cs.CL · 2026-05-28 · unverdicted · novelty 7.0

A deferral mechanism using forward-looking simulations reduces false positives in derailment forecasting by selectively waiting when recovery paths appear plausible.

Do LLMs Build World Models From Text? A Multilingual Diagnostic of Spatial Reasoning

cs.AI · 2026-05-27 · unverdicted · novelty 7.0

MentalMap benchmark identifies a universal L3 reasoning cliff in LLMs' text-based spatial reasoning that persists across languages, scales, and prompting, and is replicated in human evaluations.

ReSAE: Residualized Sparse Autoencoders for Multi-Layer Transformer Interventions

cs.LG · 2026-05-27 · unverdicted · novelty 7.0

ReSAEs improve multi-layer SAE interventions on Pythia-1.4B and Gemma-2-9B by training later-layer dictionaries on residuals after affine mapping, recovering more cross-entropy loss despite lower raw variance reconstruction.

StakeBench: Evaluating Language Understanding Grounded in Market Commitment

cs.CL · 2026-05-25 · unverdicted · novelty 7.0

StakeBench is a new benchmark using market-derived supervision from resolved prediction markets to test LLMs on commitment detection, side identification, action anticipation, and odds projection, revealing partial success on sides but structural failures on higher tasks.

SomaliBench Eval: Measuring English-to-Somali Refusal Gaps in Open-Weight Language Models

cs.CL · 2026-05-25 · unverdicted · novelty 7.0

SomaliBench finds large English-to-Somali refusal gaps (0.38 to 0.90) across Llama-3.1-8B, Gemma-2-9B, Qwen-2.5-7B, and Aya-23-8B, with many Somali responses being unclear rather than compliant.

Convergence Without Understanding: When Language Models Agree on Representations but Disagree on Reasoning

cs.CL · 2026-05-22 · unverdicted · novelty 7.0

Representational convergence across 16 LLMs on 800 reasoning problems is stronger for failed tasks and pre-decision stages but shows minimal causal influence on predictions, pointing to shared processing constraints over shared reasoning.

Self-Improving In-Context Learning

cs.CL · 2026-05-22 · unverdicted · novelty 7.0

A test-time zeroth-order optimization of prompt embeddings using a bounded self-supervised proxy from demonstration log-probabilities improves ICL accuracy and correlates with gains across tasks.

GraphFlow: A Graph-Based Workflow Management for Efficient LLM-Agent Serving

cs.LG · 2026-05-21 · unverdicted · novelty 7.0

GraphFlow uses a unified wGraph to dynamically instantiate workflows and manage KV caches for LLM agents, reporting 4.95 pp average gains and 4x memory reduction on five benchmarks.

Tensor Cache: Eviction-conditioned Associative Memory for Transformers

cs.LG · 2026-05-21 · unverdicted · novelty 7.0

Tensor Cache augments sliding-window attention with an eviction-fed outer-product associative memory and a training correction to improve long-context performance under bounded memory.

The Readout Shortcut: Positional Number Copying Dominates Arithmetic CoT Readout in Small Language Models

cs.LG · 2026-05-20 · unverdicted · novelty 7.0

In 1-3B instruction-tuned LMs on GSM8K, arithmetic CoT readout is dominated by positional copying of the trailing number before the answer delimiter, accounting for 54-92 percentage points of accuracy.

Chronicle: A Multimodal Foundation Model for Joint Language and Time Series Understanding

cs.LG · 2026-05-18 · unverdicted · novelty 7.0

Chronicle is the first model jointly pretrained from scratch on text and time series in a unified transformer that matches a comparable language model on NLU tasks and sets new bars for time series classification and multimodal forecasting.

Realtime-VLA FLASH: Speculative Inference Framework for Diffusion-based VLAs

cs.RO · 2026-05-13 · unverdicted · novelty 7.0

A new speculative inference system speeds up diffusion VLAs to 19.1 ms average latency (3.04x faster) on LIBERO by replacing most full 58 ms inferences with 7.8 ms draft rounds while preserving task performance.

Uncovering Symmetry Transfer in Large Language Models via Layer-Peeled Optimization

math.OC · 2026-05-12 · conditional · novelty 7.0

Symmetries in next-token prediction targets induce corresponding geometric symmetries such as circulant matrices and equiangular tight frames in the optimal weights and embeddings of a layer-peeled LLM surrogate model.

Towards Automated Air Traffic Safety Assessment Around Non-Towered Airports Using Large Language Models

cs.AI · 2026-05-12 · unverdicted · novelty 7.0

Large language models achieve macro F1 scores above 0.85 on binary nominal-versus-danger classification from CTAF radio transcripts and METAR weather data using a new synthetic dataset with a 12-category hazard taxonomy.

Causal Bias Detection in Generative Artificial Intelligence

cs.AI · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

Develops a causal framework unifying generative AI fairness with standard ML, with new decompositions, identification conditions, and estimators demonstrated on LLM race and gender bias.

citing papers explorer

Showing 43 of 243 citing papers.

Kimi K2: Open Agentic Intelligence
cs.LG · 2025-07-28 · unverdicted · none · ref 73 · internal anchor
Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.

Shared representations in brains and models reveal a two-route cortical organization during scene perception
q-bio.NC · 2025-07-18 · unverdicted · none · ref 77 · internal anchor
RSA on 7T fMRI during natural scene viewing identifies ventromedial and lateral occipitotemporal representational routes for scene context versus animate content, with differential alignment to vision and language models.

A Survey on Vision-Language-Action Models: An Action Tokenization Perspective
cs.RO · 2025-07-02 · unverdicted · none · ref 61 · internal anchor
The survey frames VLA models as pipelines that generate progressively grounded action tokens and classifies those tokens into eight types to guide future development.

Beyond Words: Multimodal LLM Knows When to Speak
cs.CV · 2025-05-20 · unverdicted · none · ref 19 · internal anchor
MM-When2Speak reformulates conversational timing as dense response-type prediction and achieves up to 3x better performance by integrating video, audio, and text cues on top of an LLM backbone using a new dyadic conversation dataset.

Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
cs.CL · 2025-03-03 · unverdicted · none · ref 44 · internal anchor
Phi-4-Mini achieves strong math and coding performance with only 3.8B parameters via high-quality synthetic data, while Phi-4-Multimodal uses Mixture-of-LoRAs to integrate modalities and top speech recognition leaderboards.

LLM-based User Profile Management for Recommender System
cs.CL · 2025-02-20 · unverdicted · none · ref 4 · internal anchor
PURE is a three-component LLM system that extracts and maintains user profiles from reviews to outperform prior LLM recommenders on sequential Amazon tasks.

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model
cs.CL · 2025-02-04 · unverdicted · none · ref 228 · internal anchor
SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.

AdaMCoT: Rethinking Cross-Lingual Factual Reasoning through Adaptive Multilingual Chain-of-Thought
cs.CL · 2025-01-27 · unverdicted · none · ref 9 · internal anchor
AdaMCoT uses dynamic routing of chain-of-thought reasoning in intermediary languages with a reward-based selector to improve cross-lingual factual consistency in LLMs.

Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference
cs.CL · 2024-12-18 · unverdicted · none · ref 139 · internal anchor
ModernBERT is a new bidirectional encoder model achieving SOTA performance on diverse classification and retrieval benchmarks while offering superior speed and memory efficiency for long-context inference.

WildFeedback: Aligning LLMs With In-situ User Interactions And Feedback
cs.CL · 2024-08-28 · unverdicted · none · ref 34 · internal anchor
WildFeedback extracts preference pairs from in-situ user feedback in LLM conversations to fine-tune models for better alignment with real user preferences.

Robust Reasoning via Dynamic Token Selection for Distribution-Aligned Self-Distillation
cs.CL · 2026-05-30 · unverdicted · none · ref 19 · internal anchor
DASD dynamically selects tokens in self-distillation to keep logical corrections while suppressing stylistic noise, improving robustness on math, code, and commonsense benchmarks.

Ideology Prediction of German Political Texts
cs.CL · 2026-05-14 · unverdicted · none · ref 5 · internal anchor
Transformer models predict German political ideology on a continuous left-right scale, reaching F1 0.844 in-domain and MAE 0.172 on newspaper out-of-domain tests.

Pre-trained LLMs Meet Sequential Recommenders: Efficient User-Centric Knowledge Distillation
cs.IR · 2026-04-23 · unverdicted · none · ref 3 · internal anchor
A distillation technique embeds LLM-generated textual user profiles into efficient sequential recommenders without runtime LLM inference, architectural changes, or fine-tuning.

Active Imitation Learning for Thermal- and Kernel-Aware LFM Inference on 3D S-NUCA Many-Cores
cs.LG · 2026-04-13 · unverdicted · none · ref 34 · internal anchor
AILFM uses active imitation learning to learn thermal- and kernel-aware scheduling policies for LFM inference on 3D S-NUCA many-cores, outperforming baselines while maintaining thermal safety.

Tokalator: A Context Engineering Toolkit for Artificial Intelligence Coding Assistants
cs.SE · 2026-04-09 · unverdicted · none · ref 17 · internal anchor
Tokalator is a toolkit with VS Code extension, calculators, and community resources to monitor and optimize token usage in AI coding environments.

Unified Deployment-Aware Evaluation of Open Reasoning Language Models
cs.CL · 2026-04-08 · unverdicted · none · ref 16 · 2 links · internal anchor
A controlled multi-model evaluation on shared data subsets shows that deployment metrics and prompting choices create important tradeoffs and alter model rankings beyond accuracy alone.

From Ambiguity to Accuracy: The Transformative Effect of Coreference Resolution on Retrieval-Augmented Generation systems
cs.CL · 2025-07-10 · unverdicted · none · ref 18 · internal anchor
Coreference resolution improves retrieval relevance and QA performance in RAG systems, with mean pooling performing best and smaller models benefiting more.

SocialLM: Social Signal Processing of Patient-Provider Communication using LLMs and Contextual Aggregation
cs.CL · 2025-05-07 · unverdicted · none · ref 57 · internal anchor
LLMs detect social signals in clinical transcripts across model families, with an agreement-weighted ensemble using group-level agreement patterns improving accuracy and stability over individual models.

Gemma 3 Technical Report
cs.CL · 2025-03-25 · accept · none · ref 20 · internal anchor
Gemma 3 introduces multimodal open models with architectural changes for efficient long context, trained via distillation and a new post-training recipe that makes the 4B version competitive with prior 27B models and the 27B version comparable to Gemini-1.5-Pro.

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features
cs.CV · 2025-02-20 · unverdicted · none · ref 23 · internal anchor
SigLIP 2 models trained with a unified recipe of captioning, self-supervised losses, and curated diverse data outperform prior SigLIP versions on classification, retrieval, localization, dense prediction, and multilingual understanding at scales from 86M to 1B parameters.

PaliGemma 2: A Family of Versatile VLMs for Transfer
cs.CV · 2024-12-04 · unverdicted · none · ref 22 · internal anchor
PaliGemma 2 is a family of vision-language models that achieves state-of-the-art results on transfer tasks like table structure recognition and radiography report generation by combining SigLIP with Gemma 2 models at various sizes and resolutions.

Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs
cs.AI · 2024-10-24 · unverdicted · none · ref 20 · internal anchor
Data-centric filtering yields an 80K preference dataset and reward models that lead RewardBench while boosting other top entries.

Low-Rank Adaptation Redux for Large Models
cs.LG · 2026-04-23 · unverdicted · none · ref 183 · internal anchor
An overview revisits LoRA variants by categorizing advances in architectural design, efficient optimization, and applications while linking them to classical signal processing tools for principled fine-tuning.

Cosmos World Foundation Model Platform for Physical AI
cs.CV · 2025-01-07 · unverdicted · none · ref 190 · internal anchor
The Cosmos platform supplies open-source pre-trained world models and supporting tools for building fine-tunable digital world simulations to train Physical AI.

Qwen2.5 Technical Report
cs.CL · 2024-12-19 · unverdicted · none · ref 18 · internal anchor
Qwen2.5 LLMs scale pre-training data to 18 trillion tokens and apply multistage reinforcement learning, achieving competitive performance on benchmarks with models up to 5 times larger.

A Survey of Large Language Models
cs.CL · 2023-03-31 · accept · none · ref 142 · internal anchor
This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.

Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey
cs.CR · 2024-09-26 · unverdicted · none · ref 146 · internal anchor
Survey of harmful fine-tuning attacks on LLMs, their variants, defense strategies, mechanical analysis, and evaluation methodologies.

Relational Linear Properties in Language Models: An Empirical Investigation
cs.LG · 2026-05-21 · unreviewed · ref 5 · internal anchor

Benchmarking and Improving Monitors for Out-Of-Distribution Alignment Failure in LLMs
cs.AI · 2026-05-20 · unreviewed · ref 90 · internal anchor

Playing Devil's Advocate: Off-the-Shelf Persona Vectors Rival Targeted Steering for Sycophancy
cs.AI · 2026-05-20 · unreviewed · ref 3 · internal anchor

Aligned Training: A Parameter-Free Method to Improve Feature Quality and Stability of Sparse Autoencoders (SAE)
cs.LG · 2026-05-18 · unreviewed · ref 31 · internal anchor

Teacher-Guided Policy Optimization for On-Policy Reasoning Distillation under Large Policy Divergence
cs.LG · 2026-05-13 · unreviewed · ref 15 · internal anchor

FinSTaR: Towards Financial Reasoning with Time Series Reasoning Models
cs.AI · 2026-05-05 · unreviewed · ref 21 · internal anchor

How Language Models Process Negation
cs.CL · 2026-05-04 · unreviewed · ref 76 · internal anchor

SecGoal: A Benchmark for Extracting Formalizable Security Goals from Protocol Documents
cs.CR · 2026-04-30 · unreviewed · ref 4 · internal anchor

Select to Think: Unlocking SLM Potential with Local Sufficiency
cs.CL · 2026-04-29 · unreviewed · ref 21 · internal anchor

From Concept-Aligned Tokens to Vulnerable Features: Mechanistic Localization of Jailbreaks
cs.CL · 2026-04-25 · unreviewed · ref 16 · internal anchor

AFRILANGTUTOR: Advancing Language Tutoring and Culture Education in Low-Resource Languages with Large Language Models
cs.CL · 2026-04-22 · unreviewed · ref 5 · internal anchor

Mechanistic Anomaly Detection via Functional Attribution
cs.LG · 2026-04-21 · unreviewed · ref 74 · internal anchor

Back into Plato's Cave: Examining Cross-modal Representational Convergence at Scale
cs.CV · 2026-04-20 · unreviewed · ref 22 · internal anchor

Mitigating Cross-Lingual Cultural Inconsistencies in LLMs via Consensus-Driven Preference Optimisation
cs.CL · 2026-04-02 · unreviewed · ref 25 · internal anchor

When the Gold Standard Isn't Necessarily Standard: Challenges of Evaluating the Translation of User-Generated Content
cs.CL · 2025-12-19 · unreviewed · ref 1 · internal anchor

Language Model Networks: Supervision-Efficient Learning through Dense Communication
cs.AI · 2025-05-19 · unreviewed · ref 43 · internal anchor
