
Gemma 3 Technical Report

Gemma Team · 2025 · cs.CL · arXiv 2503.19786

Mixed citation behavior. Most common role is background (70%).

484 Pith papers citing it

Background 70% of classified citations


abstract

We introduce Gemma 3, a multimodal addition to the Gemma family of lightweight open models, ranging in scale from 1 to 27 billion parameters. This version introduces vision understanding abilities, a wider coverage of languages and longer context - at least 128K tokens. We also change the architecture of the model to reduce the KV-cache memory that tends to explode with long context. This is achieved by increasing the ratio of local to global attention layers, and keeping the span on local attention short. The Gemma 3 models are trained with distillation and achieve superior performance to Gemma 2 for both pre-trained and instruction finetuned versions. In particular, our novel post-training recipe significantly improves the math, chat, instruction-following and multilingual abilities, making Gemma3-4B-IT competitive with Gemma2-27B-IT and Gemma3-27B-IT comparable to Gemini-1.5-Pro across benchmarks. We release all our models to the community.
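
As a rough illustration of the KV-cache argument in the abstract, the sketch below estimates the cache size for a single sequence when only some layers attend globally and the rest use a short local window. The function name `kv_cache_bytes` and every numeric value in the example (layer count, five local layers per global layer, a 1024-token window, head sizes, 2-byte values) are illustrative assumptions, not figures taken from this page.

```python
def kv_cache_bytes(num_layers, context_len, local_per_global, window,
                   num_kv_heads, head_dim, bytes_per_value=2):
    """Approximate KV-cache size in bytes for one sequence.

    Global layers cache keys/values for every token in the context;
    local layers only cache the most recent `window` tokens.
    All parameters are hypothetical and chosen by the caller.
    """
    # Keys and values: 2 tensors per token per layer.
    per_token = 2 * num_kv_heads * head_dim * bytes_per_value

    if local_per_global == 0:
        num_global = num_layers  # every layer is global
    else:
        # One global layer for each group of `local_per_global` local layers.
        num_global = num_layers // (local_per_global + 1)
    num_local = num_layers - num_global

    global_tokens = context_len
    local_tokens = min(context_len, window)
    return per_token * (num_global * global_tokens + num_local * local_tokens)


# Hypothetical configuration, for illustration only.
config = dict(num_layers=48, context_len=128 * 1024, window=1024,
              num_kv_heads=8, head_dim=128)

all_global = kv_cache_bytes(local_per_global=0, **config)
mostly_local = kv_cache_bytes(local_per_global=5, **config)

print(f"all-global cache:   {all_global / 2**30:.2f} GiB")
print(f"mostly-local cache: {mostly_local / 2**30:.2f} GiB")
print(f"fraction retained:  {mostly_local / all_global:.3f}")
```

Under these assumed numbers the global layers dominate the cache at long context, so raising the share of local layers and keeping their window short shrinks the total roughly in proportion to the fraction of layers that remain global.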


citation-role summary

background 50 · baseline 10 · other 4 · dataset 3 · method 2

citation-polarity summary

background 48 · baseline 10 · unclear 6 · use dataset 3 · use method 2


authors

Gemma Team


representative citing papers

Decodable Is Not Grounded: A Vision-Ablation Arbiter for VLM Spatial Reasoning

cs.CV · 2026-06-30 · unverdicted · novelty 8.0

A blank-image ablation test reveals that high probe accuracy on VLM spatial reasoning frequently reflects priors or inverted signs rather than image grounding, with horizontal grounded, vertical prior, and depth inverted.

Behavior Cloning is Not All You Need: The Optimality of On-Policy Distillation for Noisy Expert Feedback

cs.LG · 2026-06-29 · unverdicted · novelty 8.0

Noisy expert imitation learning requires exponential samples for offline methods but polynomial for a variant of on-policy distillation under a noise condition.

MedCUA-Bench: A Screenshot-Only Benchmark for Clinical Computer-Use Agents

cs.AI · 2026-06-02 · unverdicted · novelty 8.0

MedCUA-Bench provides 18 clinical scenarios in 10 domains as a testbed for computer-use agents on medical UIs, with evaluations of 23 agents showing low success rates especially on real systems like OpenEMR.

Looped Transformers with Layer Normalization Provably Learn the Power Method

cs.LG · 2026-05-30 · unverdicted · novelty 8.0

Looped linear transformers with LN provably converge via GD to implement the power method on principal component prediction.

Lost in Translation: Do LVLM Judges Generalize Across Languages?

cs.CL · 2026-04-21 · unverdicted · novelty 8.0

MM-JudgeBench shows substantial cross-lingual performance variance in 22 LVLM judges, with model size and architecture as poor predictors of multilingual robustness.

SAHM: A Benchmark for Arabic Financial and Shari'ah-Compliant Reasoning

cs.CL · 2026-04-21 · conditional · novelty 8.0

SAHM is the first Arabic financial benchmark with seven tasks including AAOIFI standards QA, fatwa reasoning, accounting exams, sentiment analysis, summarization, and event-cause reasoning, showing that Arabic fluency does not imply strong financial reasoning in 20 tested LLMs.

ArgBench: Benchmarking LLMs on Computational Argumentation Tasks

cs.CL · 2026-04-19 · unverdicted · novelty 8.0

ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.

VAREX: A Benchmark for Multi-Modal Structured Extraction from Documents

cs.CV · 2026-03-16 · accept · novelty 8.0

VAREX benchmark shows structured output compliance limits models under 4B parameters more than extraction ability, with layout-preserving text giving the largest accuracy gains over images.

Neural Signals Generate Clinical Notes in the Wild

cs.LG · 2026-01-29 · unverdicted · novelty 8.0

CELM is the first EEG-to-language foundation model that generates clinical reports from variable-length EEG recordings using a new dataset of 9,922 reports paired with 11,000 hours of data from 9,048 patients.

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

cs.CV · 2026-01-15 · unverdicted · novelty 8.0

Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.

Self-Study Reconsidered: The Hidden Fragility of Learning from Self-Generated QA

cs.AI · 2026-06-30 · unverdicted · novelty 7.0

Self-generated QA supervision for language models is fragile due to non-uniform question selection and instruction compliance during answering, with mitigations that reduce compliance from 88% to 13%.

RCT: A Robot-Collected Touch-Vision-Language Dataset for Tactile Generalization

cs.RO · 2026-06-30 · accept · novelty 7.0

RCT dataset with sequence-preserving splits demonstrates that tactile-to-text models achieve only 25.1% Recall@1 on held-out materials, exposing generalization as the core challenge.

Visual Semantic Entropy: Do Vision Language Models Recognize Visual Ambiguity?

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

VSE perturbs images only to probe visual ambiguity in VLMs, clusters outputs into semantic prototypes, and computes mass-weighted dispersion, outperforming prior entropy methods on five VQA benchmarks across five models.

Large Language Model Teaches Visual Students: Cross-Modality Transfer of Fine-Grained Conceptual Knowledge

cs.CV · 2026-06-25 · unverdicted · novelty 7.0

LaViD distills LLM conceptual knowledge to vision models via LLM-generated MCQ soft labels, outperforming vision-language distillation baselines on fine-grained benchmarks while improving robustness on spurious correlation datasets.

Training for the Model You Return: Improving Optimization for Iterate-Averaged Language Models

cs.LG · 2026-06-23 · unverdicted · novelty 7.0

PACE is a clipped per-coordinate controller added to AdamW that improves the limiting error of the returned iterate average in both quadratic analysis and LM experiments.

Text-to-Image Models Need Less from Text Encoders Than You Think

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

A bag-of-position-tagged-words embedding guides text-to-image diffusion models as effectively as full contextual text embeddings from standard encoders.

SEA-NLI: Natural Language Inference as a Lens into Southeast Asian Cultural Understanding

cs.CL · 2026-06-02 · unverdicted · novelty 7.0

SEA-NLI benchmark shows low performance across 17 LLMs on Southeast Asian cultural NLI, mainly due to missing cultural knowledge, with gains from SEA-adapted models and culture-aware prompting.

GeoDrive-Bench: Benchmarking Region-Specific Multimodal Reasoning in Autonomous Driving

cs.CV · 2026-06-01 · unverdicted · novelty 7.0

GeoDrive-Bench is a new multimodal benchmark and distillation method for testing and improving VLMs on region-specific traffic-rule reasoning in autonomous driving across six countries.

Not What, But How: A Framework for Auditing LLM Responses across Positioning, Generalization, Anthromorphism, and Maxims

cs.CL · 2026-06-01 · unverdicted · novelty 7.0

Presents FRANZ framework and SQUARE corpus for multi-dimensional audit of LLM response framing on subjective cultural queries, applied to three models to reveal differences and couplings.

Decentralized Instruction Tuning: Conflict-Aware Splitting and Weight Merging

cs.LG · 2026-06-01 · unverdicted · novelty 7.0

MERIT enables decentralized instruction tuning via conflict-aware PCA splitting and parameter-space merging, raising average benchmark scores above joint training on multimodal and text mixtures.

Attention-guided Fine-tuning of Multimodal Large Language Models Improves Chain-of-Thought Reasoning

cs.CV · 2026-06-01 · unverdicted · novelty 7.0

Attentive-CoT is an attention-guided fine-tuning objective that improves chain-of-thought performance in multimodal LLMs by delaying answer commitment and increasing sustained visual-token access during rationale generation.

OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification

cs.LG · 2026-05-31 · unverdicted · novelty 7.0

OmniOPD replaces token-level logit matching in on-policy distillation with Monte Carlo chunk-level semantic verification and a peak-entropy scheduler.

Subliminal Learning Is Steering Vector Distillation

cs.AI · 2026-05-31 · unverdicted · novelty 7.0

Subliminal learning is steering vector distillation: a student fine-tuned on a steered teacher's outputs learns to imitate the steering vector.

How Far Do Auto-Interpretation Labels Generalize: A Controlled Study Across Languages, Scripts, and Rewordings

cs.CL · 2026-05-29 · unverdicted · novelty 7.0

Auto-interpretation labels for SAE features generalize poorly across languages and scripts, missing the same semantic content up to 4x more often in Serbian than English and more in Cyrillic than Latin despite deterministic transliteration.

citing papers explorer

Showing 50 of 412 citing papers after filters.

Decodable Is Not Grounded: A Vision-Ablation Arbiter for VLM Spatial Reasoning cs.CV · 2026-06-30 · unverdicted · none · ref 11 · internal anchor
A blank-image ablation test reveals that high probe accuracy on VLM spatial reasoning frequently reflects priors or inverted signs rather than image grounding, with horizontal grounded, vertical prior, and depth inverted.
Behavior Cloning is Not All You Need: The Optimality of On-Policy Distillation for Noisy Expert Feedback cs.LG · 2026-06-29 · unverdicted · none · ref 29 · internal anchor
Noisy expert imitation learning requires exponential samples for offline methods but polynomial for a variant of on-policy distillation under a noise condition.
MedCUA-Bench: A Screenshot-Only Benchmark for Clinical Computer-Use Agents cs.AI · 2026-06-02 · unverdicted · none · ref 30 · internal anchor
MedCUA-Bench provides 18 clinical scenarios in 10 domains as a testbed for computer-use agents on medical UIs, with evaluations of 23 agents showing low success rates especially on real systems like OpenEMR.
Looped Transformers with Layer Normalization Provably Learn the Power Method cs.LG · 2026-05-30 · unverdicted · none · ref 36 · internal anchor
Looped linear transformers with LN provably converge via GD to implement the power method on principal component prediction.
Lost in Translation: Do LVLM Judges Generalize Across Languages? cs.CL · 2026-04-21 · unverdicted · none · ref 21 · internal anchor
MM-JudgeBench shows substantial cross-lingual performance variance in 22 LVLM judges, with model size and architecture as poor predictors of multilingual robustness.
SAHM: A Benchmark for Arabic Financial and Shari'ah-Compliant Reasoning cs.CL · 2026-04-21 · conditional · none · ref 1 · internal anchor
SAHM is the first Arabic financial benchmark with seven tasks including AAOIFI standards QA, fatwa reasoning, accounting exams, sentiment analysis, summarization, and event-cause reasoning, showing that Arabic fluency does not imply strong financial reasoning in 20 tested LLMs.
ArgBench: Benchmarking LLMs on Computational Argumentation Tasks cs.CL · 2026-04-19 · unverdicted · none · ref 85 · internal anchor
ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.
VAREX: A Benchmark for Multi-Modal Structured Extraction from Documents cs.CV · 2026-03-16 · accept · none · ref 4 · internal anchor
VAREX benchmark shows structured output compliance limits models under 4B parameters more than extraction ability, with layout-preserving text giving the largest accuracy gains over images.
Neural Signals Generate Clinical Notes in the Wild cs.LG · 2026-01-29 · unverdicted · none · ref 7 · internal anchor
CELM is the first EEG-to-language foundation model that generates clinical reports from variable-length EEG recordings using a new dataset of 9,922 reports paired with 11,000 hours of data from 9,048 patients.
Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding cs.CV · 2026-01-15 · unverdicted · none · ref 136 · internal anchor
Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.
Self-Study Reconsidered: The Hidden Fragility of Learning from Self-Generated QA cs.AI · 2026-06-30 · unverdicted · none · ref 9 · internal anchor
Self-generated QA supervision for language models is fragile due to non-uniform question selection and instruction compliance during answering, with mitigations that reduce compliance from 88% to 13%.
RCT: A Robot-Collected Touch-Vision-Language Dataset for Tactile Generalization cs.RO · 2026-06-30 · accept · none · ref 26 · internal anchor
RCT dataset with sequence-preserving splits demonstrates that tactile-to-text models achieve only 25.1% Recall@1 on held-out materials, exposing generalization as the core challenge.
Visual Semantic Entropy: Do Vision Language Models Recognize Visual Ambiguity? cs.CV · 2026-06-30 · unverdicted · none · ref 4 · internal anchor
VSE perturbs images only to probe visual ambiguity in VLMs, clusters outputs into semantic prototypes, and computes mass-weighted dispersion, outperforming prior entropy methods on five VQA benchmarks across five models.
Large Language Model Teaches Visual Students: Cross-Modality Transfer of Fine-Grained Conceptual Knowledge cs.CV · 2026-06-25 · unverdicted · none · ref 64 · internal anchor
LaViD distills LLM conceptual knowledge to vision models via LLM-generated MCQ soft labels, outperforming vision-language distillation baselines on fine-grained benchmarks while improving robustness on spurious correlation datasets.
Training for the Model You Return: Improving Optimization for Iterate-Averaged Language Models cs.LG · 2026-06-23 · unverdicted · none · ref 11 · internal anchor
PACE is a clipped per-coordinate controller added to AdamW that improves the limiting error of the returned iterate average in both quadratic analysis and LM experiments.
Text-to-Image Models Need Less from Text Encoders Than You Think cs.CV · 2026-06-02 · unverdicted · none · ref 30 · internal anchor
A bag-of-position-tagged-words embedding guides text-to-image diffusion models as effectively as full contextual text embeddings from standard encoders.
SEA-NLI: Natural Language Inference as a Lens into Southeast Asian Cultural Understanding cs.CL · 2026-06-02 · unverdicted · none · ref 2 · internal anchor
SEA-NLI benchmark shows low performance across 17 LLMs on Southeast Asian cultural NLI, mainly due to missing cultural knowledge, with gains from SEA-adapted models and culture-aware prompting.
GeoDrive-Bench: Benchmarking Region-Specific Multimodal Reasoning in Autonomous Driving cs.CV · 2026-06-01 · unverdicted · none · ref 34 · internal anchor
GeoDrive-Bench is a new multimodal benchmark and distillation method for testing and improving VLMs on region-specific traffic-rule reasoning in autonomous driving across six countries.
Not What, But How: A Framework for Auditing LLM Responses across Positioning, Generalization, Anthromorphism, and Maxims cs.CL · 2026-06-01 · unverdicted · none · ref 4 · internal anchor
Presents FRANZ framework and SQUARE corpus for multi-dimensional audit of LLM response framing on subjective cultural queries, applied to three models to reveal differences and couplings.
Decentralized Instruction Tuning: Conflict-Aware Splitting and Weight Merging cs.LG · 2026-06-01 · unverdicted · none · ref 1 · internal anchor
MERIT enables decentralized instruction tuning via conflict-aware PCA splitting and parameter-space merging, raising average benchmark scores above joint training on multimodal and text mixtures.
Attention-guided Fine-tuning of Multimodal Large Language Models Improves Chain-of-Thought Reasoning cs.CV · 2026-06-01 · unverdicted · none · ref 46 · internal anchor
Attentive-CoT is an attention-guided fine-tuning objective that improves chain-of-thought performance in multimodal LLMs by delaying answer commitment and increasing sustained visual-token access during rationale generation.
OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification cs.LG · 2026-05-31 · unverdicted · none · ref 18 · internal anchor
OmniOPD replaces token-level logit matching in on-policy distillation with Monte Carlo chunk-level semantic verification and a peak-entropy scheduler.
Subliminal Learning Is Steering Vector Distillation cs.AI · 2026-05-31 · unverdicted · none · ref 36 · internal anchor
Subliminal learning is steering vector distillation: a student fine-tuned on a steered teacher's outputs learns to imitate the steering vector.
How Far Do Auto-Interpretation Labels Generalize: A Controlled Study Across Languages, Scripts, and Rewordings cs.CL · 2026-05-29 · unverdicted · none · ref 45 · internal anchor
Auto-interpretation labels for SAE features generalize poorly across languages and scripts, missing the same semantic content up to 4x more often in Serbian than English and more in Cyrillic than Latin despite deterministic transliteration.
LLMs Need Encoders for Semantic IDs Too cs.IR · 2026-05-29 · unverdicted · none · ref 18 · internal anchor
PrefixMem encoder for Semantic IDs improves deepest-level accuracy by up to 46% relative and full-SID retrieval recall by up to 22% relative on Pinterest data across LLM families.
FAM-Bench: A Multimodal Benchmark for Condition-Aware Food-as-Medicine Reasoning cs.AI · 2026-05-29 · unverdicted · none · ref 4 · internal anchor
FAM-Bench introduces 2500 nutrition-expert-verified multimodal instances across 13 conditions for dish suitability assessment and comparative ranking tasks.
The Shape of Addition: Geometric Structures of Arithmetic in Large Language Models cs.LG · 2026-05-29 · unverdicted · none · ref 22 · internal anchor
LLM residual streams during addition form an Iso-Raw-Sum Trajectory anchored by digit semantics and modulated by continuous carry signals, with errors arising as geometric slippages across quantization thresholds in a noisy model.
Every Act Has Its Price: Compressed Moral Composition in Frontier LLMs cs.CL · 2026-05-29 · unverdicted · none · ref 31 · internal anchor
Moral Trolley Arena shows frontier LLMs produce composite moral preferences that are compressed rather than additive functions of calibrated component act strengths across Moral Foundations Theory.
CardioLens: Revealing the Clinical Reality Gap of MLLMs via Multi-Sequence Cardiac MRI Evaluations cs.CV · 2026-05-28 · unverdicted · none · ref 15 · internal anchor
CardioLens is a leakage-resistant CMR testbed of 473k slices and 13k QA pairs showing current MLLMs exhibit a large clinical reality gap with category-collapse failures on real workflows.
AsymVLM: Asymmetric Token Pruning for Efficient Vision-Language Model Inference cs.LG · 2026-05-28 · unverdicted · none · ref 7 · internal anchor
AsymVLM introduces asymmetric token pruning for vision and text in VLMs to deliver up to 54% FLOPs reduction while matching or exceeding prior methods on localized visual tasks.
Kronecker Embeddings: Byte-Level Structured Token Representations for Parameter-Efficient Language Models cs.CL · 2026-05-28 · unverdicted · none · ref 4 · internal anchor
Kronecker Embeddings replace learned embedding tables with a deterministic byte-level character-position factorization and single projection, reducing parameters over 90% with reported gains in loss and robustness on language modeling tasks.
BrahmicTokenizer-131K: An Indic-Capable Drop-In Replacement for o200k_base cs.CL · 2026-05-28 · unverdicted · none · ref 14 · internal anchor
BrahmicTokenizer-131K is a 131K-vocab tokenizer constructed via script-prune crop and linear-programming retrofit to o200k_base, achieving 26.7% fewer tokens on Indic text while matching o200k_base on English fertility and outperforming alternatives on code/math benchmarks.
Sense Representations Are Inducible Interfaces cs.CL · 2026-05-27 · unverdicted · none · ref 2 · internal anchor
ACROS induces explicit sense representations in frozen decoder LMs via gated residual addition, enabling competitive zero-shot WSD, lexical steering, and cross-lingual adaptation on SmolLM2-360M while preserving base quality.
Interpretability-Guided Layer Selection over Subspace Projection: SAEs as Stethoscopes, Not Scalpels, for Raw Task Vector Model Editing cs.LG · 2026-05-27 · unverdicted · none · ref 7 · internal anchor
SAEs used for layer selection with raw task vectors outperform subspace projection and raise math reasoning accuracy on Gemma-3-4B-IT.
Towards Cost-effective LLMs Routing with Batch Prompting cs.DB · 2026-05-27 · unverdicted · none · ref 29 · internal anchor
RoBatch is a two-stage framework that formulates and solves the joint Route with Batching Problem via a batch-aware proxy utility model and greedy scheduling, outperforming separate routing or batching baselines on six benchmarks.
The Harder Text Embedding Benchmark (HTEB): Beyond One-dimensional Static Robustness cs.CL · 2026-05-27 · unverdicted · none · ref 19 · internal anchor
HTEB introduces dynamic, multi-axis evaluation of text embedding robustness using LLM transformations, finding decoupled profiles across models and that scaling does not close all robustness gaps.
OmniToM: Benchmarking Theory of Mind in LLMs via Explicit Belief Modeling cs.AI · 2026-05-25 · unverdicted · none · ref 8 · internal anchor
OmniToM is a new benchmark for Theory of Mind in LLMs that evaluates explicit belief extraction and seven-dimensional labeling from 895 stories, revealing an actor-specific belief-tracking bottleneck.
PEDESTRIANQA: A Benchmark for Vision-Language Models on Pedestrian Intention and Trajectory Prediction cs.CV · 2026-05-23 · unverdicted · none · ref 12 · internal anchor
PedestrianQA is a new benchmark that turns pedestrian behavior prediction into VLM question-answering with rationales, reporting improved intention classification, trajectory accuracy, and explanation quality after fine-tuning on multiple existing video datasets.
From Activation to Causality: Discovery of Causal Visual Representations in the Human Brain cs.CV · 2026-05-22 · unverdicted · none · ref 41 · internal anchor
BrainCause recovers known visual localizations and finds new candidate representations by validating causal specificity via counterfactual stimuli and encoding models, showing activation alone produces many false positives.
Understanding Data Temporality Impact on Large Language Models Pre-training cs.CL · 2026-05-21 · unverdicted · none · ref 4 · internal anchor
Pre-training 6B LLMs on temporally ordered Common Crawl snapshots yields models with improved factual freshness and temporal precision over shuffled baselines while matching on general language understanding.
Llamas on the Web: Memory-Efficient, Performance-Portable, and Multi-Precision LLM Inference with WebGPU cs.DC · 2026-05-20 · conditional · none · ref 62 · internal anchor
LlamaWeb is a WebGPU backend for llama.cpp that uses static memory planning, tunable kernels, and templated multi-precision support to cut memory use by 29-33% and raise decode throughput by 45-69% versus prior browser frameworks on tested hardware.
Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents? cs.CL · 2026-05-18 · unverdicted · none · ref 12 · internal anchor
REFLECT benchmark shows current LLM judges achieve below 55% accuracy detecting failures in evidence-based research agents, especially on evidence verification.
Chronicle: A Multimodal Foundation Model for Joint Language and Time Series Understanding cs.LG · 2026-05-18 · unverdicted · none · ref 61 · internal anchor
Chronicle is the first model jointly pretrained from scratch on text and time series in a unified transformer that matches a comparable language model on NLU tasks and sets new bars for time series classification and multimodal forecasting.
Residual Paving: Diagnosing the Routing Bottleneck in Selective Refusal Editing cs.LG · 2026-05-18 · unverdicted · none · ref 18 · internal anchor
Residual Paving decomposes selective refusal editing into an early-layer router for intervention decisions and later-layer residual experts for edits, with oracle routing showing that learned route selectivity is the primary bottleneck across six backbones.
BacktestBench: Benchmarking Large Language Models for Automated Quantitative Strategy Backtesting cs.CL · 2026-05-18 · unverdicted · none · ref 10 · 2 links · internal anchor
Introduces BacktestBench benchmark with 18k QA pairs across four backtesting tasks and evaluates 23 LLMs via the AutoBacktest multi-agent system.
Scale-Dependent Collective Adaptation in Self-Amending LLM Societies: A Cross-Family Study of Emergent Governance nlin.AO · 2026-05-17 · unverdicted · none · ref 2 · internal anchor
LLM societies in Nomic show non-monotonic collective adaptation peaking at mid-scales, with smaller models rule-inert and larger ones restrictive.
LEAP: Learnable End-to-End Adaptive Pruning of Large Language Models cs.LG · 2026-05-17 · unverdicted · none · ref 11 · 2 links · internal anchor
LEAP learns unstructured pruning masks end-to-end for LLMs via Gumbel-sigmoid Bernoulli relaxation and reports +2.59 average zero-shot accuracy gain over ADMM at 50-60% sparsity across five model families.
To Call or Not to Call: Diagnosing Intrinsic Over-Calling Bias in LLM Agents cs.LG · 2026-05-16 · conditional · none · ref 9 · internal anchor
LLM agents have an intrinsic over-calling bias diagnosed via SAE activation margins and corrected by adaptive margin-calibrated steering, improving overall decision accuracy.
Artificial Aphasias in Lesioned Language Models cs.CL · 2026-05-15 · unverdicted · none · ref 16 · internal anchor
Lesioning parameters in large language models produces aphasia-like symptoms whose distributions vary by attention versus feed-forward components and by layer depth, but differ qualitatively from human clinical profiles.
$\phi$-Balancing for Mixture-of-Experts Training cs.LG · 2026-05-14 · unverdicted · none · ref 12 · internal anchor
φ-balancing is a convex optimization method for population-level expert balance in MoE training that derives an online EMA adjustment and outperforms heuristic baselines.
