Gemma 3 Technical Report

Gemma Team · 2025 · cs.CL · arXiv 2503.19786

Mixed citation behavior. Most common role is background (70%).

499 Pith papers citing it

Background 70% of classified citations

abstract

We introduce Gemma 3, a multimodal addition to the Gemma family of lightweight open models, ranging in scale from 1 to 27 billion parameters. This version introduces vision understanding abilities, a wider coverage of languages and longer context - at least 128K tokens. We also change the architecture of the model to reduce the KV-cache memory that tends to explode with long context. This is achieved by increasing the ratio of local to global attention layers, and keeping the span on local attention short. The Gemma 3 models are trained with distillation and achieve superior performance to Gemma 2 for both pre-trained and instruction finetuned versions. In particular, our novel post-training recipe significantly improves the math, chat, instruction-following and multilingual abilities, making Gemma3-4B-IT competitive with Gemma2-27B-IT and Gemma3-27B-IT comparable to Gemini-1.5-Pro across benchmarks. We release all our models to the community.
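
The KV-cache argument in the abstract can be illustrated with back-of-the-envelope arithmetic. The sketch below is not taken from the report: the layer count, the number of KV heads, the head dimension, the local window size, and the local-to-global ratio are all hypothetical values chosen for illustration. It assumes global layers cache keys and values for every token in the context, while local layers cache only the most recent `window` tokens.

```python
def kv_cache_bytes(context_len, n_layers, local_per_global, window,
                   n_kv_heads=8, head_dim=128, bytes_per_value=2):
    """Estimate the KV-cache size in bytes for a single sequence.

    Layers are assumed to repeat in groups of `local_per_global` local
    (sliding-window) layers followed by one global layer. Passing
    local_per_global=0 models an all-global baseline.
    All default sizes are illustrative assumptions, not published values.
    """
    group = local_per_global + 1
    n_global = n_layers // group
    n_local = n_layers - n_global
    # Each cached token stores one key and one value vector per KV head.
    per_token = 2 * n_kv_heads * head_dim * bytes_per_value
    global_tokens = context_len                 # global layers keep everything
    local_tokens = min(context_len, window)     # local layers keep a short span
    return per_token * (n_global * global_tokens + n_local * local_tokens)


if __name__ == "__main__":
    ctx = 128 * 1024  # the 128K-token context mentioned in the abstract
    baseline = kv_cache_bytes(ctx, n_layers=48, local_per_global=0, window=1024)
    interleaved = kv_cache_bytes(ctx, n_layers=48, local_per_global=5, window=1024)
    print(f"all-global:  {baseline / 2**30:.2f} GiB")
    print(f"interleaved: {interleaved / 2**30:.2f} GiB")
```

Under these assumed settings the global layers dominate the total, so raising the share of local layers or shortening their window shrinks the cache roughly in proportion to the fraction of layers that remain global.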

citation-role summary

background: 50 · baseline: 10 · other: 4 · dataset: 3 · method: 2

citation-polarity summary

background: 48 · baseline: 10 · unclear: 6 · use dataset: 3 · use method: 2

authors

Gemma Team

representative citing papers

Decodable Is Not Grounded: A Vision-Ablation Arbiter for VLM Spatial Reasoning

cs.CV · 2026-06-30 · unverdicted · novelty 8.0

A blank-image ablation test reveals that high probe accuracy on VLM spatial reasoning frequently reflects priors or inverted signs rather than image grounding: horizontal relations are grounded, vertical relations are driven by priors, and depth relations are inverted.

Behavior Cloning is Not All You Need: The Optimality of On-Policy Distillation for Noisy Expert Feedback

cs.LG · 2026-06-29 · unverdicted · novelty 8.0

Imitation learning from noisy experts requires exponentially many samples for offline methods but only polynomially many for a variant of on-policy distillation under a noise condition.

MedCUA-Bench: A Screenshot-Only Benchmark for Clinical Computer-Use Agents

cs.AI · 2026-06-02 · unverdicted · novelty 8.0

MedCUA-Bench provides 18 clinical scenarios in 10 domains as a testbed for computer-use agents on medical UIs, with evaluations of 23 agents showing low success rates especially on real systems like OpenEMR.

Looped Transformers with Layer Normalization Provably Learn the Power Method

cs.LG · 2026-05-30 · unverdicted · novelty 8.0

Looped linear transformers with layer normalization provably converge under gradient descent to implement the power method for principal component prediction.

Lost in Translation: Do LVLM Judges Generalize Across Languages?

cs.CL · 2026-04-21 · unverdicted · novelty 8.0

MM-JudgeBench shows substantial cross-lingual performance variance in 22 LVLM judges, with model size and architecture as poor predictors of multilingual robustness.

SAHM: A Benchmark for Arabic Financial and Shari'ah-Compliant Reasoning

cs.CL · 2026-04-21 · conditional · novelty 8.0

SAHM is the first Arabic financial benchmark with seven tasks including AAOIFI standards QA, fatwa reasoning, accounting exams, sentiment analysis, summarization, and event-cause reasoning, showing that Arabic fluency does not imply strong financial reasoning in 20 tested LLMs.

ArgBench: Benchmarking LLMs on Computational Argumentation Tasks

cs.CL · 2026-04-19 · unverdicted · novelty 8.0

ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.

VAREX: A Benchmark for Multi-Modal Structured Extraction from Documents

cs.CV · 2026-03-16 · accept · novelty 8.0

VAREX benchmark shows structured output compliance limits models under 4B parameters more than extraction ability, with layout-preserving text giving the largest accuracy gains over images.

Neural Signals Generate Clinical Notes in the Wild

cs.LG · 2026-01-29 · unverdicted · novelty 8.0

CELM is the first EEG-to-language foundation model that generates clinical reports from variable-length EEG recordings using a new dataset of 9,922 reports paired with 11,000 hours of data from 9,048 patients.

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

cs.CV · 2026-01-15 · unverdicted · novelty 8.0

Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.

Self-Study Reconsidered: The Hidden Fragility of Learning from Self-Generated QA

cs.AI · 2026-06-30 · unverdicted · novelty 7.0

Self-generated QA supervision for language models is fragile due to non-uniform question selection and instruction compliance during answering, with mitigations that reduce compliance from 88% to 13%.

RCT: A Robot-Collected Touch-Vision-Language Dataset for Tactile Generalization

cs.RO · 2026-06-30 · accept · novelty 7.0

RCT dataset with sequence-preserving splits demonstrates that tactile-to-text models achieve only 25.1% Recall@1 on held-out materials, exposing generalization as the core challenge.

Visual Semantic Entropy: Do Vision Language Models Recognize Visual Ambiguity?

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

VSE perturbs images only to probe visual ambiguity in VLMs, clusters outputs into semantic prototypes, and computes mass-weighted dispersion, outperforming prior entropy methods on five VQA benchmarks across five models.

Large Language Model Teaches Visual Students: Cross-Modality Transfer of Fine-Grained Conceptual Knowledge

cs.CV · 2026-06-25 · unverdicted · novelty 7.0

LaViD distills LLM conceptual knowledge to vision models via LLM-generated MCQ soft labels, outperforming vision-language distillation baselines on fine-grained benchmarks while improving robustness on spurious correlation datasets.

Training for the Model You Return: Improving Optimization for Iterate-Averaged Language Models

cs.LG · 2026-06-23 · unverdicted · novelty 7.0

PACE is a clipped per-coordinate controller added to AdamW that improves the limiting error of the returned iterate average in both quadratic analysis and LM experiments.

MMBU: A Massive Multi-modal Biomedical Understanding Benchmark to Probe the Perception Capabilities of Vision-Language Models

cs.CV · 2026-06-04 · unverdicted · novelty 7.0

Introduces MMBU benchmark for VLMs in biomedicine and demonstrates that established benchmarks mask perception deficiencies in evaluated models.

Why Muon Outperforms Adam: A Curvature Perspective

cs.LG · 2026-06-03 · conditional · novelty 7.0

Muon outperforms Adam by reducing curvature penalty via lower Normalized Directional Sharpness, as shown via Taylor approximation on LLM training and proven on stylized quadratic problems with heterogeneous curvature.

Text-to-Image Models Need Less from Text Encoders Than You Think

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

A bag-of-position-tagged-words embedding guides text-to-image diffusion models as effectively as full contextual text embeddings from standard encoders.

SEA-NLI: Natural Language Inference as a Lens into Southeast Asian Cultural Understanding

cs.CL · 2026-06-02 · unverdicted · novelty 7.0

SEA-NLI benchmark shows low performance across 17 LLMs on Southeast Asian cultural NLI, mainly due to missing cultural knowledge, with gains from SEA-adapted models and culture-aware prompting.

GeoDrive-Bench: Benchmarking Region-Specific Multimodal Reasoning in Autonomous Driving

cs.CV · 2026-06-01 · unverdicted · novelty 7.0

GeoDrive-Bench is a new multimodal benchmark and distillation method for testing and improving VLMs on region-specific traffic-rule reasoning in autonomous driving across six countries.

Not What, But How: A Framework for Auditing LLM Responses across Positioning, Generalization, Anthropomorphism, and Maxims

cs.CL · 2026-06-01 · unverdicted · novelty 7.0

Presents FRANZ framework and SQUARE corpus for multi-dimensional audit of LLM response framing on subjective cultural queries, applied to three models to reveal differences and couplings.

Decentralized Instruction Tuning: Conflict-Aware Splitting and Weight Merging

cs.LG · 2026-06-01 · unverdicted · novelty 7.0

MERIT enables decentralized instruction tuning via conflict-aware PCA splitting and parameter-space merging, raising average benchmark scores above joint training on multimodal and text mixtures.

Attention-guided Fine-tuning of Multimodal Large Language Models Improves Chain-of-Thought Reasoning

cs.CV · 2026-06-01 · unverdicted · novelty 7.0

Attentive-CoT is an attention-guided fine-tuning objective that improves chain-of-thought performance in multimodal LLMs by delaying answer commitment and increasing sustained visual-token access during rationale generation.

OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification

cs.LG · 2026-05-31 · unverdicted · novelty 7.0

OmniOPD replaces token-level logit matching in on-policy distillation with Monte Carlo chunk-level semantic verification and a peak-entropy scheduler.

citing papers explorer

Showing 2 of 2 citing papers after filters.

SIEVES: Selective Prediction Generalizes through Visual Evidence Scoring

cs.CV · 2026-04-28 · conditional · none · ref 40 · 2 links · internal anchor

SIEVES improves selective prediction coverage by up to 3x on OOD VQA benchmarks by training a selector to score the quality of visual evidence produced by reasoner models, generalizing across benchmarks and proprietary models without internal access or per-task retraining.

LTX-2: Efficient Joint Audio-Visual Foundation Model

cs.CV · 2026-01-06 · conditional · none · ref 27 · internal anchor

LTX-2 generates high-quality synchronized audiovisual content from text prompts via an asymmetric 14B-video / 5B-audio dual-stream transformer with cross-attention and modality-aware guidance.
