super hub Mixed citations

Gemma 3 Technical Report

Gemma Team · 2025 · cs.CL · arXiv 2503.19786

Mixed citation behavior. Most common role is background (70%).

520 Pith papers citing it

Background 70% of classified citations

open full Pith review browse 520 citing papers more from Gemma Team arXiv PDF

abstract

We introduce Gemma 3, a multimodal addition to the Gemma family of lightweight open models, ranging in scale from 1 to 27 billion parameters. This version introduces vision understanding abilities, a wider coverage of languages and longer context - at least 128K tokens. We also change the architecture of the model to reduce the KV-cache memory that tends to explode with long context. This is achieved by increasing the ratio of local to global attention layers, and keeping the span on local attention short. The Gemma 3 models are trained with distillation and achieve superior performance to Gemma 2 for both pre-trained and instruction finetuned versions. In particular, our novel post-training recipe significantly improves the math, chat, instruction-following and multilingual abilities, making Gemma3-4B-IT competitive with Gemma2-27B-IT and Gemma3-27B-IT comparable to Gemini-1.5-Pro across benchmarks. We release all our models to the community.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 50 baseline 10 other 4 dataset 3 method 2

citation-polarity summary

background 48 baseline 10 unclear 6 use dataset 3 use method 2

claims ledger

abstract We introduce Gemma 3, a multimodal addition to the Gemma family of lightweight open models, ranging in scale from 1 to 27 billion parameters. This version introduces vision understanding abilities, a wider coverage of languages and longer context - at least 128K tokens. We also change the architecture of the model to reduce the KV-cache memory that tends to explode with long context. This is achieved by increasing the ratio of local to global attention layers, and keeping the span on local attention short. The Gemma 3 models are trained with distillation and achieve superior performance to Gem

authors

Gemma Team

co-cited works

representative citing papers

Decodable Is Not Grounded: A Vision-Ablation Arbiter for VLM Spatial Reasoning

cs.CV · 2026-06-30 · unverdicted · novelty 8.0

A blank-image ablation test reveals that high probe accuracy on VLM spatial reasoning frequently reflects priors or inverted signs rather than image grounding, with horizontal grounded, vertical prior, and depth inverted.

Behavior Cloning is Not All You Need: The Optimality of On-Policy Distillation for Noisy Expert Feedback

cs.LG · 2026-06-29 · unverdicted · novelty 8.0

Noisy expert imitation learning requires exponential samples for offline methods but polynomial for a variant of on-policy distillation under a noise condition.

DataComp-VLM: Improved Open Datasets for Vision-Language Models

cs.CV · 2026-06-26 · conditional · novelty 8.0 · 2 refs

DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).

MedCUA-Bench: A Screenshot-Only Benchmark for Clinical Computer-Use Agents

cs.AI · 2026-06-02 · unverdicted · novelty 8.0

MedCUA-Bench provides 18 clinical scenarios in 10 domains as a testbed for computer-use agents on medical UIs, with evaluations of 23 agents showing low success rates especially on real systems like OpenEMR.

Looped Transformers with Layer Normalization Provably Learn the Power Method

cs.LG · 2026-05-30 · unverdicted · novelty 8.0

Looped linear transformers with LN provably converge via GD to implement the power method on principal component prediction.

Lost in Translation: Do LVLM Judges Generalize Across Languages?

cs.CL · 2026-04-21 · unverdicted · novelty 8.0

MM-JudgeBench shows substantial cross-lingual performance variance in 22 LVLM judges, with model size and architecture as poor predictors of multilingual robustness.

SAHM: A Benchmark for Arabic Financial and Shari'ah-Compliant Reasoning

cs.CL · 2026-04-21 · conditional · novelty 8.0

SAHM is the first Arabic financial benchmark with seven tasks including AAOIFI standards QA, fatwa reasoning, accounting exams, sentiment analysis, summarization, and event-cause reasoning, showing that Arabic fluency does not imply strong financial reasoning in 20 tested LLMs.

ArgBench: Benchmarking LLMs on Computational Argumentation Tasks

cs.CL · 2026-04-19 · unverdicted · novelty 8.0

ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.

VAREX: A Benchmark for Multi-Modal Structured Extraction from Documents

cs.CV · 2026-03-16 · accept · novelty 8.0

VAREX benchmark shows structured output compliance limits models under 4B parameters more than extraction ability, with layout-preserving text giving the largest accuracy gains over images.

Neural Signals Generate Clinical Notes in the Wild

cs.LG · 2026-01-29 · unverdicted · novelty 8.0

CELM is the first EEG-to-language foundation model that generates clinical reports from variable-length EEG recordings using a new dataset of 9,922 reports paired with 11,000 hours of data from 9,048 patients.

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

cs.CV · 2026-01-15 · unverdicted · novelty 8.0

Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.

EgoSafetyBench: A Diagnostic Egocentric Video Benchmark for Evaluating Embodied VLMs as Runtime Safety Guards

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

EgoSafetyBench shows VLMs reliably spot hazard-containing videos but miss specific contextual hazards and are degraded by misleading in-scene text.

Self-Study Reconsidered: The Hidden Fragility of Learning from Self-Generated QA

cs.AI · 2026-06-30 · unverdicted · novelty 7.0

Self-generated QA supervision for language models is fragile due to non-uniform question selection and instruction compliance during answering, with mitigations that reduce compliance from 88% to 13%.

RCT: A Robot-Collected Touch-Vision-Language Dataset for Tactile Generalization

cs.RO · 2026-06-30 · accept · novelty 7.0

RCT dataset with sequence-preserving splits demonstrates that tactile-to-text models achieve only 25.1% Recall@1 on held-out materials, exposing generalization as the core challenge.

Visual Semantic Entropy: Do Vision Language Models Recognize Visual Ambiguity?

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

VSE perturbs images only to probe visual ambiguity in VLMs, clusters outputs into semantic prototypes, and computes mass-weighted dispersion, outperforming prior entropy methods on five VQA benchmarks across five models.

Large Language Model Teaches Visual Students: Cross-Modality Transfer of Fine-Grained Conceptual Knowledge

cs.CV · 2026-06-25 · unverdicted · novelty 7.0

LaViD distills LLM conceptual knowledge to vision models via LLM-generated MCQ soft labels, outperforming vision-language distillation baselines on fine-grained benchmarks while improving robustness on spurious correlation datasets.

Training for the Model You Return: Improving Optimization for Iterate-Averaged Language Models

cs.LG · 2026-06-23 · unverdicted · novelty 7.0

PACE is a clipped per-coordinate controller added to AdamW that improves the limiting error of the returned iterate average in both quadratic analysis and LM experiments.

Steering Where to Listen: Instruction-Based Activation Steering Redirects Temporal Attention in Large Audio-Language Models

cs.SD · 2026-06-09 · unverdicted · novelty 7.0

Instruction-based vector steering redirects temporal attention in LALMs to acoustically relevant regions, recovering queried sound event locations with 60.87-68.72% overlap accuracy without training.

A History-Aware Visually Grounded Critic for Computer Use Agents

cs.AI · 2026-06-09 · unverdicted · novelty 7.0

HiViG is a test-time critic that combines macro-action history summarization with visual grounding of execution coordinates to reduce short-sighted and visually erroneous actions in long-horizon GUI agents.

Reason Twice: Segmentation via Candidate Discovery and Comparative Reasoning

cs.CV · 2026-06-08 · unverdicted · novelty 7.0

Rea2Seg turns image segmentation into candidate mask discovery from MLLM attention followed by MLLM-based comparative scoring and selection, plus a new multi-dimensional reasoning benchmark ReasonSeg-SGDR.

When Correct Decisions Hide Internal Stress: Decision-State Probing in Multimodal Language Models

cs.CL · 2026-06-07 · unverdicted · novelty 7.0

S³E framework finds excess decision-state displacement under semantic stress in multimodal models despite consistent correct forced-choice behavior.

When No Answer Is Correct: Diagnosing Absent Answer Detection for MLLMs in Video Understanding

cs.AI · 2026-06-06 · accept · novelty 7.0

MLLMs fail to detect absent correct answers in video QA tasks across three evaluation settings, defaulting to distractors even with chain-of-thought prompting.

Sci-Rho: A Multilingual Visually-Grounded Symbolic Benchmark for STEM Problems

cs.CV · 2026-06-06 · unverdicted · novelty 7.0

Sci-Rho is a dynamic multilingual visually-grounded symbolic benchmark for STEM problems that reveals robustness gaps in current VLMs between average and worst-case performance.

UrduMMLU: A Massive Multitask Benchmark for Urdu Language Understanding

cs.CL · 2026-06-05 · unverdicted · novelty 7.0

UrduMMLU is a new native-source MCQ benchmark for Urdu that reveals top LLMs reach only ~90% accuracy with large gaps on region-specific humanities content.

citing papers explorer

Showing 14 of 14 citing papers after filters.

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding cs.CV · 2026-01-15 · unverdicted · none · ref 136 · internal anchor
Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.
Prefill-Time Intervention for Mitigating Hallucination in Large Vision-Language Models cs.CV · 2026-04-28 · conditional · none · ref 38 · internal anchor
Prefill-Time Intervention (PTI) reduces hallucinations in large vision-language models by applying a one-time modality-aware steering correction to the initial KV cache at the prefill stage rather than during autoregressive decoding.
Can Multimodal Large Language Models Truly Understand Small Objects? cs.CV · 2026-04-24 · unverdicted · none · ref 35 · internal anchor
Current MLLMs show weak performance on small object understanding tasks, but fine-tuning with the new SOU-Train dataset measurably improves their capabilities.
Q-Mask: Query-driven Causal Masks for Text Anchoring in OCR-Oriented Vision-Language Models cs.CV · 2026-03-31 · unverdicted · none · ref 12 · internal anchor
Q-Mask uses query-conditioned causal masks to separate text location from recognition in OCR VLMs, backed by a new benchmark and 26M-pair training dataset.
Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context cs.CV · 2026-05-13 · unverdicted · none · ref 53 · internal anchor
Continued pre-training with balanced long-document VQA data extends a 7B LVLM to 128K context, improving long-document VQA by 7.1% and generalizing to 512K without further training.
Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping cs.CV · 2026-05-11 · unverdicted · none · ref 60 · internal anchor
Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to reduce reward hacking and improve performance over GRPO baselines.
Chart-FR1: Visual Focus-Driven Fine-Grained Reasoning on Dense Charts cs.CV · 2026-05-03 · unverdicted · none · ref 35 · internal anchor
Chart-FR1 uses Focus-CoT for linking reasoning to visual cues and Focus-GRPO reinforcement learning with efficiency rewards to outperform prior MLLMs on dense chart reasoning tasks.
SSL-R1: Self-Supervised Visual Reinforcement Post-Training for Multimodal Large Language Models cs.CV · 2026-04-22 · unverdicted · none · ref 56 · internal anchor
SSL-R1 reformulates visual SSL tasks into verifiable puzzles to supply rewards for RL post-training of MLLMs, yielding gains on multimodal benchmarks without external supervision.
Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation cs.CV · 2026-04-20 · unverdicted · none · ref 39 · internal anchor
By requiring and using highly discriminative LLM text features, the work enables the first effective one-step text-conditioned image generation with MeanFlow.
Boosting Visual Instruction Tuning with Self-Supervised Guidance cs.CV · 2026-04-14 · unverdicted · none · ref 39 · internal anchor
Mixing 3-10% of visually grounded self-supervised instructions into visual instruction tuning consistently boosts MLLM performance on vision-centric benchmarks.
Multi-Modal Multi-Agent Reinforcement Learning for Radiology Report Generation cs.CV · 2026-02-17 · unverdicted · none · ref 110 · internal anchor
MARL-Rad trains region-specific and global agents with reinforcement learning on clinical rewards to produce more accurate radiology reports than prior methods on MIMIC-CXR and IU X-ray datasets.
Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models cs.CV · 2026-05-12 · unverdicted · none · ref 3 · 3 links · internal anchor
GAP aligns visual latent reasoning in MLLMs via PCA-mapped decoder outputs, auxiliary visual supervision, and selective capacity-guided training, yielding top supervised performance on a 7B model with evidence that latents carry task-relevant signal.
Integration of Object Detection and Small VLMs for Construction Safety Hazard Identification cs.CV · 2026-04-06 · unverdicted · none · ref 53 · internal anchor
Detection-guided prompting raises small VLM hazard F1 from 34.5% to 50.6% and BERTScore from 0.61 to 0.82 on construction images with only 2.5 ms added latency.
Toward Native Multimodal Modeling: A Roadmap cs.CV · 2026-05-25 · unverdicted · none · ref 201 · internal anchor
A roadmap that defines architectural nativity for multimodal models and categorizes them into Multi-to-Text, Multi-to-Target, and Multi-to-Multi types while outlining an industrial pipeline toward unified transformer-based native multimodal modeling.

Gemma 3 Technical Report

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer