arxiv: 2507.06261 · v6 · submitted 2025-07-07 · cs.CL · cs.AI
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities
Abstract
In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements and Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost. Taken together, the Gemini 2.X model generation spans the full Pareto frontier of model capability vs cost, allowing users to explore the boundaries of what is possible with complex agentic problem solving.
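The multimodal long-context claims above map directly onto API usage. Below is a minimal sketch of asking Gemini 2.5 Pro about a long video, assuming the google-genai Python SDK; the file name, prompt, and environment setup are illustrative placeholders, not taken from the report.

from google import genai

# Assumes GEMINI_API_KEY is set in the environment.
client = genai.Client()

# Upload a long recording; the report states 2.5 Pro can process
# up to 3 hours of video content.
video = client.files.upload(file="lecture_recording.mp4")
# In practice, poll client.files.get(name=video.name) until the file
# is ACTIVE before referencing it in a request.

response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=[
        video,
        "Summarize the main argument and list each demo shown, with timestamps.",
    ],
)
print(response.text)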
This paper has not been read by Pith yet.
Forward citations
Cited by 60 Pith papers
Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.
-
Tracing Persona Vectors Through LLM Pretraining
cs.CL 2026-05 unverdicted novelty 8.0
Persona vectors form within the first 0.22% of LLM pretraining and remain effective for steering post-trained models, with continued refinement and transfer to other models.
-
Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models
cs.AR 2026-05 conditional novelty 8.0
Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over prior static PIM systems on three large models.
-
Knowledge Poisoning Attacks on Medical Multi-Modal Retrieval-Augmented Generation
cs.CR 2026-05 unverdicted novelty 8.0
M³Att poisons medical multimodal RAG by pairing covert textual misinformation with query-agnostic visual perturbations that increase retrieval of the poisoned content, causing LLMs to generate clinically plausible but incorrect responses.
-
Omni-DeepSearch: A Benchmark for Audio-Driven Omni-Modal Deep Search
cs.SD 2026-05 unverdicted novelty 8.0
Omni-DeepSearch is a 640-sample benchmark for audio-driven omni-modal search where the best model reaches only 43.44% accuracy, exposing bottlenecks in audio inference, tool use, and cross-modal reasoning.
-
TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos
cs.CV 2026-05 unverdicted novelty 8.0
TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy.
-
S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images
cs.CV 2026-04 unverdicted novelty 8.0
S1-VL combines structured scientific reasoning with iterative image manipulation via code execution to reach state-of-the-art results on visual and scientific reasoning benchmarks.
-
Lost in Translation: Do LVLM Judges Generalize Across Languages?
cs.CL 2026-04 unverdicted novelty 8.0
MM-JudgeBench shows substantial cross-lingual performance variance in 22 LVLM judges, with model size and architecture as poor predictors of multilingual robustness.
-
HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models
cs.SD 2026-04 unverdicted novelty 8.0
HalluAudio is the first large-scale benchmark spanning speech, environmental sound, and music that uses human-verified QA pairs, adversarial prompts, and mixed-audio tests to measure hallucinations in large audio-language models.
-
When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models
cs.CV 2026-04 unverdicted novelty 8.0
VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework.
-
Narrative over Numbers: The Identifiable Victim Effect and its Amplification Under Alignment and Reasoning in Large Language Models
cs.CL 2026-04 conditional novelty 8.0
Large language models display the identifiable victim effect at roughly twice the human baseline, strongly amplified by instruction tuning and chain-of-thought prompting but inverted by reasoning-specialized models.
-
MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark
cs.CV 2026-04 unverdicted novelty 8.0
MMRareBench, the first rare-disease benchmark for multimodal and multi-image clinical evaluation, provides 1,756 QA pairs and 7,958 images from PMC rare-disease cases to evaluate 23 MLLMs, revealing fragmented capabilities, low treatment-planning scores, and medical models underperforming general models on multi-image tasks.
-
HM-Bench: A Comprehensive Benchmark for Multimodal Large Language Models in Hyperspectral Remote Sensing
cs.CV 2026-04 accept novelty 8.0
HM-Bench is the first benchmark for MLLMs on hyperspectral images, showing models struggle with complex spatial-spectral reasoning and perform better with visual PCA images than textual reports.
-
DialBGM: A Benchmark for Background Music Recommendation from Everyday Multi-Turn Dialogues
cs.AI 2026-04 unverdicted novelty 8.0
DialBGM is a new benchmark dataset revealing that existing AI models fall far short of human performance when recommending fitting background music for open-domain conversations.
-
V2X-QA: A Comprehensive Reasoning Dataset and Benchmark for Multimodal Large Language Models in Autonomous Driving Across Ego, Infrastructure, and Cooperative Views
cs.RO 2026-04 conditional novelty 8.0
V2X-QA provides a view-decoupled benchmark showing infrastructure views aid macroscopic traffic understanding while cooperative reasoning requires explicit cross-view alignment, with V2X-MoE as a routing-based baseline.
-
ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both
cs.CV 2026-05 unverdicted novelty 7.0
ATLAS uses a single functional token to unify agentic and latent visual reasoning without image generation or external execution.
-
EntityBench: Towards Entity-Consistent Long-Range Multi-Shot Video Generation
cs.CV 2026-05 conditional novelty 7.0
EntityBench is a new benchmark with detailed per-shot entity schedules from real media, and the EntityMem baseline using persistent per-entity memory achieves the highest character fidelity with Cohen's d of +2.33.
-
MeMo: Memory as a Model
cs.CL 2026-05 unverdicted novelty 7.0
MeMo encodes new knowledge into a separate memory model for frozen LLMs, achieving strong performance on BrowseComp-Plus, NarrativeQA, and MuSiQue while capturing cross-document relationships.
-
MultiEmo-Bench: Multi-label Visual Emotion Analysis for Multi-modal Large Language Models
cs.CV 2026-05 conditional novelty 7.0
MultiEmo-Bench supplies 10,344 images with aggregated multi-label emotion votes from 20 annotators each to evaluate MLLMs on dominant emotion and full distribution prediction.
-
RxEval: A Prescription-Level Benchmark for Evaluating LLM Medication Recommendation
cs.LG 2026-05 unverdicted novelty 7.0
RxEval benchmark shows frontier LLMs reach at most 46.10% exact match on prescription-level medication, dose, and route selection from real patient trajectories.
-
ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents
cs.AI 2026-05 conditional novelty 7.0
ClawForge supplies a generator that turns scenario templates into reproducible command-line tasks testing state conflict handling, where the strongest frontier model scores only 45.3% strict accuracy.
-
ReTool-Video: Recursive Tool-Using Video Agents with Meta-Augmented Tool Grounding
cs.CV 2026-05 unverdicted novelty 7.0
ReTool-Video uses a 134-tool meta-augmented library and recursive grounding to translate abstract video intents into fine-grained multimodal operations, outperforming baselines on MVBench, MLVU, and Video-MME.
-
LeanSearch v2: Global Premise Retrieval for Lean 4 Theorem Proving
cs.IR 2026-05 conditional novelty 7.0
LeanSearch v2 recovers 46.1% of ground-truth premise groups for research-level Lean 4 (Mathlib) theorems within 10 candidates and raises fixed-loop proof success from 4% to 20% via an embedding-reranker pipeline with iterative sketch-retrieve-reflect retrieval.
-
AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation
cs.SE 2026-05 conditional novelty 7.0
10.7% of passing SWE-agent trajectories are Lucky Passes with chaotic behaviors, and a quality score based on process references changes model rankings across eight backends.
-
CommonWhy: A Dataset for Evaluating Entity-Based Causal Commonsense Reasoning in Large Language Models
cs.CL 2026-05 unverdicted novelty 7.0
CommonWhy is a new dataset of 15,000 why-questions for evaluating LLMs on entity-based causal commonsense reasoning grounded in Wikidata.
-
From Web to Pixels: Bringing Agentic Search into Visual Perception
cs.CV 2026-05 unverdicted novelty 7.0
WebEye benchmark and Pixel-Searcher agent enable visual perception tasks by using web search to resolve object identities before precise localization or answering.
-
CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives
cs.CV 2026-05 unverdicted novelty 7.0
CausalCine enables real-time causal autoregressive multi-shot video generation via multi-shot training, content-aware memory routing for coherence, and distillation to few-step inference.
-
Formalize, Don't Optimize: The Heuristic Trap in LLM-Generated Combinatorial Solvers
cs.AI 2026-05 unverdicted novelty 7.0
LLM-generated combinatorial solvers achieve highest correctness when the model formalizes problems for verified backends rather than attempting to optimize search, which often causes regressions.
-
Towards Automated Air Traffic Safety Assessment Around Non-Towered Airports Using Large Language Models
cs.AI 2026-05 unverdicted novelty 7.0
Large language models achieve macro F1 scores above 0.85 on binary nominal-versus-danger classification from CTAF radio transcripts and METAR weather data using a new synthetic dataset with a 12-category hazard taxonomy.
-
VIP: Visual-guided Prompt Evolution for Efficient Dense Vision-Language Inference
cs.CV 2026-05 unverdicted novelty 7.0
VIP evolves text prompts using visual cues and saliency-aware aggregation inside dino.txt to deliver 1.4-8.4% higher mIoU on dense vision-language tasks with low overhead.
-
Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation
cs.CV 2026-05 unverdicted novelty 7.0
INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text alignment as complexity grows.
-
UHR-Micro: Diagnosing and Mitigating the Resolution Illusion in Earth Observation VLMs
cs.CV 2026-05 unverdicted novelty 7.0
VLMs show a resolution illusion on UHR Earth observation imagery where higher resolution does not improve micro-target perception; UHR-Micro benchmark and MAP-Agent address this via evidence-centered active inspection.
-
Combining On-Policy Optimization and Distillation for Long-Context Reasoning in Large Language Models
cs.CL 2026-05 unverdicted novelty 7.0
dGRPO merges outcome-based policy optimization with dense teacher guidance from on-policy distillation, yielding more stable long-context reasoning on the new LongBlocks synthetic dataset.
-
Chronicles-OCR: A Cross-Temporal Perception Benchmark for the Evolutionary Trajectory of Chinese Characters
cs.CV 2026-05 unverdicted novelty 7.0
Chronicles-OCR is the first benchmark with 2,800 images across the complete evolutionary trajectory of Chinese characters, defining four tasks to evaluate VLLMs' cross-temporal visual perception.
-
CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating
cs.CV 2026-05 unverdicted novelty 7.0
CaC is a hierarchical spatiotemporal concentrating reward model for video anomalies, reporting 25.7% accuracy gains on fine-grained benchmarks and 11.7% anomaly reduction in generated videos via a new dataset.
-
AutoLLMResearch: Training Research Agents for Automating LLM Experiment Configuration -- Learning from Cheap, Optimizing Expensive
cs.AI 2026-05 unverdicted novelty 7.0
AutoLLMResearch trains agents via a multi-fidelity environment and MDP pipeline to extrapolate configuration principles from inexpensive to costly LLM experiments.
-
FlowSteer: Prompt-Only Workflow Steering Exposes Planning-Time Vulnerabilities in Multi-Agent LLM Systems
cs.CR 2026-05 unverdicted novelty 7.0
FlowSteer is a prompt-only attack that biases multi-agent LLM workflow planning to propagate malicious signals, raising success rates by up to 55%, with FlowGuard as an input-side defense reducing them by up to 34%.
-
Breaking Winner-Takes-All: Cooperative Policy Optimization Improves Diverse LLM Reasoning
cs.AI 2026-05 unverdicted novelty 7.0
GCPO shifts RLVR from rollout competition to team cooperation by assigning advantages via marginal contributions to a determinant-based coverage volume over semantic embeddings, yielding higher accuracy and solution d...
-
3D-Belief: Embodied Belief Inference via Generative 3D World Modeling
cs.CV 2026-05 unverdicted novelty 7.0
3D-Belief maintains and updates explicit 3D beliefs about partially observed environments to enable multi-hypothesis imagination and improved performance on embodied tasks.
-
Do Vision-Language-Models show human-like logical problem-solving capability in point and click puzzle games?
cs.AI 2026-05 unverdicted novelty 7.0
VLATIM benchmark reveals large VLMs excel at high-level planning in physics puzzles but struggle with precise visual grounding and mouse control, so they lack human-like problem-solving capabilities.
-
Count Anything at Any Granularity
cs.CV 2026-05 unverdicted novelty 7.0
Multi-grained counting is introduced with five granularity levels, supported by the new KubriCount dataset generated via 3D synthesis and editing, and the HieraCount model that combines text and visual exemplars for improved counting.
-
Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents
cs.CL 2026-05 unverdicted novelty 7.0
A new image-bank harness and closed-loop on-policy data evolution method raises multimodal agent performance on visual search benchmarks from 24.9% to 39.0% for an 8B model and from 30.6% to 41.5% for a 30B model.
-
MulTaBench: Benchmarking Multimodal Tabular Learning with Text and Image
cs.LG 2026-05 unverdicted novelty 7.0
MulTaBench is a new collection of 40 image-tabular and text-tabular datasets designed to test target-aware representation tuning in multimodal tabular models.
-
Omni-Persona: Systematic Benchmarking and Improving Omnimodal Personalization
cs.CV 2026-05 unverdicted novelty 7.0
Omni-Persona benchmark with 18 tasks shows open-source models have audio-visual grounding gaps, RLVR narrows them but leads to conservative outputs, and scale or recall alone fail as diagnostics.
-
TRACER: Verifiable Generative Provenance for Multimodal Tool-Using Agents
cs.CL 2026-05 unverdicted novelty 7.0
TRACER attaches verifiable sentence-level provenance records to multimodal agent outputs using tool-turn alignment and semantic relations, yielding 78.23% answer accuracy and fewer tool calls than baselines on TRACE-Bench.
-
The Cartesian Shortcut: Re-evaluate Vision Reasoning in Polar Coordinate Space
cs.CV 2026-05 unverdicted novelty 7.0
MLLMs scoring 70-83% on Cartesian visual tasks drop to 31-39% on logically equivalent polar versions, exposing reliance on grid discretization shortcuts instead of topology-invariant reasoning.
-
EpiGraph: Building Generalists for Evidence-Intensive Epilepsy Reasoning in the Wild
cs.AI 2026-05 conditional novelty 7.0
EpiGraph creates a heterogeneous epilepsy knowledge graph with 24,324 entities and 32,009 triplets that boosts LLM performance on clinical reasoning tasks by 30-41% in pharmacogenomics when used with Graph-RAG.
-
Generating Complex Code Analyzers from Natural Language Questions
cs.SE 2026-05 unverdicted novelty 7.0
Merlin generates CodeQL queries from natural language questions via RAG-based iteration and a self-test technique using assistive queries, achieving 3.8x higher task accuracy and 31% less completion time in user studies.
-
LEAF-SQL: Level-wise Exploration with Adaptive Fine-graining for Text-to-SQL Skeleton Prediction
cs.CL 2026-05 unverdicted novelty 7.0
LEAF-SQL uses level-wise exploration with adaptive fine-graining and dual agents to generate diverse SQL skeletons, reaching 71.6% execution accuracy on the BIRD benchmark and outperforming prior search- and skeleton-based methods.
-
The Extrapolation Cliff in On-Policy Distillation of Near-Deterministic Structured Outputs
cs.LG 2026-05 unverdicted novelty 7.0
On-policy distillation has an extrapolation cliff at closed-form lambda*(p,b,c) set by teacher modal probability, warm-start mass, and clip strength, past which training shifts from format-preserving to format-collapsing.
-
CollabVR: Collaborative Video Reasoning with Vision-Language and Video Generation Models
cs.CV 2026-05 unverdicted novelty 7.0
CollabVR improves video reasoning performance by coupling vision-language models and video generation models in a closed-loop step-level collaboration that detects and repairs generation failures.
-
LLM Wardens: Mitigating Adversarial Persuasion with Third-Party Conversational Oversight
cs.LG 2026-05 unverdicted novelty 7.0
A secondary warden LLM halves the success rate of hidden-goal adversarial LLMs in steering user decisions while causing only minor interference with genuine interactions.
-
Think-with-Rubrics: From External Evaluator to Internal Reasoning Guidance
cs.CL 2026-05 unverdicted novelty 7.0
Think-with-Rubrics has LLMs generate rubrics internally before responding, outperforming external rubric-as-reward baselines by 3.87 points on average across benchmarks.
-
EditRefiner: A Human-Aligned Agentic Framework for Image Editing Refinement
cs.CV 2026-05 unverdicted novelty 7.0
EditRefiner uses a perception-reasoning-action-evaluation agent loop and the EditFHF-15K human feedback dataset to refine text-guided image edits more accurately than prior methods.
-
ChartREG++: Towards Benchmarking and Improving Chart Referring Expression Grounding under Diverse referring clues and Multi-Target Referring
cs.CV 2026-05 unverdicted novelty 7.0
ChartREG++ creates a new multi-target chart grounding benchmark with diverse cues and a code-driven synthesis pipeline for accurate masks, yielding a model that outperforms baselines and generalizes to real ChartQA charts.
-
Mitigating Many-shot Jailbreak Attacks with One Single Demonstration
cs.CR 2026-05 conditional novelty 7.0
A single safety demonstration appended at inference time mitigates many-shot jailbreak attacks by counteracting implicit malicious fine-tuning on harmful examples.
-
Qwen3-VL-Seg: Unlocking Open-World Referring Segmentation with Vision-Language Grounding
cs.CV 2026-05 unverdicted novelty 7.0
Qwen3-VL-Seg decodes MLLM bounding boxes into pixel-level referring segmentation via a lightweight box-guided mask decoder, new SA1B-ORS training data, and ORS-Bench evaluation, showing strong open-world performance.
-
Do Joint Audio-Video Generation Models Understand Physics?
cs.SD 2026-05 unverdicted novelty 7.0
Current joint audio-video generation models lack robust physical commonsense, especially during transitions and when prompted for impossible behaviors.