mega hub Mixed citations

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Eric Bieber, Gheorghe Comanici, Ice Pasupat, Inderjit Dhillon, Mike Schaekermann, Noveen Sachdeva · 2025 · cs.CL · arXiv 2507.06261

Mixed citation behavior. Most common role is background (55%).

1004 Pith papers citing it

Background 55% of classified citations

open full Pith review browse 1004 citing papers more from Eric Bieber arXiv PDF

abstract

In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements and Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost. Taken together, the Gemini 2.X model generation spans the full Pareto frontier of model capability vs cost, allowing users to explore the boundaries of what is possible with complex agentic problem solving.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 122 baseline 46 method 28 other 8 dataset 3

citation-polarity summary

background 114 baseline 47 use method 28 unclear 12 support 3 use dataset 3

claims ledger

abstract In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. G

authors

Eric Bieber Gheorghe Comanici Ice Pasupat Inderjit Dhillon Mike Schaekermann Noveen Sachdeva

mega hub controls

export citing contexts JSON export graph JSON export full bundle JSON open full Pith review annotated reader queued

Recognition alignment

counterfactual ablation

If this work disappeared, these are the nearest dependency candidates in Pith, weighted toward method, dataset, baseline, and extension contexts where available. This is a structural signal, not a retraction verdict.

co-cited works

representative citing papers

EHRNote-ChatQA: A Benchmark for Evidence-Grounded Multi-Turn Clinical Question Answering over Longitudinal Discharge Summaries

cs.CL · 2026-06-14 · unverdicted · novelty 8.0

EHRNote-ChatQA is the first benchmark for evidence-grounded multi-turn clinical QA over longitudinal discharge summaries, containing 16,072 medical-expert-verified pairs across eight categories and revealing LLM weaknesses in evidence grounding and multi-turn consistency.

HKJudge: A Legal Discourse-Annotated Corpus for Interpreting What Courts Find, How They Reason, and What They Rule

cs.CL · 2026-06-04 · accept · novelty 8.0

HKJudge is a new ~290k-sentence expert-annotated corpus of Hong Kong criminal judgments with 26 rhetorical roles and 3 sentencing elements, plus benchmarks on classification and extraction tasks.

RRP-Voice: A Longitudinal Dataset and Benchmark for Recurrent Respiratory Papillomatosis Detection

eess.AS · 2026-06-01 · unverdicted · novelty 8.0

Introduces the first longitudinal voice dataset for RRP with benchmarks across handcrafted features, deep networks, self-supervised models, and audio LLMs under patient-level validation.

VideoFDB: Evaluating Full-Duplex Vision-Speech Capabilities in Conversational Agents

cs.CV · 2026-05-28 · unverdicted · novelty 8.0

VideoFDB is a new benchmark and LM-as-judge framework for evaluating full-duplex audio-visual-to-audio-visual conversational agents on nonverbal dynamics from real video calls.

EgoIntrospect: An Egocentric Dataset and Benchmark for User-Centric Internal State Reasoning

cs.CV · 2026-05-17 · unverdicted · novelty 8.0

EgoIntrospect provides the first egocentric dataset with self-annotations for internal state tasks and shows multimodal LLMs struggle to infer subjective states from combined signals.

Tracing Persona Vectors Through LLM Pretraining

cs.CL · 2026-05-13 · unverdicted · novelty 8.0

Persona vectors form within the first 0.22% of LLM pretraining and remain effective for steering post-trained models, with continued refinement and transfer to other models.

Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models

cs.AR · 2026-05-11 · conditional · novelty 8.0

Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.

Knowledge Poisoning Attacks on Medical Multi-Modal Retrieval-Augmented Generation

cs.CR · 2026-05-11 · unverdicted · novelty 8.0

M³Att poisons medical multimodal RAG by pairing covert textual misinformation with query-agnostic visual perturbations that increase retrieval of the bad content, causing LLMs to generate clinically plausible but incorrect responses.

Omni-DeepSearch: A Benchmark for Audio-Driven Omni-Modal Deep Search

cs.SD · 2026-05-09 · unverdicted · novelty 8.0

Omni-DeepSearch is a 640-sample benchmark for audio-driven omni-modal search where the best model reaches only 43.44% accuracy, exposing bottlenecks in audio inference, tool use, and cross-modal reasoning.

TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

cs.CV · 2026-05-08 · unverdicted · novelty 8.0

TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images

cs.CV · 2026-04-23 · unverdicted · novelty 8.0

S1-VL combines structured scientific reasoning with iterative image manipulation via code execution to reach state-of-the-art results on visual and scientific reasoning benchmarks.

Lost in Translation: Do LVLM Judges Generalize Across Languages?

cs.CL · 2026-04-21 · unverdicted · novelty 8.0

MM-JudgeBench shows substantial cross-lingual performance variance in 22 LVLM judges, with model size and architecture as poor predictors of multilingual robustness.

HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models

cs.SD · 2026-04-21 · unverdicted · novelty 8.0

HalluAudio is the first large-scale benchmark spanning speech, environmental sound, and music that uses human-verified QA pairs, adversarial prompts, and mixed-audio tests to measure hallucinations in large audio-language models.

When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models

cs.CV · 2026-04-19 · unverdicted · novelty 8.0

VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-specific experts and adaptive routing.

Narrative over Numbers: The Identifiable Victim Effect and its Amplification Under Alignment and Reasoning in Large Language Models

cs.CL · 2026-04-13 · conditional · novelty 8.0

Large language models display the identifiable victim effect at roughly twice the human baseline, strongly amplified by instruction tuning and chain-of-thought prompting but inverted by reasoning-specialized models.

MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark

cs.CV · 2026-04-12 · unverdicted · novelty 8.0 · 2 refs

MMRareBench provides 1,756 QA pairs and 7,958 images from PMC rare-disease cases to evaluate 23 MLLMs, revealing low treatment-planning scores and medical models underperforming general models on multi-image tasks due to capacity dilution.

HM-Bench: A Comprehensive Benchmark for Multimodal Large Language Models in Hyperspectral Remote Sensing

cs.CV · 2026-04-10 · accept · novelty 8.0

HM-Bench is the first benchmark for MLLMs on hyperspectral images, showing models struggle with complex spatial-spectral reasoning and perform better with visual PCA images than textual reports.

DialBGM: A Benchmark for Background Music Recommendation from Everyday Multi-Turn Dialogues

cs.AI · 2026-04-09 · unverdicted · novelty 8.0

DialBGM is a new benchmark dataset revealing that existing AI models fall far short of human performance when recommending fitting background music for open-domain conversations.

V2X-QA: A Comprehensive Reasoning Dataset and Benchmark for Multimodal Large Language Models in Autonomous Driving Across Ego, Infrastructure, and Cooperative Views

cs.RO · 2026-04-03 · conditional · novelty 8.0

V2X-QA provides a view-decoupled benchmark showing infrastructure views aid macroscopic traffic understanding while cooperative reasoning requires explicit cross-view alignment, with V2X-MoE as a routing-based baseline that improves performance.

ScreenParse: Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision

cs.CV · 2026-02-15 · conditional · novelty 8.0

ScreenParse dataset and ScreenVLM model deliver dense screen parsing that outperforms larger VLMs on PageIoU and transfers to better UI grounding.

EgoSound: Benchmarking Sound Understanding in Egocentric Videos

cs.CV · 2026-02-15 · unverdicted · novelty 8.0

EgoSound is a new benchmark with 7315 QA pairs across seven tasks to evaluate egocentric sound understanding in multimodal large language models.

VLRS-Bench: A Vision-Language Reasoning Benchmark for Remote Sensing

cs.CV · 2026-02-04 · unverdicted · novelty 8.0

VLRS-Bench is the first benchmark dedicated to complex vision-language reasoning in remote sensing, with 2000 QA pairs across 14 tasks in cognition, decision, and prediction dimensions.

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

cs.CV · 2026-01-15 · unverdicted · novelty 8.0

Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.

ConceptPose: Training-Free Zero-Shot Object Pose Estimation using Concept Vectors

cs.CV · 2025-12-09 · unverdicted · novelty 8.0

ConceptPose delivers state-of-the-art zero-shot relative pose estimation by matching open-vocabulary 3D concept vectors derived from VLM saliency maps, beating the strongest baseline by 62% in ADD(-S) without training.

citing papers explorer

Showing 32 of 32 citing papers after filters.

V2X-QA: A Comprehensive Reasoning Dataset and Benchmark for Multimodal Large Language Models in Autonomous Driving Across Ego, Infrastructure, and Cooperative Views cs.RO · 2026-04-03 · conditional · none · ref 3 · internal anchor
V2X-QA provides a view-decoupled benchmark showing infrastructure views aid macroscopic traffic understanding while cooperative reasoning requires explicit cross-view alignment, with V2X-MoE as a routing-based baseline that improves performance.
Unleashing Infinite Motion: Scaling Expressive Quadrupedal Motion via Generative Video Priors cs.RO · 2026-06-26 · conditional · none · ref 57 · internal anchor
Uni-Mo generates 7,488 language-annotated quadruped motions via LLM prompts and video diffusion, lifts them to 3D trajectories, and trains policies achieving 96.7% real-robot success on 392 sampled motions.
RobotEQ: Transitioning from Passive Intelligence to Active Intelligence in Embodied AI cs.RO · 2026-05-07 · unverdicted · none · ref 12 · 2 links · internal anchor
RobotEQ is a new benchmark dataset and evaluation suite showing that current embodied AI models fall short on active social-norm compliance, especially spatial grounding, though RAG with external knowledge helps.
ViVa: A Video-Generative Value Model for Robot Reinforcement Learning cs.RO · 2026-04-09 · unverdicted · none · ref 9 · internal anchor
ViVa turns a video generator into a value model for robot RL that jointly forecasts future states and task value, yielding better performance on real-world box assembly when integrated with RECAP.
Telecom World Models: Unifying Digital Twins, Foundation Models, and Predictive Planning for 6G cs.RO · 2026-04-08 · unverdicted · none · ref 2 · internal anchor
Telecom World Models introduce a three-layer architecture for learned, action-conditioned, uncertainty-aware modeling of 6G network dynamics, combining digital twins and foundation models, with a network slicing proof-of-concept showing improved KPI prediction over baselines.
ST-BiBench: Benchmarking Multi-Stream Multimodal Coordination in Bimanual Embodied Tasks for MLLMs cs.RO · 2026-02-09 · unverdicted · none · ref 34 · internal anchor
ST-BiBench reveals a coordination paradox in which MLLMs show strong high-level strategic reasoning yet fail at fine-grained 16-dimensional bimanual action synthesis and multi-stream fusion.
DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos cs.RO · 2026-02-06 · unverdicted · none · ref 18 · internal anchor
DreamDojo is a foundation world model pretrained on the largest human video dataset to date that uses continuous latent actions to transfer interaction knowledge and achieves controllable physics simulation after robot post-training.
Large Video Planner Enables Generalizable Robot Control cs.RO · 2025-12-17 · conditional · none · ref 21 · internal anchor
A video foundation model trained on human demonstrations generates zero-shot plans that convert to executable robot actions on novel scenes and tasks.
UniTac: A Unified Multimodal Model for Cross-Sensor Tactile Understanding and Generation cs.RO · 2026-06-30 · unverdicted · none · ref 9 · internal anchor
UniTac is the first unified multimodal model for cross-sensor tactile understanding and generation, using dual-level representations, two new understanding tasks, and a two-stage training paradigm with sensor-prior sampling to achieve SOTA understanding and realistic cross-sensor generation.
LocalNav: Distilling Frontier VLMs and Embodied RL for On-Device Object Goal Navigation cs.RO · 2026-06-26 · unverdicted · none · ref 3 · internal anchor
Distillation from frontier VLMs plus E-RLVR regularization produces a 4B local model that achieves 34.5% SR on OVON while cutting inference latency by 82.8%.
Verifiable Foundation Models for Robot Safety cs.RO · 2026-06-22 · unverdicted · none · ref 2 · internal anchor
FEARL decomposes robot policies into an expressive Controller and a small verifiable Safety module to enable formal verification of safety constraints while retaining foundation-model task performance.
Robot Critics that Sweat the Small Stuff cs.RO · 2026-06-19 · unverdicted · none · ref 7 · internal anchor
Fine-tuning VLMs with pairwise progress supervision from policy rollouts improves fine-grained failure detection and boosts robot manipulation success by 11% real-world and 5.9% in simulation.
Inductive Generalization for Robotic Manipulation cs.RO · 2026-06-19 · unverdicted · none · ref 9 · internal anchor
The paper introduces an inductive generalization evaluation protocol for manipulation policies and shows that SOTA vision-language-action models fail on progressively harder task variants.
What Matters in Orchestrating Robot Policies: A Systematic Study of Hierarchical VLA Agents cs.RO · 2026-06-09 · unverdicted · none · ref 26 · internal anchor
A systematic study of hierarchical VLA agents identifies design principles that improve robot manipulation performance over flat and naive hierarchical baselines in simulation and real-world experiments.
Autonomous Frontier-Based Exploration with VLM Guidance cs.RO · 2026-05-22 · unverdicted · none · ref 23 · internal anchor
A VLM-based method for selecting exploration frontiers in robotics achieves up to 24% better map coverage than standard geometric heuristics in simulated indoor environments.
RoboEvolve: Co-Evolving Planner-Simulator for Robotic Manipulation with Limited Data cs.RO · 2026-05-13 · unverdicted · none · ref 17 · internal anchor
A co-evolutionary VLM-VGM loop on 500 unlabeled images raises planner success by 30 points and simulator success by 48 percent while beating fully supervised baselines.
ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation cs.RO · 2026-05-06 · unverdicted · none · ref 16 · internal anchor
ConsisVLA-4D adds cross-view semantic alignment, cross-object geometric fusion, and cross-scene dynamic reasoning to VLA models, delivering 21.6% and 41.5% gains plus 2.3x and 2.4x speedups on LIBERO and real-world tasks.
Long-Horizon Manipulation via Trace-Conditioned VLA Planning cs.RO · 2026-04-23 · unverdicted · none · ref 14 · internal anchor
LoHo-Manip enables robust long-horizon robot manipulation by using a receding-horizon VLM manager to output progress-aware subtask sequences and 2D visual traces that condition a VLA executor for automatic replanning.
Unmasking the Illusion of Embodied Reasoning in Vision-Language-Action Models cs.RO · 2026-04-20 · unverdicted · none · ref 42 · internal anchor
State-of-the-art vision-language-action models catastrophically fail dynamic embodied reasoning due to lexical-kinematic shortcuts, behavioral inertia, and semantic feature collapse caused by architectural bottlenecks, as shown by the new BeTTER benchmark with real-world validation.
SPEAR-1: Scaling Beyond Robot Demonstrations via 3D Understanding cs.RO · 2025-11-21 · unverdicted · none · ref 9 · internal anchor
SPEAR-1 combines a 3D-enriched VLM with embodied control to match or exceed existing robotic foundation models using 20 times fewer robot demonstrations.
Progress-Think: Semantic Progress Reasoning for Vision-Language Navigation cs.RO · 2025-11-21 · unverdicted · none · ref 9 · internal anchor
Semantic progress reasoning predicts instruction-style advancement from visual history to guide policies, yielding state-of-the-art success and efficiency on R2R-CE and RxR-CE.
Alpamayo-R1: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail cs.RO · 2025-10-30 · conditional · none · ref 11 · internal anchor
Alpamayo-R1 introduces a VLA model with a Chain of Causation dataset and multi-stage SFT-plus-RL training that reports 12% better planning accuracy and 35% fewer close encounters versus trajectory-only baselines in driving tasks.
Change-Robust Online Spatial-Semantic Topological Mapping cs.RO · 2026-05-04 · unverdicted · none · ref 9
CROSS replaces globally consistent metric maps with a pose-aware topological graph of RGB-D keyframes and maintains a bounded Gaussian-mixture belief over poses via sequential hypothesis testing in SE(3) to achieve change-robust spatial-semantic mapping and navigation.
Make Your VLA More Robust Without More Data By Interleaving Motion Planning cs.RO · 2026-05-31 · unverdicted · none · ref 32 · internal anchor
MPVI interleaves model-based motion planning with VLAs via VLM completion checking to achieve 113% higher task progress on BEHAVIOR-1K without extra data.
TARIC: Memory-Augmented Traversability-Aware Outdoor VLN under Interrupted Semantic Cues cs.RO · 2026-05-29 · unverdicted · none · ref 23 · internal anchor
TARIC maintains traversability-consistent guidance using 3D cue memory during semantic cue interruptions in outdoor VLN, improving success rates on long routes.
Extending Embodied Question Answering from Perception to Decision cs.RO · 2026-05-25 · unverdicted · none · ref 13 · internal anchor
Introduces EQA-Decision dataset with 4M+ QA pairs across four embodied reasoning dimensions and RoboDecision baseline for joint perception-reasoning-decision evaluation.
CLUE: Adaptively Prioritized Contextual Cues by Leveraging a Unified Semantic Map for Effective Zero-Shot Object-Goal Navigation cs.RO · 2026-05-19 · unverdicted · none · ref 25 · internal anchor
CLUE adaptively weights room-type and object-co-location cues from an LLM to construct a unified semantic value map that improves success rate and efficiency in zero-shot object-goal navigation.
What Limits Vision-and-Language Navigation ? cs.RO · 2026-05-13 · unverdicted · none · ref 58 · internal anchor
StereoNav reaches new benchmark highs on R2R-CE and RxR-CE and improves real-robot reliability by supplying persistent target-location priors and stereo-derived geometry that stay stable under lighting changes and blur.
CoRe: Combined Rewards with Vision-Language Model Feedback for Preference-Aligned Reinforcement Learning cs.RO · 2026-07-02 · unverdicted · none · ref 77 · internal anchor
CoRe combines VLM-designed formal rewards with VLM-labeled residual rewards to produce preference-aligned policies on robotic manipulation tasks.
Language-Driven Cost Optimization for Autonomous Driving cs.RO · 2026-06-09 · unverdicted · none · ref 3 · internal anchor
LLM interprets user language to set parameters of a risk-aware MPPI controller, with human-in-the-loop validation for adaptive autonomous driving behavior.
Task Robustness via Re-Labelling Vision-Action Robot Data cs.RO · 2026-06-09 · unverdicted · none · ref 10 · internal anchor
TREAD augments robotics datasets via VLM-based sub-task generation, video segmentation, and linguistic diversity to improve policy generalization on novel tasks in LIBERO benchmarks.
WALL-WM: Carving World Action Modeling at the Event Joints cs.RO · 2026-06-01 · unverdicted · none · ref 17 · internal anchor
WALL-WM introduces event-grounded Vision-Language-Action pretraining that uses semantic events as the atomic unit to address granularity mismatch in world action models and reports state-of-the-art generalization.