super hub Mixed citations

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Eric Bieber, Gheorghe Comanici, Ice Pasupat, Inderjit Dhillon, Mike Schaekermann, Noveen Sachdeva · 2025 · cs.CL · arXiv 2507.06261

Mixed citation behavior. The most common role is background (55% of classified citations).

943 Pith papers cite it.

abstract

In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements and Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost. Taken together, the Gemini 2.X model generation spans the full Pareto frontier of model capability vs cost, allowing users to explore the boundaries of what is possible with complex agentic problem solving.

citation-role summary

background 122 · baseline 46 · method 28 · other 8 · dataset 3

citation-polarity summary

background 114 · baseline 47 · use method 28 · unclear 12 · support 3 · use dataset 3

representative citing papers

HKJudge: A Legal Discourse-Annotated Corpus for Interpreting What Courts Find, How They Reason, and What They Rule

cs.CL · 2026-06-04 · accept · novelty 8.0

HKJudge is a new ~290k-sentence expert-annotated corpus of Hong Kong criminal judgments with 26 rhetorical roles and 3 sentencing elements, plus benchmarks on classification and extraction tasks.

RRP-Voice: A Longitudinal Dataset and Benchmark for Recurrent Respiratory Papillomatosis Detection

eess.AS · 2026-06-01 · unverdicted · novelty 8.0

Introduces the first longitudinal voice dataset for RRP with benchmarks across handcrafted features, deep networks, self-supervised models, and audio LLMs under patient-level validation.

VideoFDB: Evaluating Full-Duplex Vision-Speech Capabilities in Conversational Agents

cs.CV · 2026-05-28 · unverdicted · novelty 8.0

VideoFDB is a new benchmark and LM-as-judge framework for evaluating full-duplex audio-visual-to-audio-visual conversational agents on nonverbal dynamics from real video calls.

EgoIntrospect: An Egocentric Dataset and Benchmark for User-Centric Internal State Reasoning

cs.CV · 2026-05-17 · unverdicted · novelty 8.0

EgoIntrospect provides the first egocentric dataset with self-annotations for internal state tasks and shows multimodal LLMs struggle to infer subjective states from combined signals.

Tracing Persona Vectors Through LLM Pretraining

cs.CL · 2026-05-13 · unverdicted · novelty 8.0

Persona vectors form within the first 0.22% of LLM pretraining and remain effective for steering post-trained models, with continued refinement and transfer to other models.

Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models

cs.AR · 2026-05-11 · conditional · novelty 8.0

Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.

Knowledge Poisoning Attacks on Medical Multi-Modal Retrieval-Augmented Generation

cs.CR · 2026-05-11 · unverdicted · novelty 8.0

M³Att poisons medical multimodal RAG by pairing covert textual misinformation with query-agnostic visual perturbations that increase retrieval of the bad content, causing LLMs to generate clinically plausible but incorrect responses.

Omni-DeepSearch: A Benchmark for Audio-Driven Omni-Modal Deep Search

cs.SD · 2026-05-09 · unverdicted · novelty 8.0

Omni-DeepSearch is a 640-sample benchmark for audio-driven omni-modal search where the best model reaches only 43.44% accuracy, exposing bottlenecks in audio inference, tool use, and cross-modal reasoning.

TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

cs.CV · 2026-05-08 · unverdicted · novelty 8.0

TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images

cs.CV · 2026-04-23 · unverdicted · novelty 8.0

S1-VL combines structured scientific reasoning with iterative image manipulation via code execution to reach state-of-the-art results on visual and scientific reasoning benchmarks.

Lost in Translation: Do LVLM Judges Generalize Across Languages?

cs.CL · 2026-04-21 · unverdicted · novelty 8.0

MM-JudgeBench shows substantial cross-lingual performance variance in 22 LVLM judges, with model size and architecture as poor predictors of multilingual robustness.

HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models

cs.SD · 2026-04-21 · unverdicted · novelty 8.0

HalluAudio is the first large-scale benchmark spanning speech, environmental sound, and music that uses human-verified QA pairs, adversarial prompts, and mixed-audio tests to measure hallucinations in large audio-language models.

When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models

cs.CV · 2026-04-19 · unverdicted · novelty 8.0

VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-specific experts and adaptive routing.

Narrative over Numbers: The Identifiable Victim Effect and its Amplification Under Alignment and Reasoning in Large Language Models

cs.CL · 2026-04-13 · conditional · novelty 8.0

Large language models display the identifiable victim effect at roughly twice the human baseline, strongly amplified by instruction tuning and chain-of-thought prompting but inverted by reasoning-specialized models.

MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark

cs.CV · 2026-04-12 · unverdicted · novelty 8.0 · 2 refs

MMRareBench provides 1,756 QA pairs and 7,958 images from PMC rare-disease cases to evaluate 23 MLLMs, revealing low treatment-planning scores and medical models underperforming general models on multi-image tasks due to capacity dilution.

HM-Bench: A Comprehensive Benchmark for Multimodal Large Language Models in Hyperspectral Remote Sensing

cs.CV · 2026-04-10 · accept · novelty 8.0

HM-Bench is the first benchmark for MLLMs on hyperspectral images, showing models struggle with complex spatial-spectral reasoning and perform better with visual PCA images than textual reports.

DialBGM: A Benchmark for Background Music Recommendation from Everyday Multi-Turn Dialogues

cs.AI · 2026-04-09 · unverdicted · novelty 8.0

DialBGM is a new benchmark dataset revealing that existing AI models fall far short of human performance when recommending fitting background music for open-domain conversations.

V2X-QA: A Comprehensive Reasoning Dataset and Benchmark for Multimodal Large Language Models in Autonomous Driving Across Ego, Infrastructure, and Cooperative Views

cs.RO · 2026-04-03 · conditional · novelty 8.0

V2X-QA provides a view-decoupled benchmark showing infrastructure views aid macroscopic traffic understanding while cooperative reasoning requires explicit cross-view alignment, with V2X-MoE as a routing-based baseline that improves performance.

ScreenParse: Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision

cs.CV · 2026-02-15 · conditional · novelty 8.0

ScreenParse dataset and ScreenVLM model deliver dense screen parsing that outperforms larger VLMs on PageIoU and transfers to better UI grounding.

EgoSound: Benchmarking Sound Understanding in Egocentric Videos

cs.CV · 2026-02-15 · unverdicted · novelty 8.0

EgoSound is a new benchmark with 7315 QA pairs across seven tasks to evaluate egocentric sound understanding in multimodal large language models.

VLRS-Bench: A Vision-Language Reasoning Benchmark for Remote Sensing

cs.CV · 2026-02-04 · unverdicted · novelty 8.0

VLRS-Bench is the first benchmark dedicated to complex vision-language reasoning in remote sensing, with 2000 QA pairs across 14 tasks in cognition, decision, and prediction dimensions.

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

cs.CV · 2026-01-15 · unverdicted · novelty 8.0

Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.

ConceptPose: Training-Free Zero-Shot Object Pose Estimation using Concept Vectors

cs.CV · 2025-12-09 · unverdicted · novelty 8.0

ConceptPose delivers state-of-the-art zero-shot relative pose estimation by matching open-vocabulary 3D concept vectors derived from VLM saliency maps, beating the strongest baseline by 62% in ADD(-S) without training.

ToG-Bench: Task-Oriented Spatio-Temporal Grounding in Egocentric Videos

cs.CV · 2025-12-03 · accept · novelty 8.0

ToG-Bench is the first benchmark for task-oriented spatio-temporal video grounding in egocentric videos, with explicit-implicit dual grounding and one-to-many object scenarios across 100 ScanNet clips and 2704 instructions.

citing papers explorer

Showing 50 of 943 citing papers.

HKJudge: A Legal Discourse-Annotated Corpus for Interpreting What Courts Find, How They Reason, and What They Rule
cs.CL · 2026-06-04 · accept · none · ref 54 · internal anchor
HKJudge is a new ~290k-sentence expert-annotated corpus of Hong Kong criminal judgments with 26 rhetorical roles and 3 sentencing elements, plus benchmarks on classification and extraction tasks.
RRP-Voice: A Longitudinal Dataset and Benchmark for Recurrent Respiratory Papillomatosis Detection
eess.AS · 2026-06-01 · unverdicted · none · ref 13 · internal anchor
Introduces the first longitudinal voice dataset for RRP with benchmarks across handcrafted features, deep networks, self-supervised models, and audio LLMs under patient-level validation.
VideoFDB: Evaluating Full-Duplex Vision-Speech Capabilities in Conversational Agents
cs.CV · 2026-05-28 · unverdicted · none · ref 8 · internal anchor
VideoFDB is a new benchmark and LM-as-judge framework for evaluating full-duplex audio-visual-to-audio-visual conversational agents on nonverbal dynamics from real video calls.
EgoIntrospect: An Egocentric Dataset and Benchmark for User-Centric Internal State Reasoning
cs.CV · 2026-05-17 · unverdicted · none · ref 6 · internal anchor
EgoIntrospect provides the first egocentric dataset with self-annotations for internal state tasks and shows multimodal LLMs struggle to infer subjective states from combined signals.
Tracing Persona Vectors Through LLM Pretraining
cs.CL · 2026-05-13 · unverdicted · none · ref 30 · internal anchor
Persona vectors form within the first 0.22% of LLM pretraining and remain effective for steering post-trained models, with continued refinement and transfer to other models.
Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models
cs.AR · 2026-05-11 · conditional · none · ref 10 · internal anchor
Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.
Knowledge Poisoning Attacks on Medical Multi-Modal Retrieval-Augmented Generation
cs.CR · 2026-05-11 · unverdicted · none · ref 56 · internal anchor
M³Att poisons medical multimodal RAG by pairing covert textual misinformation with query-agnostic visual perturbations that increase retrieval of the bad content, causing LLMs to generate clinically plausible but incorrect responses.
Omni-DeepSearch: A Benchmark for Audio-Driven Omni-Modal Deep Search
cs.SD · 2026-05-09 · unverdicted · none · ref 34 · internal anchor
Omni-DeepSearch is a 640-sample benchmark for audio-driven omni-modal search where the best model reaches only 43.44% accuracy, exposing bottlenecks in audio inference, tool use, and cross-modal reasoning.
TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos
cs.CV · 2026-05-08 · unverdicted · none · ref 17 · internal anchor
TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.
S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images
cs.CV · 2026-04-23 · unverdicted · none · ref 4 · internal anchor
S1-VL combines structured scientific reasoning with iterative image manipulation via code execution to reach state-of-the-art results on visual and scientific reasoning benchmarks.
Lost in Translation: Do LVLM Judges Generalize Across Languages?
cs.CL · 2026-04-21 · unverdicted · none · ref 6 · internal anchor
MM-JudgeBench shows substantial cross-lingual performance variance in 22 LVLM judges, with model size and architecture as poor predictors of multilingual robustness.
HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models
cs.SD · 2026-04-21 · unverdicted · none · ref 6 · internal anchor
HalluAudio is the first large-scale benchmark spanning speech, environmental sound, and music that uses human-verified QA pairs, adversarial prompts, and mixed-audio tests to measure hallucinations in large audio-language models.
When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models
cs.CV · 2026-04-19 · unverdicted · none · ref 10 · internal anchor
VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-specific experts and adaptive routing.
Narrative over Numbers: The Identifiable Victim Effect and its Amplification Under Alignment and Reasoning in Large Language Models
cs.CL · 2026-04-13 · conditional · none · ref 25 · internal anchor
Large language models display the identifiable victim effect at roughly twice the human baseline, strongly amplified by instruction tuning and chain-of-thought prompting but inverted by reasoning-specialized models.
MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark
cs.CV · 2026-04-12 · unverdicted · none · ref 6 · 2 links · internal anchor
MMRareBench provides 1,756 QA pairs and 7,958 images from PMC rare-disease cases to evaluate 23 MLLMs, revealing low treatment-planning scores and medical models underperforming general models on multi-image tasks due to capacity dilution.
HM-Bench: A Comprehensive Benchmark for Multimodal Large Language Models in Hyperspectral Remote Sensing
cs.CV · 2026-04-10 · accept · none · ref 7 · internal anchor
HM-Bench is the first benchmark for MLLMs on hyperspectral images, showing models struggle with complex spatial-spectral reasoning and perform better with visual PCA images than textual reports.
DialBGM: A Benchmark for Background Music Recommendation from Everyday Multi-Turn Dialogues
cs.AI · 2026-04-09 · unverdicted · none · ref 6 · internal anchor
DialBGM is a new benchmark dataset revealing that existing AI models fall far short of human performance when recommending fitting background music for open-domain conversations.
V2X-QA: A Comprehensive Reasoning Dataset and Benchmark for Multimodal Large Language Models in Autonomous Driving Across Ego, Infrastructure, and Cooperative Views
cs.RO · 2026-04-03 · conditional · none · ref 3 · internal anchor
V2X-QA provides a view-decoupled benchmark showing infrastructure views aid macroscopic traffic understanding while cooperative reasoning requires explicit cross-view alignment, with V2X-MoE as a routing-based baseline that improves performance.
ScreenParse: Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision
cs.CV · 2026-02-15 · conditional · none · ref 3 · internal anchor
ScreenParse dataset and ScreenVLM model deliver dense screen parsing that outperforms larger VLMs on PageIoU and transfers to better UI grounding.
EgoSound: Benchmarking Sound Understanding in Egocentric Videos
cs.CV · 2026-02-15 · unverdicted · none · ref 7 · internal anchor
EgoSound is a new benchmark with 7315 QA pairs across seven tasks to evaluate egocentric sound understanding in multimodal large language models.
VLRS-Bench: A Vision-Language Reasoning Benchmark for Remote Sensing
cs.CV · 2026-02-04 · unverdicted · none · ref 6 · internal anchor
VLRS-Bench is the first benchmark dedicated to complex vision-language reasoning in remote sensing, with 2000 QA pairs across 14 tasks in cognition, decision, and prediction dimensions.
Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding
cs.CV · 2026-01-15 · unverdicted · none · ref 25 · internal anchor
Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.
ConceptPose: Training-Free Zero-Shot Object Pose Estimation using Concept Vectors
cs.CV · 2025-12-09 · unverdicted · none · ref 6 · internal anchor
ConceptPose delivers state-of-the-art zero-shot relative pose estimation by matching open-vocabulary 3D concept vectors derived from VLM saliency maps, beating the strongest baseline by 62% in ADD(-S) without training.
ToG-Bench: Task-Oriented Spatio-Temporal Grounding in Egocentric Videos
cs.CV · 2025-12-03 · accept · none · ref 6 · internal anchor
ToG-Bench is the first benchmark for task-oriented spatio-temporal video grounding in egocentric videos, with explicit-implicit dual grounding and one-to-many object scenarios across 100 ScanNet clips and 2704 instructions.
Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark
cs.AI · 2025-09-30 · unverdicted · none · ref 22 · internal anchor
CritPt benchmark shows state-of-the-art LLMs reach only 5.7% average accuracy on full-scale unpublished physics research tasks, rising to about 10% with coding tools.
FlexiSLM: A Dynamic and Controllable Frame Rate Spoken Language Model
cs.SD · 2026-06-30 · unverdicted · none · ref 270 · internal anchor
FlexiSLM is the first spoken language model supporting dynamic and controllable frame rates on speech input and output, outperforming fixed-rate 7B models at high quality and enabling faster inference at lower rates like 6.25 Hz.
Towards in-the-wild Egocentric 3D Hand-Object Pose Estimation
cs.CV · 2026-06-29 · conditional · none · ref 11 · internal anchor
Introduces EPIC-Contact dataset and HOPformer transformer for in-the-wild egocentric 3D hand-object pose estimation, reporting 82.4% success on ARCTIC and doubled success with 75% lower contact error on the new dataset.
Multimodal Graph RAG for Long-range Visually Rich Document Understanding
cs.IR · 2026-06-27 · unverdicted · none · ref 1 · internal anchor
Multimodal graph RAG with DLVQA benchmark outperforms MMRAG and KG methods on multi-hop document VQA tasks.
Self-Supervised Theorem Discovery in a Formal Axiomatic System
cs.AI · 2026-06-27 · unverdicted · none · ref 33 · internal anchor
A self-supervised agent alternates proof search and theorem extraction in a formal system, discovers tens of thousands of theorems, solves human benchmarks, and boosts LLM proof performance when used as lemmas.
Unleashing Infinite Motion: Scaling Expressive Quadrupedal Motion via Generative Video Priors
cs.RO · 2026-06-26 · conditional · none · ref 57 · internal anchor
Uni-Mo generates 7,488 language-annotated quadruped motions via LLM prompts and video diffusion, lifts them to 3D trajectories, and trains policies achieving 96.7% real-robot success on 392 sampled motions.
Large Language Model Teaches Visual Students: Cross-Modality Transfer of Fine-Grained Conceptual Knowledge
cs.CV · 2026-06-25 · unverdicted · none · ref 70 · internal anchor
LaViD distills LLM conceptual knowledge to vision models via LLM-generated MCQ soft labels, outperforming vision-language distillation baselines on fine-grained benchmarks while improving robustness on spurious correlation datasets.
PhyEditBench: A Real-World Multi-Stage Benchmark for Physics-Aware Image Editing
cs.CV · 2026-06-25 · unverdicted · none · ref 10 · internal anchor
PhyEditBench is a new benchmark for physics-aware image editing with real and synthetic instances plus a training-free PhyWorld baseline that uses test-time scaling to outperform SOTA models.
CrypFormBench: Benchmarking Formal Analysis Capability of Large Language Models for Cryptographic Schemes
cs.CR · 2026-06-24 · unverdicted · none · ref 61 · internal anchor
CrypFormBench is a new benchmark jointly covering symbolic and computational security to evaluate LLMs on five formal analysis capabilities, with results showing top model Claude-3.5 scores 48.7/100 and most models struggling on generation, transformation, and correction.
AfriSUD: A Dependency Treebank Collection for Evaluating Models on African Languages
cs.CL · 2026-06-10 · unverdicted · none · ref 37 · internal anchor
AfriSUD supplies new SUD-annotated dependency treebanks for nine Sub-Saharan African languages and demonstrates that existing models exhibit clear limitations on their syntax.
VLADriveBench: Evaluating CoT-Action Relationship in VLA for Autonomous Driving
cs.CV · 2026-06-10 · unverdicted · none · ref 8 · internal anchor
VLADriveBench combines observational metrics and CoT intervention protocols to evaluate the relevance and causality of reasoning in vision-language-action models for autonomous driving, revealing divergent model behaviors.
Forecasting Future Behavior as a Learning Task
cs.AI · 2026-06-09 · unverdicted · none · ref 45 · internal anchor
Behavior Forecasters trained on LRM trajectories outperform larger models in predicting repeatability and input sensitivity at low cost.
SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks
cs.AI · 2026-06-08 · unverdicted · none · ref 61 · internal anchor
SpatialWorld is a new multi-simulator benchmark showing top multimodal agents achieve under 18% success on interactive spatial tasks requiring active exploration and long-horizon planning.
Optical Reasoning: Rethinking Images as an Expressive Reasoning Medium Beyond Text
cs.AI · 2026-06-08 · unverdicted · none · ref 17 · internal anchor
Optical reasoning encodes rationales in images rather than text, matching or exceeding text-based performance on math, science, and multimodal benchmarks while cutting tokens by 28.57% on language tasks and 16% on multimodal tasks.
Reason Twice: Segmentation via Candidate Discovery and Comparative Reasoning
cs.CV · 2026-06-08 · unverdicted · none · ref 13 · internal anchor
Rea2Seg turns image segmentation into candidate mask discovery from MLLM attention followed by MLLM-based comparative scoring and selection, plus a new multi-dimensional reasoning benchmark ReasonSeg-SGDR.
Co-Evolving Skill Generation and Policy Optimization
cs.CL · 2026-06-07 · unverdicted · none · ref 84 · internal anchor
Framework estimates context-dependent marginal utility of candidate skills via reward gaps in matched base vs. skill-augmented rollouts to filter skills and co-train policy as generator.
When No Answer Is Correct: Diagnosing Absent Answer Detection for MLLMs in Video Understanding
cs.AI · 2026-06-06 · accept · none · ref 20 · internal anchor
MLLMs fail to detect absent correct answers in video QA tasks across three evaluation settings, defaulting to distractors even with chain-of-thought prompting.
Support Vector Rubrics: Closing the Gap Between Self-Generated and Human Rubrics
cs.CL · 2026-06-06 · unverdicted · none · ref 50 · internal anchor
SVR learns a bank of contrastive rubrics from preference data via max-margin boundaries and prompt-conditioned selection, narrowing the gap to human rubrics on RubricBench from 24.1 to 0.3 points.
Physics in 2-Steps: Locking Motion Priors Before Visual Refinement Erases Them
cs.CV · 2026-06-04 · unverdicted · none · ref 77 · internal anchor
PhaseLock extracts motion priors from 2-step inference and enforces them via Latent Delta Guidance to raise physical consistency scores by 6.2 points on average in image-to-video diffusion models.
Towards One-to-Many Temporal Grounding
cs.CV · 2026-06-04 · unverdicted · none · ref 8 · internal anchor
Introduces OMTG benchmark with C-Acc and EtF1 metrics, a 56k dataset, and caption/temporal rewards, reaching 43.65% EtF1 SOTA on the new bench.
DisasterBench: A Multimodal Benchmark for UAV-Based Disaster Response in Complex Environments
cs.CV · 2026-06-04 · unverdicted · none · ref 52 · internal anchor
DisasterBench is a new multi-stage multimodal reasoning benchmark for UAV disaster response with 14 scenes and 9 tasks; the accompanying 2B DisasterVL model outperforms open-source MLLMs and approaches GPT-4o efficiency.
Would you still call this Dax? Novel Visual References in VLMs and Humans
cs.CV · 2026-06-03 · unverdicted · none · ref 16 · internal anchor
Presents NVRD benchmark and finds VLMs struggle to acquire novel contradictory concepts in-context while overgeneralizing relative to human judgments.
Reinforcement Learning from Rich Feedback with Distributional DAgger
cs.LG · 2026-06-03 · unverdicted · none · ref 4 · internal anchor
DistIL applies distributional DAgger with forward cross-entropy to achieve monotonic policy improvement and better Pass@N from rich feedback in RL for reasoning tasks.
GeM-NR: Geometry-Aware Multi-View Editing for Nonrigid Scene Changes
cs.CV · 2026-06-03 · unverdicted · none · ref 15 · internal anchor
GeM-NR performs multi-view consistent nonrigid editing by aligning depth-derived point clouds between edited and unedited scenes then refining projections conditioned on the original query view.
MetaPoint: Unlocking Precise Spatial Control in Agentic Visual Generation
cs.CV · 2026-06-03 · unverdicted · none · ref 6 · internal anchor
MetaPoint represents 2D coordinates as special tokens in visual generative models to enable precise spatial control using existing positional encodings without architectural modifications.
Off-Distribution Voices: Fanfiction Subgenres as Universal Vernacular Jailbreaks for Aligned LLMs
cs.CL · 2026-06-03 · unverdicted · none · ref 6 · internal anchor
Fanfiction subgenres from AO3 function as universal register-based jailbreaks, raising mean attack success rate from 0.278 to 0.731 across eight aligned LLMs on HarmBench and JailbreakBench.
