Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Eric Bieber, Gheorghe Comanici, Ice Pasupat, Inderjit Dhillon, Mike Schaekermann, Noveen Sachdeva · 2025 · cs.CL · arXiv 2507.06261

Mixed citation behavior. Most common role is background (55%).

916 Pith papers citing it

Background 55% of classified citations

abstract

In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements and Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost. Taken together, the Gemini 2.X model generation spans the full Pareto frontier of model capability vs cost, allowing users to explore the boundaries of what is possible with complex agentic problem solving.

citation-role summary

background 122 · baseline 46 · method 28 · other 8 · dataset 3

citation-polarity summary

background 114 · baseline 47 · use method 28 · unclear 12 · support 3 · use dataset 3
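
The counts in the two summaries can be turned into percentage shares. The sketch below is illustrative only: it assumes a share is a label's count divided by the sum of all listed counts, a definition the page does not state. Under that assumption the polarity counts give a background share that rounds to the 55% quoted above, while the role counts give a slightly higher figure.

```python
# Counts copied from the citation-role and citation-polarity summaries.
role_counts = {"background": 122, "baseline": 46, "method": 28,
               "other": 8, "dataset": 3}
polarity_counts = {"background": 114, "baseline": 47, "use method": 28,
                   "unclear": 12, "support": 3, "use dataset": 3}


def share(counts, label):
    """Percentage of `label` among all counts (assumed definition)."""
    total = sum(counts.values())
    return 100.0 * counts[label] / total


role_share = share(role_counts, "background")
polarity_share = share(polarity_counts, "background")
print(f"role: {role_share:.1f}%  polarity: {polarity_share:.1f}%")
```

Both summaries total 207 classified citations, far fewer than the 916 citing papers, so these shares describe only the classified subset.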

representative citing papers

HKJudge: A Legal Discourse-Annotated Corpus for Interpreting What Courts Find, How They Reason, and What They Rule

cs.CL · 2026-06-04 · accept · novelty 8.0

HKJudge is a new ~290k-sentence expert-annotated corpus of Hong Kong criminal judgments with 26 rhetorical roles and 3 sentencing elements, plus benchmarks on classification and extraction tasks.

RRP-Voice: A Longitudinal Dataset and Benchmark for Recurrent Respiratory Papillomatosis Detection

eess.AS · 2026-06-01 · unverdicted · novelty 8.0

Introduces the first longitudinal voice dataset for RRP with benchmarks across handcrafted features, deep networks, self-supervised models, and audio LLMs under patient-level validation.

VideoFDB: Evaluating Full-Duplex Vision-Speech Capabilities in Conversational Agents

cs.CV · 2026-05-28 · unverdicted · novelty 8.0

VideoFDB is a new benchmark and LM-as-judge framework for evaluating full-duplex audio-visual-to-audio-visual conversational agents on nonverbal dynamics from real video calls.

EgoIntrospect: An Egocentric Dataset and Benchmark for User-Centric Internal State Reasoning

cs.CV · 2026-05-17 · unverdicted · novelty 8.0

EgoIntrospect provides the first egocentric dataset with self-annotations for internal state tasks and shows multimodal LLMs struggle to infer subjective states from combined signals.

Tracing Persona Vectors Through LLM Pretraining

cs.CL · 2026-05-13 · unverdicted · novelty 8.0

Persona vectors form within the first 0.22% of LLM pretraining and remain effective for steering post-trained models, with continued refinement and transfer to other models.

Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models

cs.AR · 2026-05-11 · conditional · novelty 8.0

Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.

Knowledge Poisoning Attacks on Medical Multi-Modal Retrieval-Augmented Generation

cs.CR · 2026-05-11 · unverdicted · novelty 8.0

M³Att poisons medical multimodal RAG by pairing covert textual misinformation with query-agnostic visual perturbations that increase retrieval of the bad content, causing LLMs to generate clinically plausible but incorrect responses.

Omni-DeepSearch: A Benchmark for Audio-Driven Omni-Modal Deep Search

cs.SD · 2026-05-09 · unverdicted · novelty 8.0

Omni-DeepSearch is a 640-sample benchmark for audio-driven omni-modal search where the best model reaches only 43.44% accuracy, exposing bottlenecks in audio inference, tool use, and cross-modal reasoning.

TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

cs.CV · 2026-05-08 · unverdicted · novelty 8.0

TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images

cs.CV · 2026-04-23 · unverdicted · novelty 8.0

S1-VL combines structured scientific reasoning with iterative image manipulation via code execution to reach state-of-the-art results on visual and scientific reasoning benchmarks.

Lost in Translation: Do LVLM Judges Generalize Across Languages?

cs.CL · 2026-04-21 · unverdicted · novelty 8.0

MM-JudgeBench shows substantial cross-lingual performance variance in 22 LVLM judges, with model size and architecture as poor predictors of multilingual robustness.

HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models

cs.SD · 2026-04-21 · unverdicted · novelty 8.0

HalluAudio is the first large-scale benchmark spanning speech, environmental sound, and music that uses human-verified QA pairs, adversarial prompts, and mixed-audio tests to measure hallucinations in large audio-language models.

When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models

cs.CV · 2026-04-19 · unverdicted · novelty 8.0

VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-specific experts and adaptive routing.

Narrative over Numbers: The Identifiable Victim Effect and its Amplification Under Alignment and Reasoning in Large Language Models

cs.CL · 2026-04-13 · conditional · novelty 8.0

Large language models display the identifiable victim effect at roughly twice the human baseline, strongly amplified by instruction tuning and chain-of-thought prompting but inverted by reasoning-specialized models.

MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark

cs.CV · 2026-04-12 · unverdicted · novelty 8.0 · 2 refs

MMRareBench provides 1,756 QA pairs and 7,958 images from PMC rare-disease cases to evaluate 23 MLLMs, revealing low treatment-planning scores and medical models underperforming general models on multi-image tasks due to capacity dilution.

HM-Bench: A Comprehensive Benchmark for Multimodal Large Language Models in Hyperspectral Remote Sensing

cs.CV · 2026-04-10 · accept · novelty 8.0

HM-Bench is the first benchmark for MLLMs on hyperspectral images, showing models struggle with complex spatial-spectral reasoning and perform better with visual PCA images than textual reports.

DialBGM: A Benchmark for Background Music Recommendation from Everyday Multi-Turn Dialogues

cs.AI · 2026-04-09 · unverdicted · novelty 8.0

DialBGM is a new benchmark dataset revealing that existing AI models fall far short of human performance when recommending fitting background music for open-domain conversations.

V2X-QA: A Comprehensive Reasoning Dataset and Benchmark for Multimodal Large Language Models in Autonomous Driving Across Ego, Infrastructure, and Cooperative Views

cs.RO · 2026-04-03 · conditional · novelty 8.0

V2X-QA provides a view-decoupled benchmark showing infrastructure views aid macroscopic traffic understanding while cooperative reasoning requires explicit cross-view alignment, with V2X-MoE as a routing-based baseline that improves performance.

ScreenParse: Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision

cs.CV · 2026-02-15 · conditional · novelty 8.0

ScreenParse dataset and ScreenVLM model deliver dense screen parsing that outperforms larger VLMs on PageIoU and transfers to better UI grounding.

EgoSound: Benchmarking Sound Understanding in Egocentric Videos

cs.CV · 2026-02-15 · unverdicted · novelty 8.0

EgoSound is a new benchmark with 7315 QA pairs across seven tasks to evaluate egocentric sound understanding in multimodal large language models.

VLRS-Bench: A Vision-Language Reasoning Benchmark for Remote Sensing

cs.CV · 2026-02-04 · unverdicted · novelty 8.0

VLRS-Bench is the first benchmark dedicated to complex vision-language reasoning in remote sensing, with 2000 QA pairs across 14 tasks in cognition, decision, and prediction dimensions.

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

cs.CV · 2026-01-15 · unverdicted · novelty 8.0

Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.

ConceptPose: Training-Free Zero-Shot Object Pose Estimation using Concept Vectors

cs.CV · 2025-12-09 · unverdicted · novelty 8.0

ConceptPose delivers state-of-the-art zero-shot relative pose estimation by matching open-vocabulary 3D concept vectors derived from VLM saliency maps, beating the strongest baseline by 62% in ADD(-S) without training.

ToG-Bench: Task-Oriented Spatio-Temporal Grounding in Egocentric Videos

cs.CV · 2025-12-03 · accept · novelty 8.0

ToG-Bench is the first benchmark for task-oriented spatio-temporal video grounding in egocentric videos, with explicit-implicit dual grounding and one-to-many object scenarios across 100 ScanNet clips and 2704 instructions.

citing papers explorer

Showing 50 of 916 citing papers.

WorldBench: A Challenging and Visually Diverse Multimodal Reasoning Benchmark cs.CV · 2026-06-04 · unverdicted · none · ref 98 · internal anchor
WorldBench is a visually diverse multimodal reasoning benchmark where the strongest of 15 tested MLLMs reaches only 64% accuracy.
Fine-grained Fragment Retrieval in Multi-modal Long-form Dialogues cs.CL · 2026-06-03 · unverdicted · none · ref 66 · internal anchor
Introduces FFR task, F2RVLM and FFRS models, and MLDR dataset for retrieving coherent multi-modal dialogue fragments, reporting superior performance on single-dialogue and corpus benchmarks.
HMARS: A Hierarchical Multi-Agent Memory System for Long-Context Reasoning cs.IR · 2026-06-03 · unverdicted · none · ref 79 · internal anchor
HMARS introduces a hierarchical multi-agent memory system that outperforms standard retrieval and other baselines on long-document and multi-turn reasoning tasks through improved evidence coverage.
Echo-Infinity: Learning Evolving Memory for Real-Time Infinite Video Generation cs.MM · 2026-06-03 · unverdicted · none · ref 10 · internal anchor
Echo-Infinity replaces handcrafted KV-cache schedules with end-to-end optimized Memory Queries and a Unified Relative RoPE recipe to support real-time infinite video generation in diffusion transformers.
Self-Evolving Deep Research via Joint Generation and Evaluation cs.CL · 2026-06-03 · unverdicted · none · ref 7 · internal anchor
SCORE is a shared-parameter co-evolutionary framework coupling generation and evaluation of deep research reports with a meta-harness to adapt evaluation standards as performance improves.
From Symbolic to Geometric: Enabling Spatial Reasoning in Large Language Models cs.LG · 2026-06-03 · unverdicted · none · ref 4 · internal anchor
SLM adds a dedicated spatial modality and training dataset to LLMs, enabling geometric spatial reasoning and outperforming prompt-based symbolic methods on the new SpatialEval benchmark.
Language Models Need Sleep: Learning to Self-Modify and Consolidate Memories cs.LG · 2026-06-02 · unverdicted · none · ref 111 · internal anchor
Language models can use a two-stage sleep process of upward distillation for memory consolidation and RL-based dreaming for unsupervised self-improvement to enable continual learning.
Entropy Is Not Enough: Unlocking Effective Reinforcement Learning for Visual Reasoning via Vision-Anchored Token Selection cs.AI · 2026-06-02 · unverdicted · none · ref 1 · internal anchor
VEPO improves RL for visual reasoning by multiplicatively coupling visual sensitivity with token entropy, outperforming entropy-only baselines by 2.28 points at 7B and 3.15 points at 3B scale.
SkillPyramid: A Hierarchical Skill Consolidation Framework for Self-Evolving Agents cs.AI · 2026-06-02 · unverdicted · none · ref 19 · internal anchor
SkillPyramid introduces a hierarchical skill consolidation framework with self-evolution, reporting 38% higher average reward and 27.7% fewer execution steps on ALFWorld, WebShop, and ScienceWorld across four models.
Foley-Omni: A Unified Multimodal Generation Model from Task-Level Audio Synthesis to Complete Video Soundtrack Generation cs.SD · 2026-06-02 · unverdicted · none · ref 1 · internal anchor
Foley-Omni extends isolated audio synthesis to joint generation of full video soundtracks across speech, effects, and music, with a new V2ST-Bench for evaluation showing competitive single-task results and gains in mixed-track consistency.
ThoughtFold: Folding Reasoning Chains via Introspective Preference Learning cs.AI · 2026-06-02 · unverdicted · none · ref 6 · internal anchor
ThoughtFold applies introspective redundancy detection within correct CoT trajectories to create sub-trajectory spectra, then uses masked preference optimization to penalize redundant explorations, yielding 56% token reduction on DeepSeek-R1-Distill-Qwen-7B while preserving accuracy.
Small RL Controller, Large Language Model: RL-Guided Adaptive Sampling for Test-Time Scaling cs.CL · 2026-06-02 · unverdicted · none · ref 108 · internal anchor
RL-trained lightweight controller using answer statistics improves trade-offs among correctness, latency, and total samples in adaptive sampling for LLM test-time scaling.
SimSD: Simple Speculative Decoding in Diffusion Language Models cs.CL · 2026-06-01 · unverdicted · none · ref 14 · internal anchor
SimSD adds a masking strategy to enable speculative decoding in diffusion LLMs, delivering up to 7.46x throughput gains on SDAR models while preserving generation quality.
InsightVQA: High-Dimensional Emotion-Cognitive Visual Question Answering Benchmark cs.CV · 2026-06-01 · unverdicted · none · ref 13 · internal anchor
The paper creates InsightVQA, a 725K QA-pair benchmark with perception, grounded-understanding, and cognition levels for emotion-cognitive visual question answering, plus a 30K-sample evaluation set and InsightNet baseline.
SentGuard: Sentence-Level Streaming Guardrails for Large Language Models cs.CL · 2026-06-01 · unverdicted · none · ref 13 · internal anchor
SentGuard achieves 90.5% detection of unsafe cases within two sentences at 7.41% false positive rate by operating at sentence boundaries during LLM streaming generation.
Scaling Agentic Capabilities via Grounded Interaction Synthesis cs.CL · 2026-06-01 · unverdicted · none · ref 9 · internal anchor
GAIS synthesizes diverse, high-fidelity agentic tasks from real-world MCP servers and adversarial planning, outperforming LLM-only baselines on BFCL, τ²-Bench, and ACEBench with greater data efficiency.
QoEReasoner: An Agentic Reasoning Framework for Automated and Explainable QoE Diagnosis in RANs cs.MA · 2026-06-01 · unverdicted · none · ref 9 · internal anchor
QoEReasoner is an agentic framework using LLMs with deterministic KPI tools, a domain knowledge base, and a historical case bank orchestrated by a stateful planner for automated, explainable QoE diagnosis in RANs, claiming 18-40% accuracy gains and reduction to 3-minute sessions on real datasets.
UniVocal: Unified Speech-Singing Code-Switching Synthesis cs.SD · 2026-06-01 · unverdicted · none · ref 37 · internal anchor
UniVocal presents a text-context-only framework for speech-singing code-switching synthesis via two-stage curriculum learning and a synthetic data pipeline, claiming SOTA on a new benchmark.
An Enigma of Artificial Reason: Investigating the Production-Evaluation Gap in Large Reasoning Models cs.AI · 2026-05-31 · conditional · none · ref 13 · internal anchor
LRMs show a large production-evaluation gap on the VAIR dataset with valid answers but invalid reasoning, driven by answer confirmation bias as evidenced by CoT analysis, linear probes, and causal patching.
TECCI: Tricky Edits of Collected and Curated Images cs.CV · 2026-05-31 · unverdicted · none · ref 13 · internal anchor
TECCI benchmark reveals that five leading text-guided image editing models all achieve under 22% overall success rate on challenging edits, with particular struggles on architecture, nature, reasoning, and creative tasks.
ProductWebGen: Benchmarking Multimodal Product Webpage Generation cs.CV · 2026-05-31 · unverdicted · none · ref 6 · internal anchor
Introduces ProductWebGen benchmark for multimodal product webpage generation, compares editing-based vs unified-model workflows on 500 samples, and releases ProductWebGen-1k SFT dataset.
FALAT: Tracing Failures in LLM Agent Trajectories via Dependency-Guided Search cs.AI · 2026-05-30 · unverdicted · none · ref 59 · internal anchor
FALAT improves failure attribution in LLM agent trajectories via dependency-guided search, achieving 46.0% step-level accuracy on algorithm-generated and 29.1% on hand-crafted trajectories in the Who&When benchmark.
Decomposed On-Policy Distillation for Vision-Language Reasoning: Steering Gradients for Visual Grounding cs.CV · 2026-05-30 · unverdicted · none · ref 5 · internal anchor
Decomposes VLM distillation loss into orthogonal language and visual components and introduces Visual Gradient Steering to prioritize visual grounding over standard monolithic optimization.
LaSR: Context-Aware Speech Recognition via Latent Reasoning cs.CL · 2026-05-30 · unverdicted · none · ref 7 · internal anchor
LaSR improves context-aware terminology recognition in speech LLMs by aligning latent CoT supervision on acoustic regions and introducing latent reasoning periods, shown on a new academic corpus to outperform standard fine-tuning without added latency.
InfoAtlas: A Foundation Model for Zero-Shot Statistical Dependence Estimate cs.LG · 2026-05-29 · unverdicted · none · ref 209 · internal anchor
InfoAtlas is a pretrained neural model for zero-shot mutual information estimation that matches state-of-the-art accuracy with 100x speedup and handles varying dimensions via a single model.
Learning to Construct Practical Agentic Systems cs.LG · 2026-05-29 · unverdicted · none · ref 33 · internal anchor
A modular agent framework with pseudo-tools and learned fixed workflows that are cheaper and more accurate than dynamic planning, plus multi-objective optimization for cost and quality.
DRIFT: Decoupled Rollouts and Importance-Weighted Fine-Tuning for Efficient Multi-Turn Optimization cs.LG · 2026-05-29 · unverdicted · none · ref 3 · internal anchor
DRIFT achieves multi-turn RL performance via offline importance-weighted SFT by leveraging the equivalence of KL-regularized RL to weighted supervised learning.
Task-Focused Memorization for Multimodal Agents cs.CV · 2026-05-29 · unverdicted · none · ref 12 · internal anchor
TaskMem uses RL in two phases to learn a task-focused memorization policy for multimodal agents, yielding 5.3-7.0% VQA accuracy gains on reformulated streaming benchmarks from VideoMME, EgoLife, and EgoTempo.
Planner-Centric Reinforcement Learning for Deep Research with Structure-Aware Reward cs.AI · 2026-05-29 · unverdicted · none · ref 5 · internal anchor
DecomposeR represents research plans as typed DAGs and uses two-stage planner-then-answerer RL to improve long-form research performance by 5.1-8.0 points over baselines.
Diversity Matters: Revisiting Test-Time Compute in Vision-Language Models cs.LG · 2026-05-29 · unverdicted · none · ref 2 · internal anchor
Entropy-based test-time compute (ETTC) in VLM ensembles outperforms majority voting by prioritizing high-confidence predictions from stronger models.
VLM3: Vision Language Models Are Native 3D Learners cs.CV · 2026-05-28 · unverdicted · none · ref 6 · internal anchor
Standard VLMs achieve expert-level 3D performance on depth estimation, pose estimation, and object understanding via three simple techniques without architecture changes or regression losses.
GenClaw: Code-Driven Agentic Image Generation cs.CV · 2026-05-28 · unverdicted · none · ref 11 · internal anchor
GenClaw introduces a three-stage code-driven workflow for agentic image generation that inserts programmatic sketches between linguistic reasoning and pixel synthesis.
CommunityFact: A Dynamic, Multilingual, Multi-domain Benchmark for Misinformation Detection in the Wild cs.CL · 2026-05-28 · unverdicted · none · ref 3 · internal anchor
CommunityFact provides a new dynamic benchmark showing web access improves LLM misinformation detection but source selection remains misaligned with human Community Notes raters.
From Blind Guess to Informed Judgment: Teaching LLMs to Evaluate Materials by Building Knowledge-Augmented Preference Signals cs.CL · 2026-05-28 · unverdicted · none · ref 27 · internal anchor
MaterEval generates paired informed and blind evaluations as preference signals to improve small open-source LLMs on high-entropy alloy assessment, approaching closed-source performance without external retrieval.
OmniRetrieval: Unified Retrieval across Heterogeneous Knowledge Sources cs.CL · 2026-05-28 · unverdicted · none · ref 1 · internal anchor
OmniRetrieval dispatches natural-language queries to native engines across text, relational and graph sources without collapsing them into one representation.
IPIBench: Evaluating Interactive Proactive Intelligence of MLLMs under Continuous Streams cs.CV · 2026-05-26 · unverdicted · none · ref 8 · internal anchor
IPIBench evaluates MLLMs on interactive proactive intelligence in streaming videos, identifies unstable triggering and poor coordination, and proposes the training-free IPI-Agent framework to improve performance across settings.
PersLitEval: Fine-grained Benchmark and Evaluation of LLMs on Persian Literature Questions cs.CL · 2026-05-26 · unverdicted · none · ref 2 · internal anchor
PersLitEval benchmark shows LLMs perform better on conceptual Persian literature tasks than spelling or word formation, with explained few-shot prompting yielding the strongest results across six models.
JuICE: A Benchmark for Evaluating LLM-Judge in Identifying Cultural Errors cs.CL · 2026-05-26 · unverdicted · none · ref 9 · internal anchor
JuICE is a new multilingual benchmark dataset showing top LLM judges reach only F1 0.52 on span-level cultural error detection and miss errors locals readily spot.
PRISM: A Multi-Dimensional Benchmark for Evaluating LLM Peer Reviewers cs.CL · 2026-05-26 · unverdicted · none · ref 40 · internal anchor
PRISM benchmark finds LLMs match or exceed humans on isolated review dimensions like novelty verification but none achieve the balanced performance of human reviewers across depth, flaw prioritization, and constructiveness.
DynFrame: Adaptive Reasoning-Driven Multimodal Framework with Dynamic Frame Augmentation for Complex Video Understanding cs.CV · 2026-05-26 · unverdicted · none · ref 4 · internal anchor
DynFrame introduces tokenized learnable span-density retrieval and Segment-Decoupled GRPO in video MLLMs, achieving competitive or SOTA results on six benchmarks with 4B and 8B models.
O-MARC: Omni Memory-Augmented Compression Distillation for Efficient Video Understanding cs.CV · 2026-05-26 · unverdicted · none · ref 5 · internal anchor
O-MARC is a compression distillation framework that lets compact omnimodal models maintain or exceed full-token performance on video QA while cutting latency and memory by about 35%.
InterSketch: An Interleaved Reasoning Model with Self-correcting Visual Sketch and Stepwise Reward cs.CV · 2026-05-26 · unverdicted · none · ref 36 · internal anchor
InterSketch improves long-horizon visual-textual chain-of-thought in VLMs by dynamically generating and interleaving self-correcting visual sketches with text, using a synthesized dataset plus reflection in cold-start followed by stepwise-reward RL, and reports outperforming Gemini-3-Pro on benchmar
InstructSAM: Segment Any Instance with Any Instructions cs.CV · 2026-05-25 · unverdicted · none · ref 14 · internal anchor
InstructSAM uses learnable queries in a VLM to condition SAM3 for single-pass multi-instance segmentation from arbitrary instructions, with a new Inst2Seg benchmark.
DRScaffold: Boosting Dense-Scene Reasoning in Lightweight Vision Language Models cs.CV · 2026-05-25 · unverdicted · none · ref 8 · internal anchor
New benchmark DRBench and four-stage supervision framework DRScaffold improve dense-scene reasoning in lightweight VLMs, with a 3B model surpassing a frozen 32B model on the benchmark while maintaining general performance.
StreamOV: Streaming Omni-Video Understanding via Evidence-Guided Memory and Response Triggering cs.CV · 2026-05-25 · unverdicted · none · ref 31 · internal anchor
StreamOV proposes evidence-guided long-short term memory and a hidden-state-driven trigger for efficient online audio-visual reasoning in streaming videos, along with the SOVBench benchmark for multi-turn evaluation.
MTLLFM: Multimodal-Temporal Laughter Localization: UR-FUNNY-Temporal and SMILE-Temporal Benchmarks with an Adaptive Multimodal Fusion Model cs.CV · 2026-05-25 · unverdicted · none · ref 30 · internal anchor
New temporally annotated laughter datasets and a weakly-supervised multimodal model using HuBERT and MAE encoders with adaptive gating achieve 99% F1 and improve downstream reasoning by 227% on CIDEr.
STREAM: A Data-Centric Framework for Mining High-Value Task-Oriented Dialogues from Streaming Media cs.CL · 2026-05-24 · unverdicted · none · ref 7 · internal anchor
Stream mines streaming media to create and release StreamDial, a dataset of 87,498 structured task-oriented dialogue sessions across automotive, restaurant, and hotel domains using persona construction, Conversational Blueprints, and RAG.
Reasoning to Align: Implicit Reasoning in Diffusion Transformers for Video Editing cs.CV · 2026-05-23 · unverdicted · none · ref 14 · internal anchor
RVEDiT improves DiT-based video editing by granularity-routed token conditioning and reference-anchored attention alignment to achieve better temporal coherence and localized edits.
FoodMonitor: Benchmarking MLLMs for Explainable Compliance Analysis cs.CV · 2026-05-23 · unverdicted · none · ref 9 · internal anchor
FoodMonitor benchmark evaluates MLLMs on explainable kitchen compliance analysis using dual-channel annotations and a composite C_score metric, with best model at 0.36.
SEAL: Synergistic Co-Evolution of Agents and Learning Environments cs.CL · 2026-05-23 · unverdicted · none · ref 51 · internal anchor
SEAL co-evolves LLM agents and environments via shared turn-level failure diagnoses, yielding +8.25 to +26.25 point gains on tool-use tasks with only 400 samples.
