super hub Mixed citations

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Eric Bieber, Gheorghe Comanici, Ice Pasupat, Inderjit Dhillon, Mike Schaekermann, Noveen Sachdeva · 2025 · cs.CL · arXiv 2507.06261

Mixed citation behavior. Most common role is background (55%).

925 Pith papers citing it

Background 55% of classified citations

open full Pith review browse 925 citing papers more from Eric Bieber arXiv PDF

abstract

In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements and Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost. Taken together, the Gemini 2.X model generation spans the full Pareto frontier of model capability vs cost, allowing users to explore the boundaries of what is possible with complex agentic problem solving.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 122 baseline 46 method 28 other 8 dataset 3

citation-polarity summary

background 114 baseline 47 use method 28 unclear 12 support 3 use dataset 3

claims ledger

abstract In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. G

authors

Eric Bieber Gheorghe Comanici Ice Pasupat Inderjit Dhillon Mike Schaekermann Noveen Sachdeva

co-cited works

representative citing papers

HKJudge: A Legal Discourse-Annotated Corpus for Interpreting What Courts Find, How They Reason, and What They Rule

cs.CL · 2026-06-04 · accept · novelty 8.0

HKJudge is a new ~290k-sentence expert-annotated corpus of Hong Kong criminal judgments with 26 rhetorical roles and 3 sentencing elements, plus benchmarks on classification and extraction tasks.

RRP-Voice: A Longitudinal Dataset and Benchmark for Recurrent Respiratory Papillomatosis Detection

eess.AS · 2026-06-01 · unverdicted · novelty 8.0

Introduces the first longitudinal voice dataset for RRP with benchmarks across handcrafted features, deep networks, self-supervised models, and audio LLMs under patient-level validation.

VideoFDB: Evaluating Full-Duplex Vision-Speech Capabilities in Conversational Agents

cs.CV · 2026-05-28 · unverdicted · novelty 8.0

VideoFDB is a new benchmark and LM-as-judge framework for evaluating full-duplex audio-visual-to-audio-visual conversational agents on nonverbal dynamics from real video calls.

EgoIntrospect: An Egocentric Dataset and Benchmark for User-Centric Internal State Reasoning

cs.CV · 2026-05-17 · unverdicted · novelty 8.0

EgoIntrospect provides the first egocentric dataset with self-annotations for internal state tasks and shows multimodal LLMs struggle to infer subjective states from combined signals.

Tracing Persona Vectors Through LLM Pretraining

cs.CL · 2026-05-13 · unverdicted · novelty 8.0

Persona vectors form within the first 0.22% of LLM pretraining and remain effective for steering post-trained models, with continued refinement and transfer to other models.

Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models

cs.AR · 2026-05-11 · conditional · novelty 8.0

Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.

Knowledge Poisoning Attacks on Medical Multi-Modal Retrieval-Augmented Generation

cs.CR · 2026-05-11 · unverdicted · novelty 8.0

M³Att poisons medical multimodal RAG by pairing covert textual misinformation with query-agnostic visual perturbations that increase retrieval of the bad content, causing LLMs to generate clinically plausible but incorrect responses.

Omni-DeepSearch: A Benchmark for Audio-Driven Omni-Modal Deep Search

cs.SD · 2026-05-09 · unverdicted · novelty 8.0

Omni-DeepSearch is a 640-sample benchmark for audio-driven omni-modal search where the best model reaches only 43.44% accuracy, exposing bottlenecks in audio inference, tool use, and cross-modal reasoning.

TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

cs.CV · 2026-05-08 · unverdicted · novelty 8.0

TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images

cs.CV · 2026-04-23 · unverdicted · novelty 8.0

S1-VL combines structured scientific reasoning with iterative image manipulation via code execution to reach state-of-the-art results on visual and scientific reasoning benchmarks.

Lost in Translation: Do LVLM Judges Generalize Across Languages?

cs.CL · 2026-04-21 · unverdicted · novelty 8.0

MM-JudgeBench shows substantial cross-lingual performance variance in 22 LVLM judges, with model size and architecture as poor predictors of multilingual robustness.

HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models

cs.SD · 2026-04-21 · unverdicted · novelty 8.0

HalluAudio is the first large-scale benchmark spanning speech, environmental sound, and music that uses human-verified QA pairs, adversarial prompts, and mixed-audio tests to measure hallucinations in large audio-language models.

When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models

cs.CV · 2026-04-19 · unverdicted · novelty 8.0

VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-specific experts and adaptive routing.

Narrative over Numbers: The Identifiable Victim Effect and its Amplification Under Alignment and Reasoning in Large Language Models

cs.CL · 2026-04-13 · conditional · novelty 8.0

Large language models display the identifiable victim effect at roughly twice the human baseline, strongly amplified by instruction tuning and chain-of-thought prompting but inverted by reasoning-specialized models.

MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark

cs.CV · 2026-04-12 · unverdicted · novelty 8.0 · 2 refs

MMRareBench provides 1,756 QA pairs and 7,958 images from PMC rare-disease cases to evaluate 23 MLLMs, revealing low treatment-planning scores and medical models underperforming general models on multi-image tasks due to capacity dilution.

HM-Bench: A Comprehensive Benchmark for Multimodal Large Language Models in Hyperspectral Remote Sensing

cs.CV · 2026-04-10 · accept · novelty 8.0

HM-Bench is the first benchmark for MLLMs on hyperspectral images, showing models struggle with complex spatial-spectral reasoning and perform better with visual PCA images than textual reports.

DialBGM: A Benchmark for Background Music Recommendation from Everyday Multi-Turn Dialogues

cs.AI · 2026-04-09 · unverdicted · novelty 8.0

DialBGM is a new benchmark dataset revealing that existing AI models fall far short of human performance when recommending fitting background music for open-domain conversations.

V2X-QA: A Comprehensive Reasoning Dataset and Benchmark for Multimodal Large Language Models in Autonomous Driving Across Ego, Infrastructure, and Cooperative Views

cs.RO · 2026-04-03 · conditional · novelty 8.0

V2X-QA provides a view-decoupled benchmark showing infrastructure views aid macroscopic traffic understanding while cooperative reasoning requires explicit cross-view alignment, with V2X-MoE as a routing-based baseline that improves performance.

ScreenParse: Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision

cs.CV · 2026-02-15 · conditional · novelty 8.0

ScreenParse dataset and ScreenVLM model deliver dense screen parsing that outperforms larger VLMs on PageIoU and transfers to better UI grounding.

EgoSound: Benchmarking Sound Understanding in Egocentric Videos

cs.CV · 2026-02-15 · unverdicted · novelty 8.0

EgoSound is a new benchmark with 7315 QA pairs across seven tasks to evaluate egocentric sound understanding in multimodal large language models.

VLRS-Bench: A Vision-Language Reasoning Benchmark for Remote Sensing

cs.CV · 2026-02-04 · unverdicted · novelty 8.0

VLRS-Bench is the first benchmark dedicated to complex vision-language reasoning in remote sensing, with 2000 QA pairs across 14 tasks in cognition, decision, and prediction dimensions.

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

cs.CV · 2026-01-15 · unverdicted · novelty 8.0

Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.

ConceptPose: Training-Free Zero-Shot Object Pose Estimation using Concept Vectors

cs.CV · 2025-12-09 · unverdicted · novelty 8.0

ConceptPose delivers state-of-the-art zero-shot relative pose estimation by matching open-vocabulary 3D concept vectors derived from VLM saliency maps, beating the strongest baseline by 62% in ADD(-S) without training.

ToG-Bench: Task-Oriented Spatio-Temporal Grounding in Egocentric Videos

cs.CV · 2025-12-03 · accept · novelty 8.0

ToG-Bench is the first benchmark for task-oriented spatio-temporal video grounding in egocentric videos, with explicit-implicit dual grounding and one-to-many object scenarios across 100 ScanNet clips and 2704 instructions.

citing papers explorer

Showing 50 of 925 citing papers.

InstructSAM: Segment Any Instance with Any Instructions cs.CV · 2026-05-25 · unverdicted · none · ref 14 · internal anchor
InstructSAM uses learnable queries in a VLM to condition SAM3 for single-pass multi-instance segmentation from arbitrary instructions, with a new Inst2Seg benchmark.
DRScaffold: Boosting Dense-Scene Reasoning in Lightweight Vision Language Models cs.CV · 2026-05-25 · unverdicted · none · ref 8 · internal anchor
New benchmark DRBench and four-stage supervision framework DRScaffold improve dense-scene reasoning in lightweight VLMs, with a 3B model surpassing a frozen 32B model on the benchmark while maintaining general performance.
StreamOV: Streaming Omni-Video Understanding via Evidence-Guided Memory and Response Triggering cs.CV · 2026-05-25 · unverdicted · none · ref 31 · internal anchor
StreamOV proposes evidence-guided long-short term memory and a hidden-state-driven trigger for efficient online audio-visual reasoning in streaming videos, along with the SOVBench benchmark for multi-turn evaluation.
MTLLFM: Multimodal-Temporal Laughter Localization: UR-FUNNY-Temporal and SMILE-Temporal Benchmarks with an Adaptive Multimodal Fusion Model cs.CV · 2026-05-25 · unverdicted · none · ref 30 · internal anchor
New temporally annotated laughter datasets and a weakly-supervised multimodal model using HuBERT and MAE encoders with adaptive gating achieve 99% F1 and improve downstream reasoning by 227% on CIDEr.
STREAM: A Data-Centric Framework for Mining High-Value Task-Oriented Dialogues from Streaming Media cs.CL · 2026-05-24 · unverdicted · none · ref 7 · internal anchor
Stream mines streaming media to create and release StreamDial, a dataset of 87,498 structured task-oriented dialogue sessions across automotive, restaurant, and hotel domains using persona construction, Conversational Blueprints, and RAG.
Reasoning to Align: Implicit Reasoning in Diffusion Transformers for Video Editing cs.CV · 2026-05-23 · unverdicted · none · ref 14 · internal anchor
RVEDiT improves DiT-based video editing by granularity-routed token conditioning and reference-anchored attention alignment to achieve better temporal coherence and localized edits.
FoodMonitor: Benchmarking MLLMs for Explainable Compliance Analysis cs.CV · 2026-05-23 · unverdicted · none · ref 9 · internal anchor
FoodMonitor benchmark evaluates MLLMs on explainable kitchen compliance analysis using dual-channel annotations and a composite C_score metric, with best model at 0.36.
SEAL: Synergistic Co-Evolution of Agents and Learning Environments cs.CL · 2026-05-23 · unverdicted · none · ref 51 · internal anchor
SEAL co-evolves LLM agents and environments via shared turn-level failure diagnoses, yielding +8.25 to +26.25 point gains on tool-use tasks with only 400 samples.
OnePred: Next-Query Prediction via Recursive Intent Memory in Multi-Turn Conversations cs.CL · 2026-05-22 · unverdicted · none · ref 36 · internal anchor
OnePred maintains a recursively updated intent memory and uses two-stage RL to predict next queries, cutting token use by up to 22x while outperforming baselines on a new NQP-Bench dataset.
Autonomous Frontier-Based Exploration with VLM Guidance cs.RO · 2026-05-22 · unverdicted · none · ref 23 · internal anchor
A VLM-based method for selecting exploration frontiers in robotics achieves up to 24% better map coverage than standard geometric heuristics in simulated indoor environments.
Cambrian-P: Pose-Grounded Video Understanding cs.CV · 2026-05-21 · unverdicted · none · ref 25 · internal anchor
Cambrian-P adds per-frame camera pose tokens and a regression head to video MLLMs, delivering 4.5-6.5% gains on spatial benchmarks, generalization to other video QA tasks, and SOTA streaming pose estimation on ScanNet.
GeoWeaver: Grounding Visual Tokens with Geometric Evidence before Scene Reasoning cs.CV · 2026-05-21 · unverdicted · none · ref 8 · internal anchor
GeoWeaver performs token-adaptive geometric grounding on visual tokens from a multi-level bank prior to language modeling to support better spatio-temporal reasoning.
MLLMs Know When Before Speaking: Revealing and Recovering Temporal Grounding via Attention Cues cs.CV · 2026-05-21 · unverdicted · none · ref 7 · internal anchor
MLLMs know event timing during prefill via sparse Temporal Grounding Heads but lose it in autoregressive decoding; restricting visual context to the high-attention interval at inference time improves VTG performance on three benchmarks.
RoadTones: Tone Controllable Text Generation from Road Event Videos cs.CV · 2026-05-20 · unverdicted · none · ref 3 · internal anchor
Presents RoadTones-51K dataset, RoadTones-VL-CoT model with tone-conditioned CoT, and RoadTones-Eval suite for controllable tone in road video captioning, supported by user study.
Towards Context-Invariant Safety Alignment for Large Language Models cs.CL · 2026-05-20 · unverdicted · none · ref 76 · internal anchor
Introduces AIR, an asymmetric regularization that anchors open-ended safety prompts to verifiable ones via stop-gradient, improving invariance and accuracy when combined with group preference optimization.
Sample Complexity of Transfer Learning: An Optimal Transport Approach stat.ML · 2026-05-19 · unverdicted · none · ref 8 · internal anchor
Transfer learning achieves sample complexity O(m^{-(α+1)/d}) for d>3 via optimal transport, outperforming direct learning's O(m^{-p/d}) when target models are not smooth.
ECUAS$_n$: A family of metrics for principled evaluation of uncertainty-augmented systems cs.AI · 2026-05-19 · unverdicted · none · ref 13 · 3 links · internal anchor
ECUAS_n is a parameterized family of proper scoring rules for jointly assessing prediction accuracy and uncertainty quality in automated decision systems.
Music of Changing Lines: Toward a Culturally Situated Approach to the I-Ching cs.MM · 2026-05-19 · unverdicted · none · ref 13 · internal anchor
An interactive system re-centers the I-Ching as a meaning-bearing framework by combining Wen Wang Fa coin casting, Gemini LLM interpretation of hexagrams, and Lyria generative music to produce responsive sonic realizations tied to user inquiry.
ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning cs.CV · 2026-05-19 · unverdicted · none · ref 5 · 2 links · internal anchor
ParaVT introduces the first multi-agent RL framework for parallel video tool calling in LMMs, using PARA-GRPO to resolve the Tool Prior Paradox and achieve +7.9% average improvement over Qwen3-VL baseline across six benchmarks.
VL-DPO: Vision-Language-Guided Finetuning for Preference-Aligned Autonomous Driving cs.CV · 2026-05-19 · unverdicted · none · ref 36 · internal anchor
VL-DPO uses a VLM as a zero-shot reasoner to generate preference pairs from pretrained model rollouts, then finetunes via DPO on the Waymo Open End-to-End Driving Dataset, yielding 11.94% higher rater feedback score and 10.01% lower average displacement error.
Attention-Guided Reward for Reinforcement Learning-based Jailbreak against Large Reasoning Models cs.AI · 2026-05-19 · unverdicted · none · ref 5 · internal anchor
An attention-guided RL reward combined with diverse persuasion strategies produces higher attack success rates against large reasoning models than prior jailbreak methods.
JUDO: A Juxtaposed Domain-Oriented Multimodal Reasoner for Industrial Anomaly QA cs.CV · 2026-05-19 · unverdicted · none · ref 42 · internal anchor
JUDO enhances large multimodal models for industrial anomaly QA by juxtaposing query images with normal ones for visual comparison and using SFT plus GRPO with tailored rewards to inject domain knowledge, outperforming Qwen2.5-VL-7B and GPT-4o on the MMAD benchmark.
What Makes Synthetic Data Effective in Image Segmentation cs.CV · 2026-05-19 · unverdicted · none · ref 6 · internal anchor
Dense scene composition and instance fidelity in synthetic diffusion images drive better segmentation performance; SENSE framework exploits this to improve models on Cityscapes, COCO, and ADE20K.
Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation cs.CV · 2026-05-18 · unverdicted · none · ref 5 · 2 links · internal anchor
Vision-OPD transfers an MLLM's privileged regional perception to its full-image policy through on-policy token-level self-distillation, yielding competitive results on fine-grained visual benchmarks.
Leveraging Latent Visual Reasoning in Silence cs.CV · 2026-05-18 · conditional · none · ref 5 · internal anchor
Latent visual reasoning improves multimodal models via training effects even without using latent tokens at inference, enabled by an attention-based RL reward that promotes interaction with text tokens.
Monitoring the Internal Monologue: Probe Trajectories Reveal Reasoning Dynamics cs.CL · 2026-05-18 · unverdicted · none · ref 13 · internal anchor
Probe trajectories across token positions in LRMs, combined with signal-processing features, improve prediction of future model outputs over static probes on safety and math tasks.
Acoustic Interference: A New Paradigm Weaponizing Acoustic Latent Semantic for Universal Jailbreak against Large Audio Language Models cs.CR · 2026-05-18 · unverdicted · none · ref 11 · internal anchor
AIA generates universal interference audio infused with Acoustic Latent Semantics to bypass LALM safety alignment, achieving SOTA attack success rates on 10 models across five datasets.
Whispers in the Noise: Surrogate-Guided Concept Awakening via a Multi-Agent Framework cs.AI · 2026-05-18 · unverdicted · none · ref 3 · internal anchor
ConceptAgent is a black-box multi-agent system that awakens erased concepts in diffusion models by initializing denoising trajectories from surrogate-guided noisy states.
How Good LLMs Are at Answering Bangla Medical Visual Questions? Dataset and Benchmarking cs.CL · 2026-05-18 · unverdicted · none · ref 29 · internal anchor
Introduces BanglaMedVQA dataset of clinically validated image-question-answer pairs and benchmarks foundation models, finding substantially lower performance than on English MedVQA especially on diagnostic questions.
See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding cs.CV · 2026-05-18 · unverdicted · none · ref 13 · internal anchor
SWIM aligns cross-attention maps from object nouns to ground-truth masks during training on the new NL-Refer dataset to enable text-only fine-grained video object understanding in MLLMs.
AutoVecCoder: Teaching LLMs to Generate Explicitly Vectorized Code cs.CL · 2026-05-18 · unverdicted · none · ref 62 · internal anchor
AutoVecCoder combines VecPrompt for automated intrinsic knowledge synthesis and VecRL for efficiency-aligned RL to train an 8B LLM that achieves SOTA on SimdBench SSE/AVX subsets and sometimes exceeds -O3 compiler results.
FinDocMRE: A Benchmark for Document-Level Financial Multimodal Reasoning Evaluation cs.CE · 2026-05-18 · unverdicted · none · ref 1 · internal anchor
FinDocMRE is a new multi-image document-level benchmark spanning 12 financial domains and 5 task types, showing that 11 tested LMMs all score below 65 overall with particular weaknesses in numerical estimation and cross-page grounding.
Multi-agent AI systems outperform human teams in creativity cs.CL · 2026-05-18 · unverdicted · none · ref 47 · internal anchor
Multi-agent LLM teams outperform human teams in creativity (d=1.50) across tasks by producing more novel ideas, with distinct semantic exploration patterns predicting success for each group.
VerifyMAS: Hypothesis Verification for Failure Attribution in LLM Multi-Agent Systems cs.CL · 2026-05-17 · unverdicted · none · ref 5 · internal anchor
VerifyMAS improves failure attribution in LLM multi-agent systems via hypothesis verification on full trajectories, error taxonomy-based data construction, and fine-tuned verifier models, outperforming prior direct-prediction methods on Aegis-Bench and Who&When.
Dual-Diffusional Generative Fashion Recommendation cs.IR · 2026-05-17 · unverdicted · none · ref 37 · internal anchor
DualFashion introduces a dual-diffusion Transformer with image and text branches that generates both visual items and semantic descriptions for explainable personalized fashion recommendation.
Unlocking Biological Workflows for Robust Protein-Text Question Answering: A Dual-Dimensional RAG Framework cs.IR · 2026-05-17 · unverdicted · none · ref 2 · internal anchor
2D-ProteinRAG is a dual-dimensional RAG framework that incorporates BLAST workflows plus horizontal attribute alignment and vertical homology denoising to improve protein-text QA on both in-distribution and out-of-distribution cases.
Artificial Intolerance: Stigmatizing Language in Clinical Documentation Skews Large Language Model Decision-Making cs.CL · 2026-05-17 · unverdicted · none · ref 24 · internal anchor
Frontier LLMs exhibit bias from stigmatizing language in clinical vignettes across four conditions, skewing decisions toward less aggressive management, with limited mitigation from Chain-of-Thought or self-debiasing prompts.
Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps cs.CL · 2026-05-16 · unverdicted · none · ref 3 · 2 links · internal anchor
RTPurbo converts full-attention LLMs to sparse attention by retaining full KV for retrieval heads and using a low-dimensional dynamic indexer, achieving near-lossless accuracy after minimal adaptation.
MAVEN A Multi-Agent Framework for Multicultural Text-to-Video Generation cs.CV · 2026-05-16 · unverdicted · none · ref 19 · internal anchor
MAVEN introduces a multi-agent system for refining prompts in multicultural text-to-video generation and releases a benchmark of 243 prompts and 972 videos showing improved cultural relevance via parallel agent specialization.
PrivScope: Task-scoped Disclosure Control for Hybrid Agentic Systems cs.CR · 2026-05-15 · unverdicted · none · ref 51 · internal anchor
PrivScope enforces task-scoped disclosure at the local-cloud boundary in hybrid agents, eliminating profile leakage and halving re-identification risk on medical workflows while preserving task success.
Controlla: Learning Controllability via Graph-Constrained Latent Geometry cs.CV · 2026-05-15 · unverdicted · none · ref 13 · internal anchor
Controlla learns identity and attribute factors from multimodal inputs and aligns them with graph priors using graph-constrained optimal transport to enforce consistent attribute trajectories while preserving reference identity.
VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation cs.CV · 2026-05-15 · unverdicted · none · ref 8 · internal anchor
VideoSeeker integrates agentic reasoning and visual prompts into LVLMs via automated data synthesis, cold-start supervision, and RL training, yielding +13.7% gains on instance-level video tasks over baselines including GPT-4o.
From Failure to Feedback: Group Revision Unlocks Hard Cases in Object-Level Grounding cs.CV · 2026-05-15 · unverdicted · none · ref 16 · internal anchor
A group-revision paradigm for GRPO-based RL fine-tuning of VLMs converts failure responses into improvement signals that refine rewards and advantages, yielding gains on referring segmentation, REC, and counting benchmarks.
Agentic Discovery of Neural Architectures: AIRA-Compose and AIRA-Design cs.AI · 2026-05-15 · unverdicted · none · ref 7 · internal anchor
Multi-agent LLM systems discover new Transformer and hybrid architectures that outperform Llama 3.2 at 1B scale and approach human SOTA on long-range benchmarks.
Video Models Can Reason with Verifiable Rewards cs.CV · 2026-05-14 · unverdicted · none · ref 6 · internal anchor
VideoRLVR uses SDE-GRPO optimization, dense decomposed rewards, and Early-Step Focus to train video diffusion models on verifiable reasoning tasks, outperforming supervised fine-tuning and other video generators on Maze, FlowFree, and Sokoban.
LPDS: Evaluating LLM Robustness Through Logic-Preserving Difficulty Scaling cs.LG · 2026-05-14 · conditional · none · ref 9 · internal anchor
LPDS quantifies difficulty of logic-preserving problem variations and searches for the hardest ones, producing up to 5x larger performance drops than random sampling and better robustness gains from fine-tuning on difficult examples.
OpenDeepThink: Parallel Reasoning via Bradley-Terry Aggregation cs.AI · 2026-05-14 · conditional · none · ref 4 · 2 links · internal anchor
OpenDeepThink uses Bradley-Terry aggregation of LLM pairwise judgments to rank and evolve parallel reasoning traces, improving Gemini 3.1 Pro Codeforces Elo by 405 points over eight rounds.
Unlocking Complex Visual Generation via Closed-Loop Verified Reasoning cs.CV · 2026-05-14 · unverdicted · none · ref 15 · internal anchor
CLVR framework adds closed-loop visual verification, proxy prompt reinforcement learning, and delta-space weight merge to improve complex text-to-image generation over single-step or unverified multi-step baselines.
OmniDrop: Layer-wise Token Pruning for Omni-modal LLMs via Query-Guidance cs.AI · 2026-05-14 · unverdicted · none · ref 4 · internal anchor
OmniDrop is a training-free layer-wise token pruning framework for omni-modal LLMs that uses query guidance and temporal diversity to reduce prefill latency by up to 40% and memory by 14.7% while improving benchmark scores by up to 3.58 points.
Reinforcement Learning with Semantic Rewards Enables Low-Resource Language Expansion without Alignment Tax cs.CL · 2026-05-14 · unverdicted · none · ref 28 · internal anchor
Reinforcement learning with semantic rewards lets LLMs gain low-resource language skills without the alignment tax that degrades general capabilities in supervised fine-tuning.

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer