Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Eric Bieber, Gheorghe Comanici, Ice Pasupat, Inderjit Dhillon, Mike Schaekermann, Noveen Sachdeva · 2025 · cs.CL · arXiv 2507.06261

Mixed citation behavior. Most common role is background (55%).

994 Pith papers citing it

Background 55% of classified citations

abstract

In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements and Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost. Taken together, the Gemini 2.X model generation spans the full Pareto frontier of model capability vs cost, allowing users to explore the boundaries of what is possible with complex agentic problem solving.

citation-role summary

background 122 · baseline 46 · method 28 · other 8 · dataset 3

citation-polarity summary

background 114 · baseline 47 · use method 28 · unclear 12 · support 3 · use dataset 3
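
The percentage shares behind these two summaries can be recomputed from the counts. The following is a minimal sketch, not the hub's own calculation: the dictionary names and the `shares` helper are hypothetical, and the counts are copied from the summaries above. Under that reading, the 55% background figure quoted earlier matches the polarity counts rather than the role counts, which is an inference from the arithmetic and not something the page states.

```python
# Counts copied from the citation-role and citation-polarity summaries.
# Variable and function names are illustrative assumptions.
role_counts = {
    "background": 122, "baseline": 46, "method": 28, "other": 8, "dataset": 3,
}
polarity_counts = {
    "background": 114, "baseline": 47, "use method": 28,
    "unclear": 12, "support": 3, "use dataset": 3,
}


def shares(counts):
    """Return each label's percentage of the total, rounded to one decimal."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


role_shares = shares(role_counts)
polarity_shares = shares(polarity_counts)

# Both summaries cover the same 207 classified citations.
print(sum(role_counts.values()), role_shares["background"])          # 207 58.9
print(sum(polarity_counts.values()), polarity_shares["background"])  # 207 55.1
```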

representative citing papers

EHRNote-ChatQA: A Benchmark for Evidence-Grounded Multi-Turn Clinical Question Answering over Longitudinal Discharge Summaries

cs.CL · 2026-06-14 · unverdicted · novelty 8.0

EHRNote-ChatQA is the first benchmark for evidence-grounded multi-turn clinical QA over longitudinal discharge summaries, containing 16,072 medical-expert-verified pairs across eight categories and revealing LLM weaknesses in evidence grounding and multi-turn consistency.

HKJudge: A Legal Discourse-Annotated Corpus for Interpreting What Courts Find, How They Reason, and What They Rule

cs.CL · 2026-06-04 · accept · novelty 8.0

HKJudge is a new ~290k-sentence expert-annotated corpus of Hong Kong criminal judgments with 26 rhetorical roles and 3 sentencing elements, plus benchmarks on classification and extraction tasks.

RRP-Voice: A Longitudinal Dataset and Benchmark for Recurrent Respiratory Papillomatosis Detection

eess.AS · 2026-06-01 · unverdicted · novelty 8.0

Introduces the first longitudinal voice dataset for RRP with benchmarks across handcrafted features, deep networks, self-supervised models, and audio LLMs under patient-level validation.

VideoFDB: Evaluating Full-Duplex Vision-Speech Capabilities in Conversational Agents

cs.CV · 2026-05-28 · unverdicted · novelty 8.0

VideoFDB is a new benchmark and LM-as-judge framework for evaluating full-duplex audio-visual-to-audio-visual conversational agents on nonverbal dynamics from real video calls.

EgoIntrospect: An Egocentric Dataset and Benchmark for User-Centric Internal State Reasoning

cs.CV · 2026-05-17 · unverdicted · novelty 8.0

EgoIntrospect provides the first egocentric dataset with self-annotations for internal state tasks and shows multimodal LLMs struggle to infer subjective states from combined signals.

Tracing Persona Vectors Through LLM Pretraining

cs.CL · 2026-05-13 · unverdicted · novelty 8.0

Persona vectors form within the first 0.22% of LLM pretraining and remain effective for steering post-trained models, with continued refinement and transfer to other models.

Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models

cs.AR · 2026-05-11 · conditional · novelty 8.0

Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.

Knowledge Poisoning Attacks on Medical Multi-Modal Retrieval-Augmented Generation

cs.CR · 2026-05-11 · unverdicted · novelty 8.0

M³Att poisons medical multimodal RAG by pairing covert textual misinformation with query-agnostic visual perturbations that increase retrieval of the bad content, causing LLMs to generate clinically plausible but incorrect responses.

Omni-DeepSearch: A Benchmark for Audio-Driven Omni-Modal Deep Search

cs.SD · 2026-05-09 · unverdicted · novelty 8.0

Omni-DeepSearch is a 640-sample benchmark for audio-driven omni-modal search where the best model reaches only 43.44% accuracy, exposing bottlenecks in audio inference, tool use, and cross-modal reasoning.

TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

cs.CV · 2026-05-08 · unverdicted · novelty 8.0

TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images

cs.CV · 2026-04-23 · unverdicted · novelty 8.0

S1-VL combines structured scientific reasoning with iterative image manipulation via code execution to reach state-of-the-art results on visual and scientific reasoning benchmarks.

Lost in Translation: Do LVLM Judges Generalize Across Languages?

cs.CL · 2026-04-21 · unverdicted · novelty 8.0

MM-JudgeBench shows substantial cross-lingual performance variance in 22 LVLM judges, with model size and architecture as poor predictors of multilingual robustness.

HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models

cs.SD · 2026-04-21 · unverdicted · novelty 8.0

HalluAudio is the first large-scale benchmark spanning speech, environmental sound, and music that uses human-verified QA pairs, adversarial prompts, and mixed-audio tests to measure hallucinations in large audio-language models.

When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models

cs.CV · 2026-04-19 · unverdicted · novelty 8.0

VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-specific experts and adaptive routing.

Narrative over Numbers: The Identifiable Victim Effect and its Amplification Under Alignment and Reasoning in Large Language Models

cs.CL · 2026-04-13 · conditional · novelty 8.0

Large language models display the identifiable victim effect at roughly twice the human baseline, strongly amplified by instruction tuning and chain-of-thought prompting but inverted by reasoning-specialized models.

MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark

cs.CV · 2026-04-12 · unverdicted · novelty 8.0 · 2 refs

MMRareBench provides 1,756 QA pairs and 7,958 images from PMC rare-disease cases to evaluate 23 MLLMs, revealing low treatment-planning scores and medical models underperforming general models on multi-image tasks due to capacity dilution.

HM-Bench: A Comprehensive Benchmark for Multimodal Large Language Models in Hyperspectral Remote Sensing

cs.CV · 2026-04-10 · accept · novelty 8.0

HM-Bench is the first benchmark for MLLMs on hyperspectral images, showing models struggle with complex spatial-spectral reasoning and perform better with visual PCA images than textual reports.

DialBGM: A Benchmark for Background Music Recommendation from Everyday Multi-Turn Dialogues

cs.AI · 2026-04-09 · unverdicted · novelty 8.0

DialBGM is a new benchmark dataset revealing that existing AI models fall far short of human performance when recommending fitting background music for open-domain conversations.

V2X-QA: A Comprehensive Reasoning Dataset and Benchmark for Multimodal Large Language Models in Autonomous Driving Across Ego, Infrastructure, and Cooperative Views

cs.RO · 2026-04-03 · conditional · novelty 8.0

V2X-QA provides a view-decoupled benchmark showing infrastructure views aid macroscopic traffic understanding while cooperative reasoning requires explicit cross-view alignment, with V2X-MoE as a routing-based baseline that improves performance.

ScreenParse: Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision

cs.CV · 2026-02-15 · conditional · novelty 8.0

ScreenParse dataset and ScreenVLM model deliver dense screen parsing that outperforms larger VLMs on PageIoU and transfers to better UI grounding.

EgoSound: Benchmarking Sound Understanding in Egocentric Videos

cs.CV · 2026-02-15 · unverdicted · novelty 8.0

EgoSound is a new benchmark with 7315 QA pairs across seven tasks to evaluate egocentric sound understanding in multimodal large language models.

VLRS-Bench: A Vision-Language Reasoning Benchmark for Remote Sensing

cs.CV · 2026-02-04 · unverdicted · novelty 8.0

VLRS-Bench is the first benchmark dedicated to complex vision-language reasoning in remote sensing, with 2000 QA pairs across 14 tasks in cognition, decision, and prediction dimensions.

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

cs.CV · 2026-01-15 · unverdicted · novelty 8.0

Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.

ConceptPose: Training-Free Zero-Shot Object Pose Estimation using Concept Vectors

cs.CV · 2025-12-09 · unverdicted · novelty 8.0

ConceptPose delivers state-of-the-art zero-shot relative pose estimation by matching open-vocabulary 3D concept vectors derived from VLM saliency maps, beating the strongest baseline by 62% in ADD(-S) without training.

citing papers explorer

Showing 50 of 994 citing papers.

Latent Confidence Alignment for LLM Self-Assessment
cs.CY · 2026-06-20 · unverdicted · none · ref 38 · internal anchor
LCAE is introduced as a Rasch-model metric that aligns LLM self-reported confidence with latent error probability derived from ability and item difficulty, shown to improve calibration on a medical dataset across 20 models.

DataClaw0: Agentic Tailoring Multimodal Data from Raw Streams
cs.LG · 2026-06-19 · unverdicted · none · ref 7 · internal anchor
DataClaw0 introduces an agentic data-tailoring paradigm, a 9B model trained on a synthetically generated dataset, and a new benchmark, claiming improved downstream adaptation in video generation, VQA, and GUI navigation under limited data.

Sakana Fugu Technical Report
cs.LG · 2026-06-19 · unverdicted · none · ref 252 · internal anchor
Sakana Fugu trains LLM orchestrators using fine-tuning, evolutionary algorithms, and RL to build query-adaptive multi-agent scaffolds, claiming SOTA results on benchmarks including SWE-Bench Pro and GPQA-Diamond.

CogniRoute: Learning to Route Social Evidence in Omni-Modal Models
cs.CV · 2026-06-18 · unverdicted · none · ref 116 · internal anchor
CogniRoute adds a cognitive schema and route-aware RL to an omni-modal MoE, reaching 59.38% accuracy on a new 118K-example social video QA benchmark and beating prior baselines by 15-27 points.

PromptMark: A Prompt-Guided Iterative-Feedback Framework for Source Code Watermarking
cs.CR · 2026-06-18 · unverdicted · none · ref 23 · internal anchor
PromptMark is a black-box prompt-guided iterative-feedback framework that embeds statistically detectable watermarks in LLM-generated source code via naming patterns while preserving functional correctness.

Multi-View Decompilation for LLM-Based Malware Classification
cs.CR · 2026-06-18 · unverdicted · none · ref 17 · internal anchor
Multi-decompiler prompting improves LLM malware classification F1 by supplying complementary views of the same binary.

Qiskit Code Migration with LLMs
cs.SE · 2026-06-18 · unverdicted · none · ref 141 · internal anchor
A taxonomy-guided RAG system with LLMs reduces hallucinations and improves migration suggestions for Qiskit code compared to unconstrained retrieval.

LLMs Struggle to Measure What Distinguishes Students of Different Proficiency Levels: A Study of Item Discrimination in Reading Comprehension Assessment
cs.CL · 2026-06-17 · unverdicted · none · ref 12 · internal anchor
LLMs achieve maximum Spearman correlations of 0.152 (direct) and 0.241 (response-based) with human item discrimination values, showing non-random but unreliable signal for distinguishing student proficiency.

WEQA: Wearable hEalth Question Answering with Query-Adaptive Agentic Reasoning
cs.AI · 2026-06-16 · unverdicted · none · ref 14 · internal anchor
WEQA proposes a query-adaptive agent framework combining LLMs with wearable data tools, achieving 24% higher accuracy than baselines on a benchmark from four open datasets, with gains in expert-rated usefulness.

From Drift to Coherence: Stabilizing Beliefs in LLMs
cs.LG · 2026-06-16 · unverdicted · none · ref 23 · internal anchor
In multiple-choice QA, LLM beliefs drift early under repeated sampling but self-stabilize; seed-answer prompting and a self-consistency loss reduce drift while preserving accuracy.

EnvRL: Learn from Environment Dynamics in Agentic Reinforcement Learning
cs.LG · 2026-06-16 · unverdicted · none · ref 4 · internal anchor
EnvRL incorporates environment dynamics learning via state prediction and inverse dynamics auxiliary objectives into agentic RL, reporting higher success rates than RL-only baselines on ALFWorld and WebShop.

Evaluating Pluralism in LLMs through Latent Perspectives
cs.CL · 2026-06-11 · unverdicted · none · ref 33 · internal anchor
A domain-agnostic framework extracts perspectives from book reviews showing LLMs underrepresent rarer viewpoints relative to human text.

Constrained Semantic Decompression in LLMs through Persian Proverb-Conditioned Story Generation
cs.CL · 2026-06-10 · unverdicted · none · ref 1 · internal anchor
Introduces PAND dataset for Persian proverbs and reports a persistent decompression gap in LLMs that explicit reasoning partially reduces.

APEX: Automated Prompt Engineering eXpert with Dynamic Data Selection
cs.CL · 2026-06-09 · unverdicted · none · ref 3 · internal anchor
APEX dynamically tiers data into Easy/Hard/Mixed based on optimization lineage and prioritizes Mixed examples, reporting 11.2% and 6.8% average gains over baseline prompts on two models under a 5,000-call budget.

GlobeAudio: A Multilingual Multicultural Benchmark for Naturalistic Evaluation of Large Audio-Language Models
cs.CL · 2026-06-06 · unverdicted · none · ref 54 · internal anchor
GlobeAudio is a new multilingual multicultural benchmark for naturalistic evaluation of large audio-language models, showing performance gaps especially for open-source models and low-resource languages.

StainFlow: Entity-Stain Tracking and Evidence Linking for Process Rewards in GUI Agents
cs.AI · 2026-06-05 · unverdicted · none · ref 6 · internal anchor
StainFlow proposes global entity stain tracking and local stain evidence linking modules to improve process rewards for GUI agents, reporting 3.2% relative gain in online RL success and 1.8% in judgment accuracy on AndroidWorld and OGRBench.

Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models
cs.CV · 2026-06-04 · unverdicted · none · ref 6 · internal anchor
GeoVR distills camera pose, depth, scale, and multi-scale 3D features from pre-trained models into MLLMs via video supervision to improve spatial reasoning.

LongSpace: Exploring Long-Horizon Spatial Memory from Perception to Recall in Video
cs.CV · 2026-06-04 · unverdicted · none · ref 34 · internal anchor
Presents LongSpace-Bench benchmark and LongSpace framework that chunks long videos, adds 3D structural cues, and builds layer-aware memory to improve spatial reasoning in multimodal LLMs.

InstantRetouch: Efficient and High-Fidelity Instruction-Guided Image Retouching with Bilateral Space
cs.CV · 2026-06-03 · unverdicted · none · ref 6 · internal anchor
InstantRetouch performs efficient high-fidelity language-guided retouching via bilateral grid prediction of affine transforms combined with variational score distillation from diffusion models.

Rethinking Continual Experience Internalization for Self-Evolving LLM Agents
cs.CL · 2026-06-03 · unverdicted · none · ref 49 · internal anchor
Existing methods for turning LLM interaction experience into parametric skills collapse over multiple iterations; principle-level experience, step-wise injection, and off-policy teacher distillation yield more stable continual learning.

VCIFBench: Evaluating Complex Instruction Following for Video Understanding
cs.CL · 2026-06-03 · unverdicted · none · ref 1 · internal anchor
VCIFBench provides 306 test instructions, a 540-pair DPO dataset, and a conflict diagnostic set to evaluate complex constraint satisfaction in video MLLMs, finding it challenging and showing DPO training helps.

GeoMin: Data-Efficient Semi-Supervised RLVR via Geometric Distribution Modeling
cs.LG · 2026-06-03 · unverdicted · none · ref 3 · internal anchor
GeoMin uses geometric distribution modeling on labeled data to assess self-reward reliability, enabling better performance in semi-supervised RLVR with only 10% of typical annotations.

Perceive Before Reasoning: A Pre-Reasoning Perception Framework for Efficient and Reliable Proactive Mobile Agents
cs.AI · 2026-06-02 · unverdicted · none · ref 46 · internal anchor
PRPF uses a lightweight Multimodal Proactive Perceptor for intervention gating and context compression, activating the Proactive Agent Reasoner only when needed, reducing false trigger rates and improving efficiency on the ProactiveMobile benchmark.

Iteris: Agentic Research Loops for Computational Mathematics
cs.AI · 2026-06-01 · unverdicted · none · ref 7 · internal anchor
Iteris, an agentic research system, produced evidence and drafts for two open computational math problems that were verified after human correction.

MT-EditFlow: Reinforcement Learning for Multi-Turn Image Editing with Flow Matching
cs.CV · 2026-06-01 · unverdicted · none · ref 10 · internal anchor
MT-EditFlow applies flow-matching RL with multi-reward aggregation to improve multi-turn image editing performance on models like FLUX.1-Kontext-dev by 6.85 points at turn-3.

Enhancing the Socioeconomic Understanding of Foundation Models with Urban Mobility
cs.SI · 2026-06-01 · unverdicted · none · ref 47 · internal anchor
MobFusion fuses mobility networks into foundation models via three designs and reports improved performance on income, density, and crime prediction tasks using data from three U.S. metropolitan areas.

Agent Skills Should Go Beyond Text: The Case for Visual Skills
cs.CV · 2026-05-31 · unverdicted · none · ref 7 · internal anchor
The paper proposes that reusable agent skills should incorporate visual elements alongside text, introduces three forms of visual skills and an automatic conversion system, and reports better performance on GUI and visual-centric tasks.

On the Generalization Gap in Self-Evolving Language Model Reasoning
cs.CL · 2026-05-31 · unverdicted · none · ref 5 · internal anchor
Closed-loop self-evolution on LLMs improves reasoning on Knights and Knaves tasks but plateaus short of oracle-supervised levels, with multi-turn revision nearly matching it for large models.

Make Your VLA More Robust Without More Data By Interleaving Motion Planning
cs.RO · 2026-05-31 · unverdicted · none · ref 32 · internal anchor
MPVI interleaves model-based motion planning with VLAs via VLM completion checking to achieve 113% higher task progress on BEHAVIOR-1K without extra data.

I-WebGenBench: Evaluating Interactivity in LLM-Generated Scientific Web Applications
cs.CL · 2026-05-30 · unverdicted · none · ref 10 · internal anchor
A Paper-to-Interactive-System Agent and I-WebGenBench benchmark with 19 papers enable converting scientific PDFs into executable interactive web systems, with PaperVoyager framework shown to improve quality.

MESA: Improving MoE Safety Alignment via Decentralized Expertise
cs.LG · 2026-05-30 · unverdicted · none · ref 10 · internal anchor
MESA decentralizes safety duties in MoE LLMs via expert capacity reallocation and dynamic routing refinement based on optimal transport theory, yielding robust defense on harmful benchmarks while preserving helpfulness.

Linear Scaling Video VLMs for Long Video Understanding
cs.CV · 2026-05-29 · unverdicted · none · ref 17 · internal anchor
StateKV is an inference-time technique that replaces quadratic self-attention prefill in video VLMs with a fixed-capacity importance-based recurrent state, keeping accuracy near full attention on long-video benchmarks without retraining.

DynaTree: Dynamic Agentic Retrieval Tree for Time-Sensitive News Retrieval
cs.IR · 2026-05-29 · unverdicted · none · ref 7 · internal anchor
DynaTree separates offline agentic tree construction from online subtree selection to deliver better recall, ranking, and production survival rates than standard or prior agentic RAG for news retrieval.

TARIC: Memory-Augmented Traversability-Aware Outdoor VLN under Interrupted Semantic Cues
cs.RO · 2026-05-29 · unverdicted · none · ref 23 · internal anchor
TARIC maintains traversability-consistent guidance using 3D cue memory during semantic cue interruptions in outdoor VLN, improving success rates on long routes.

Archon: A Unified Multimodal Model for Holistic Digital Human Generation
cs.CV · 2026-05-28 · unverdicted · none · ref 11 · internal anchor
Archon unifies seven modalities via modality-specific tokenizers and an autoregressive backbone pretrained on 72 tasks, plus a 4x-efficient video reparameterization and stepwise 'Thinking in Modality' procedure, and reports superior or comparable results on digital-human tasks.

Grounded 3D-Aware Spatial Vision-Language Modeling
cs.CV · 2026-05-28 · unverdicted · none · ref 105 · internal anchor
GR3D is a VLM that combines explicit 2D, implicit 2D, and monocular 3D grounding mechanisms to improve performance on spatial understanding benchmarks.

DocRetriever: A Plug-and-Play Framework for Multimodal Document Retrieval with Comprehensive Benchmark
cs.CV · 2026-05-28 · unverdicted · none · ref 14 · internal anchor
DocRetriever introduces a framework using layout-aware sparse embeddings for hybrid encoding without OCR and a generalizable reasoning-augmented reranker for few-shot settings, plus the MultiDocR benchmark for evaluation.

HTAM: Hierarchical Transition-Attended Memory for Operator Optimization
cs.CL · 2026-05-28 · unverdicted · none · ref 8 · internal anchor
HTAM builds a Hierarchical Transition Graph to organize coarse global directions and detailed local strategies for guiding LLM-based CUDA kernel optimization, improving results on KernelBench.

GEM: Generative Supervision Helps Embodied Intelligence
cs.CV · 2026-05-27 · unverdicted · none · ref 15 · internal anchor
GEM adds generative depth supervision to VLM pre-training and reports improved results on embodied benchmarks plus real-world robot execution.

Escape the Language Prior: Mitigating Late-Stage Modality Collapse in Audio Reasoning via Modality-Aware Policy Optimization
cs.CL · 2026-05-26 · unverdicted · none · ref 22 · internal anchor
MAPO is a dual-branch RL framework using modality relevance masks from cross-modal differential entropy and auxiliary attention losses to reduce late-stage modality collapse in audio reasoning models and improve benchmark results.

The Future of Facts: Tracing the Factual Generation-Verification Gap
cs.CL · 2026-05-26 · unverdicted · none · ref 60 · internal anchor
Empirical tracing across model families shows verification precedes and outlasts generation for facts, with updates producing simultaneous verification of old and new answers.

WhoSaidIt: Human-LLM Collaborative Annotation for Text-Based Multilingual Speaker-Attribute Classification
cs.CL · 2026-05-25 · unverdicted · none · ref 4 · internal anchor
Authors develop a human-LLM collaborative annotation framework and construct the WhoSaidIt multilingual dataset for nine speaker-attribute labels, revealing cross-lingual annotation differences and LLM limitations.

Extending Embodied Question Answering from Perception to Decision
cs.RO · 2026-05-25 · unverdicted · none · ref 13 · internal anchor
Introduces EQA-Decision dataset with 4M+ QA pairs across four embodied reasoning dimensions and RoboDecision baseline for joint perception-reasoning-decision evaluation.

Clipping Bottleneck: Stabilizing RLVR via Stochastic Recovery of Near-Boundary Signals
cs.LG · 2026-05-21 · unverdicted · none · ref 2 · internal anchor
Proposes Near-boundary Stochastic Rescue (NSR) as a stochastic modification to clipping in RLVR that recovers near-boundary signals and yields gains over baselines like DAPO and GSPO.

Case-Aware Medical Image Classification with Multimodal Knowledge Graphs and Reliability-Guided Refinement
cs.CV · 2026-05-21 · unverdicted · none · ref 5 · 2 links · internal anchor
The paper presents a case-aware multimodal knowledge graph approach for medical image classification that retrieves similar cases, propagates knowledge via graph attention, and refines predictions with reliability estimates.

One-Way Policy Optimization for Self-Evolving LLMs
cs.LG · 2026-05-21 · unverdicted · none · ref 2 · internal anchor
OWPO decouples optimization direction from magnitude via asymmetric reweighting (Accelerated Alignment for inferior deviations, Gain Locking for superior) plus iterative references to create a ratchet effect for continuous LLM improvement.

ExComm: Exploration-Stage Communication for Error-Resilient Agentic Test-Time Scaling
cs.AI · 2026-05-21 · unverdicted · none · ref 14 · internal anchor
ExComm adds cross-agent conflict detection and soft belief correction plus trajectory diversification to agentic test-time scaling, yielding 5-6% gains over baselines on AIME and GAIA benchmarks.

Active Evidence-Seeking and Diagnostic Reasoning in Large Language Models for Clinical Decision Support
cs.AI · 2026-05-21 · unverdicted · none · ref 40 · internal anchor
Multi-turn evidence seeking reduces LLM diagnostic accuracy by 12.75% and supporting-evidence quality by 24.36% versus full-context evaluation in a new OSCE-inspired benchmark across 468 cases and 15 models.

The Illusion of Reasoning: Exposing Evasive Data Contamination in LLMs via Zero-CoT Truncation
cs.LG · 2026-05-21 · unverdicted · none · ref 33 · internal anchor
ZCP detects direct and evasive data contamination in LLMs by truncating CoT reasoning and contrasting zero-CoT accuracy on original versus perturbed isomorphic datasets, plus a Contamination Confidence metric.

Improving 3D Labeling in Self-Driving by Inferring Vehicle Information using Vision Language Models
cs.CV · 2026-05-20 · unverdicted · none · ref 37 · internal anchor
Vision language models are used in zero-shot mode to infer vehicle make/model/generation and accurate 3D dimensions from image crops, improving label quality and reducing manual effort especially under occlusion.
