super hub Mixed citations

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Eric Bieber, Gheorghe Comanici, Ice Pasupat, Inderjit Dhillon, Mike Schaekermann, Noveen Sachdeva · 2025 · cs.CL · arXiv 2507.06261

Mixed citation behavior. Most common role is background (55%).

904 Pith papers citing it

Background 55% of classified citations

open full Pith review browse 904 citing papers more from Eric Bieber arXiv PDF

abstract

In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements and Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost. Taken together, the Gemini 2.X model generation spans the full Pareto frontier of model capability vs cost, allowing users to explore the boundaries of what is possible with complex agentic problem solving.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 122 baseline 46 method 28 other 8 dataset 3

citation-polarity summary

background 114 baseline 47 use method 28 unclear 12 support 3 use dataset 3

claims ledger

abstract In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. G

authors

Eric Bieber Gheorghe Comanici Ice Pasupat Inderjit Dhillon Mike Schaekermann Noveen Sachdeva

co-cited works

representative citing papers

HKJudge: A Legal Discourse-Annotated Corpus for Interpreting What Courts Find, How They Reason, and What They Rule

cs.CL · 2026-06-04 · accept · novelty 8.0

HKJudge is a new ~290k-sentence expert-annotated corpus of Hong Kong criminal judgments with 26 rhetorical roles and 3 sentencing elements, plus benchmarks on classification and extraction tasks.

RRP-Voice: A Longitudinal Dataset and Benchmark for Recurrent Respiratory Papillomatosis Detection

eess.AS · 2026-06-01 · unverdicted · novelty 8.0

Introduces the first longitudinal voice dataset for RRP with benchmarks across handcrafted features, deep networks, self-supervised models, and audio LLMs under patient-level validation.

VideoFDB: Evaluating Full-Duplex Vision-Speech Capabilities in Conversational Agents

cs.CV · 2026-05-28 · unverdicted · novelty 8.0

VideoFDB is a new benchmark and LM-as-judge framework for evaluating full-duplex audio-visual-to-audio-visual conversational agents on nonverbal dynamics from real video calls.

EgoIntrospect: An Egocentric Dataset and Benchmark for User-Centric Internal State Reasoning

cs.CV · 2026-05-17 · unverdicted · novelty 8.0

EgoIntrospect provides the first egocentric dataset with self-annotations for internal state tasks and shows multimodal LLMs struggle to infer subjective states from combined signals.

Tracing Persona Vectors Through LLM Pretraining

cs.CL · 2026-05-13 · unverdicted · novelty 8.0

Persona vectors form within the first 0.22% of LLM pretraining and remain effective for steering post-trained models, with continued refinement and transfer to other models.

Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models

cs.AR · 2026-05-11 · conditional · novelty 8.0

Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.

Knowledge Poisoning Attacks on Medical Multi-Modal Retrieval-Augmented Generation

cs.CR · 2026-05-11 · unverdicted · novelty 8.0

M³Att poisons medical multimodal RAG by pairing covert textual misinformation with query-agnostic visual perturbations that increase retrieval of the bad content, causing LLMs to generate clinically plausible but incorrect responses.

Omni-DeepSearch: A Benchmark for Audio-Driven Omni-Modal Deep Search

cs.SD · 2026-05-09 · unverdicted · novelty 8.0

Omni-DeepSearch is a 640-sample benchmark for audio-driven omni-modal search where the best model reaches only 43.44% accuracy, exposing bottlenecks in audio inference, tool use, and cross-modal reasoning.

TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

cs.CV · 2026-05-08 · unverdicted · novelty 8.0

TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images

cs.CV · 2026-04-23 · unverdicted · novelty 8.0

S1-VL combines structured scientific reasoning with iterative image manipulation via code execution to reach state-of-the-art results on visual and scientific reasoning benchmarks.

Lost in Translation: Do LVLM Judges Generalize Across Languages?

cs.CL · 2026-04-21 · unverdicted · novelty 8.0

MM-JudgeBench shows substantial cross-lingual performance variance in 22 LVLM judges, with model size and architecture as poor predictors of multilingual robustness.

HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models

cs.SD · 2026-04-21 · unverdicted · novelty 8.0

HalluAudio is the first large-scale benchmark spanning speech, environmental sound, and music that uses human-verified QA pairs, adversarial prompts, and mixed-audio tests to measure hallucinations in large audio-language models.

When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models

cs.CV · 2026-04-19 · unverdicted · novelty 8.0

VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-specific experts and adaptive routing.

Narrative over Numbers: The Identifiable Victim Effect and its Amplification Under Alignment and Reasoning in Large Language Models

cs.CL · 2026-04-13 · conditional · novelty 8.0

Large language models display the identifiable victim effect at roughly twice the human baseline, strongly amplified by instruction tuning and chain-of-thought prompting but inverted by reasoning-specialized models.

MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark

cs.CV · 2026-04-12 · unverdicted · novelty 8.0 · 2 refs

MMRareBench provides 1,756 QA pairs and 7,958 images from PMC rare-disease cases to evaluate 23 MLLMs, revealing low treatment-planning scores and medical models underperforming general models on multi-image tasks due to capacity dilution.

HM-Bench: A Comprehensive Benchmark for Multimodal Large Language Models in Hyperspectral Remote Sensing

cs.CV · 2026-04-10 · accept · novelty 8.0

HM-Bench is the first benchmark for MLLMs on hyperspectral images, showing models struggle with complex spatial-spectral reasoning and perform better with visual PCA images than textual reports.

DialBGM: A Benchmark for Background Music Recommendation from Everyday Multi-Turn Dialogues

cs.AI · 2026-04-09 · unverdicted · novelty 8.0

DialBGM is a new benchmark dataset revealing that existing AI models fall far short of human performance when recommending fitting background music for open-domain conversations.

V2X-QA: A Comprehensive Reasoning Dataset and Benchmark for Multimodal Large Language Models in Autonomous Driving Across Ego, Infrastructure, and Cooperative Views

cs.RO · 2026-04-03 · conditional · novelty 8.0

V2X-QA provides a view-decoupled benchmark showing infrastructure views aid macroscopic traffic understanding while cooperative reasoning requires explicit cross-view alignment, with V2X-MoE as a routing-based baseline that improves performance.

ScreenParse: Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision

cs.CV · 2026-02-15 · conditional · novelty 8.0

ScreenParse dataset and ScreenVLM model deliver dense screen parsing that outperforms larger VLMs on PageIoU and transfers to better UI grounding.

EgoSound: Benchmarking Sound Understanding in Egocentric Videos

cs.CV · 2026-02-15 · unverdicted · novelty 8.0

EgoSound is a new benchmark with 7315 QA pairs across seven tasks to evaluate egocentric sound understanding in multimodal large language models.

VLRS-Bench: A Vision-Language Reasoning Benchmark for Remote Sensing

cs.CV · 2026-02-04 · unverdicted · novelty 8.0

VLRS-Bench is the first benchmark dedicated to complex vision-language reasoning in remote sensing, with 2000 QA pairs across 14 tasks in cognition, decision, and prediction dimensions.

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

cs.CV · 2026-01-15 · unverdicted · novelty 8.0

Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.

ConceptPose: Training-Free Zero-Shot Object Pose Estimation using Concept Vectors

cs.CV · 2025-12-09 · unverdicted · novelty 8.0

ConceptPose delivers state-of-the-art zero-shot relative pose estimation by matching open-vocabulary 3D concept vectors derived from VLM saliency maps, beating the strongest baseline by 62% in ADD(-S) without training.

ToG-Bench: Task-Oriented Spatio-Temporal Grounding in Egocentric Videos

cs.CV · 2025-12-03 · accept · novelty 8.0

ToG-Bench is the first benchmark for task-oriented spatio-temporal video grounding in egocentric videos, with explicit-implicit dual grounding and one-to-many object scenarios across 100 ScanNet clips and 2704 instructions.

citing papers explorer

Showing 50 of 59 citing papers after filters.

Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models cs.AR · 2026-05-11 · conditional · none · ref 10 · internal anchor
Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.
Narrative over Numbers: The Identifiable Victim Effect and its Amplification Under Alignment and Reasoning in Large Language Models cs.CL · 2026-04-13 · conditional · none · ref 25 · internal anchor
Large language models display the identifiable victim effect at roughly twice the human baseline, strongly amplified by instruction tuning and chain-of-thought prompting but inverted by reasoning-specialized models.
V2X-QA: A Comprehensive Reasoning Dataset and Benchmark for Multimodal Large Language Models in Autonomous Driving Across Ego, Infrastructure, and Cooperative Views cs.RO · 2026-04-03 · conditional · none · ref 3 · internal anchor
V2X-QA provides a view-decoupled benchmark showing infrastructure views aid macroscopic traffic understanding while cooperative reasoning requires explicit cross-view alignment, with V2X-MoE as a routing-based baseline that improves performance.
ScreenParse: Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision cs.CV · 2026-02-15 · conditional · none · ref 3 · internal anchor
ScreenParse dataset and ScreenVLM model deliver dense screen parsing that outperforms larger VLMs on PageIoU and transfers to better UI grounding.
Towards in-the-wild Egocentric 3D Hand-Object Pose Estimation cs.CV · 2026-06-29 · conditional · none · ref 11 · internal anchor
Introduces EPIC-Contact dataset and HOPformer transformer for in-the-wild egocentric 3D hand-object pose estimation, reporting 82.4% success on ARCTIC and doubled success with 75% lower contact error on the new dataset.
Unleashing Infinite Motion: Scaling Expressive Quadrupedal Motion via Generative Video Priors cs.RO · 2026-06-26 · conditional · none · ref 57 · internal anchor
Uni-Mo generates 7,488 language-annotated quadruped motions via LLM prompts and video diffusion, lifts them to 3D trajectories, and trains policies achieving 96.7% real-robot success on 392 sampled motions.
Why Far Looks Up: Probing Spatial Representation in Vision-Language Models cs.CV · 2026-05-28 · conditional · none · ref 24 · internal anchor
VLMs exhibit consistent vertical-distance entanglement in embeddings from perspective bias in natural images, producing accuracy gaps that a new synthetic benchmark SpatialTunnel exposes as model-intrinsic.
Disentangling Language Roles in Multilingual LLM Task Execution cs.CL · 2026-05-26 · conditional · none · ref 1 · internal anchor
MTM-Bench shows response-language role dominates degradation in multilingual LLM execution, with single response mismatches explaining most failures and mismatch count not predicting difficulty monotonically.
Which Way Did It Move? Diagnosing and Overcoming Directional Motion Blindness in Video-LLMs cs.CV · 2026-05-21 · conditional · none · ref 12 · internal anchor
Video-LLMs exhibit directional motion blindness from a direction binding gap; DeltaDirect projector objective lifts synthetic accuracy to 85.4% and real accuracy by 21.9 points while preserving other video capabilities.
Token by Token, Compromised: Backdoor Vulnerabilities in Unified Autoregressive Models cs.CR · 2026-05-19 · conditional · none · ref 14 · internal anchor
ToBAC is the first backdoor attack on unified autoregressive models, using data or model poisoning to make triggers elicit cross-modal malicious behavior in text and image generation.
The MixCount Dataset: Bridging the Data Gap for Open-Vocabulary Object Counting cs.CV · 2026-05-18 · conditional · none · ref 8 · internal anchor
MixCount provides a scalable synthetic dataset for mixed-object counting that improves state-of-the-art models on real benchmarks, cutting MAE by 20.14% on FSC-147 and 18.3% on PairTally.
RoPE Distinguishes Neither Positions Nor Tokens in Long Contexts, Provably cs.CL · 2026-05-15 · conditional · none · ref 7 · internal anchor
Proves that RoPE attention loses locality bias and token distinction in long contexts, approaching random behavior independent of content.
EntityBench: Towards Entity-Consistent Long-Range Multi-Shot Video Generation cs.CV · 2026-05-14 · conditional · none · ref 3 · internal anchor
EntityBench is a new benchmark with detailed per-shot entity schedules from real media, and the EntityMem baseline using persistent per-entity memory achieves the highest character fidelity with Cohen's d of +2.33.
MultiEmo-Bench: Multi-label Visual Emotion Analysis for Multi-modal Large Language Models cs.CV · 2026-05-14 · conditional · none · ref 27 · internal anchor
MultiEmo-Bench supplies 10,344 images with aggregated multi-label emotion votes from 20 annotators each to evaluate MLLMs on dominant emotion and full distribution prediction.
LeanSearch v2: Global Premise Retrieval for Lean 4 Theorem Proving cs.IR · 2026-05-13 · conditional · none · ref 5 · 2 links · internal anchor
LeanSearch v2 recovers 46.1% of ground-truth premise groups for research-level Lean 4 theorems within 10 candidates and raises fixed-loop proof success to 20%.
EpiGraph: Building Generalists for Evidence-Intensive Epilepsy Reasoning in the Wild cs.AI · 2026-05-10 · conditional · none · ref 46 · 2 links · internal anchor
EpiGraph creates a heterogeneous epilepsy knowledge graph that boosts LLM performance on clinical reasoning tasks by 30-41% in pharmacogenomics when used with Graph-RAG.
Mitigating Many-shot Jailbreak Attacks with One Single Demonstration cs.CR · 2026-05-08 · conditional · none · ref 12 · internal anchor
A single safety demonstration appended at inference time mitigates many-shot jailbreak attacks by counteracting implicit malicious fine-tuning on harmful examples.
Knowledge Visualization: A Benchmark and Method for Knowledge-Intensive Text-to-Image Generation cs.CV · 2026-04-24 · conditional · none · ref 11 · internal anchor
KVBench reveals major gaps in current T2I models for knowledge-intensive tasks, and KE-Check narrows the gap between open- and closed-source models by adding structured knowledge and enforcing constraints.
GTA-2: Benchmarking General Tool Agents from Atomic Tool-Use to Open-Ended Workflows cs.CL · 2026-04-17 · conditional · none · ref 20 · internal anchor
GTA-2 benchmark shows frontier models achieve below 50% on atomic tool tasks and only 14.39% success on realistic long-horizon workflows, with execution harnesses like Manus providing substantial gains.
TimeSeriesExamAgent: Creating Time Series Reasoning Benchmarks at Scale cs.AI · 2026-04-11 · conditional · none · ref 10 · internal anchor
TimeSeriesExamAgent combines templates and LLM agents to generate scalable time series reasoning benchmarks, demonstrating that current LLMs have limited performance on both abstract and domain-specific tasks.
ProactiveMobile: A Comprehensive Benchmark for Boosting Proactive Intelligence on Mobile Devices cs.AI · 2026-02-25 · conditional · none · ref 7 · internal anchor
ProactiveMobile is a new benchmark for proactive mobile agents that tests latent intent inference from context and executable API generation, where a fine-tuned 7B model reaches 19.15% success versus 15.71% for o1 and 7.39% for GPT-5.
Efficient Remote KV Cache Reuse with GPU-native Video Codec cs.DC · 2026-02-10 · conditional · none · ref 23 · internal anchor
KVCodec uses GPU-native video codecs and pipelined fetching to compress and transmit KV caches, delivering up to 3.51x faster TTFT than prior methods while preserving accuracy.
Beyond Literal Mapping: Benchmarking and Improving Non-Literal Translation Evaluation cs.CL · 2026-01-12 · conditional · none · ref 3 · internal anchor
MENT benchmark plus RATE agentic evaluator raise combined system- and segment-level correlation with human judgments by at least 3.2 points over prior MT metrics and LLM judges.
IBISAgent: Reinforcing Pixel-Level Visual Reasoning in MLLMs for Universal Biomedical Object Referring and Segmentation cs.CV · 2026-01-06 · conditional · none · ref 6 · internal anchor
IBISAgent enables MLLMs to perform iterative pixel-level visual reasoning for biomedical object referring and segmentation via text-based clicks and agentic RL, outperforming prior SOTA methods without model modifications.
Large Video Planner Enables Generalizable Robot Control cs.RO · 2025-12-17 · conditional · none · ref 21 · internal anchor
A video foundation model trained on human demonstrations generates zero-shot plans that convert to executable robot actions on novel scenes and tasks.
Setting the Stage: Text-Driven Scene-Consistent Image Generation cs.CV · 2025-12-14 · conditional · none · ref 3 · internal anchor
A new data pipeline using real photos, entity removal, and image-to-video models plus a cross-view attention loss enables text-driven generation of actors in reference scenes with improved alignment.
Agile Deliberation: Concept Deliberation for Subjective Visual Classification cs.AI · 2025-12-11 · conditional · none · ref 4 · internal anchor
Agile Deliberation improves F1 scores by 7.5% over automated baselines and 3% over manual deliberation in 18 user sessions by supporting iterative refinement of subjective visual concepts.
CodeRL+: Improving Code Generation via Reinforcement with Execution Semantics Alignment cs.SE · 2025-10-21 · conditional · none · ref 3 · internal anchor
CodeRL+ integrates variable-level execution trajectory inference into RLVR training to align textual code representations with execution semantics, delivering 4.6% relative pass@1 gains and generalization to code-reasoning and test-output tasks.
XModBench: Benchmarking Cross-Modal Capabilities and Consistency in Omni-Language Models cs.CV · 2025-10-16 · conditional · none · ref 6 · internal anchor
XModBench is a tri-modal benchmark that systematically measures cross-modal consistency, modality disparities, and directional imbalances in omni-language models across five task families and all modality combinations.
Rethinking Parameter Sharing for LLM Fine-Tuning with Multiple LoRAs cs.LG · 2025-09-29 · conditional · none · ref 9 · internal anchor
By sharing the B matrix across adapters instead of the A matrix, ALoRA and Fed-ALoRA deliver more balanced performance in multi-task and federated LLM fine-tuning.
StereoFoley: Object-Aware Stereo Audio Generation from Video cs.SD · 2025-09-22 · conditional · none · ref 31 · internal anchor
StereoFoley is an end-to-end video-to-stereo-audio framework that uses a base generative model fine-tuned on synthetic object-tracked data with panning and distance controls to achieve object-aware spatial sound.
SpatialScore: Towards Comprehensive Evaluation for Spatial Intelligence cs.CV · 2025-05-22 · conditional · none · ref 18 · internal anchor
Presents SpatialScore benchmark for MLLM spatial reasoning, evaluates 49 models showing large human gap, and supplies SpatialCorpus plus SpatialAgent to improve performance.
An Enigma of Artificial Reason: Investigating the Production-Evaluation Gap in Large Reasoning Models cs.AI · 2026-05-31 · conditional · none · ref 13 · internal anchor
LRMs show a large production-evaluation gap on the VAIR dataset with valid answers but invalid reasoning, driven by answer confirmation bias as evidenced by CoT analysis, linear probes, and causal patching.
Leveraging Latent Visual Reasoning in Silence cs.CV · 2026-05-18 · conditional · none · ref 5 · internal anchor
Latent visual reasoning improves multimodal models via training effects even without using latent tokens at inference, enabled by an attention-based RL reward that promotes interaction with text tokens.
LPDS: Evaluating LLM Robustness Through Logic-Preserving Difficulty Scaling cs.LG · 2026-05-14 · conditional · none · ref 9 · internal anchor
LPDS quantifies difficulty of logic-preserving problem variations and searches for the hardest ones, producing up to 5x larger performance drops than random sampling and better robustness gains from fine-tuning on difficult examples.
OpenDeepThink: Parallel Reasoning via Bradley-Terry Aggregation cs.AI · 2026-05-14 · conditional · none · ref 4 · 2 links · internal anchor
OpenDeepThink uses Bradley-Terry aggregation of LLM pairwise judgments to rank and evolve parallel reasoning traces, improving Gemini 3.1 Pro Codeforces Elo by 405 points over eight rounds.
Revealing the Gap in Human and VLM Scene Perception through Counterfactual Semantic Saliency cs.CV · 2026-05-13 · conditional · none · ref 11 · internal anchor
VLMs exhibit size, center, and saliency biases in scene understanding, relying less on people than humans do, with size bias as a key driver of divergence.
Revisiting DAgger in the Era of LLM-Agents cs.LG · 2026-05-13 · conditional · none · ref 11 · internal anchor
DAgger-style training with turn-level policy interpolation raises 4B and 8B LLM agents to 27.3% and 29.8% on SWE-bench Verified, beating several larger published systems.
PathISE: Learning Informative Path Supervision for Knowledge Graph Question Answering cs.AI · 2026-05-11 · conditional · none · ref 4 · internal anchor
PathISE generates pseudo path-level supervision from answer labels alone via a transformer estimator, distills it to an LLM path generator, and achieves competitive or state-of-the-art KGQA performance on three benchmarks without costly refined supervision.
Towards Conversational Medical AI with Eyes, Ears and a Voice cs.AI · 2026-05-10 · conditional · none · ref 3 · internal anchor
AI co-clinician is a multimodal conversational AI that uses live audio-visual data for real-time medical reasoning in simulated telemedicine, approaching primary care physicians in management plans and differentials but lagging in physical exam and disease-specific tasks.
Hard to Read, Easy to Jailbreak: How Visual Degradation Bypasses MLLM Safety Alignment cs.CV · 2026-05-08 · conditional · none · ref 41 · internal anchor
Degraded image resolution in MLLMs bypasses safety alignments via cognitive overload, raising jailbreak rates across perturbations.
Why Does Agentic Safety Fail to Generalize Across Tasks? cs.LG · 2026-05-07 · conditional · none · ref 25 · internal anchor
Agentic safety fails to generalize across tasks because the task-to-safe-controller mapping has a higher Lipschitz constant than the task-to-controller mapping alone, as proven in linear-quadratic control and demonstrated in quadcopter and LLM experiments.
Virtual Speech Therapist: A Clinician-in-the-Loop AI Speech Therapy Agent for Personalized and Supervised Therapy cs.AI · 2026-05-01 · conditional · none · ref 43 · internal anchor
VST combines deep-learning stuttering classification with multi-agent LLM reasoning to produce clinician-reviewed, evidence-based therapy plans for stuttering.
Decouple before Integration: Test-time Synthesis of SFT and RLVR Task Vectors cs.LG · 2026-05-01 · conditional · none · ref 9 · internal anchor
DoTS decouples SFT and RLVR training then synthesizes their task vectors at inference time to match integrated training results at ~3% compute cost.
Rethinking Language Model Scaling under Transferable Hypersphere Optimization cs.LG · 2026-03-30 · conditional · none · ref 9 · internal anchor
HyperP transfers optimal learning rates across model width, depth, tokens, and MoE granularity under Frobenius-sphere constraints, delivering stable scaling and 1.58x efficiency gains.
CRISP: Compressed Reasoning via Iterative Self-Policy Distillation cs.LG · 2026-03-05 · conditional · none · ref 3 · internal anchor
CRISP achieves 57-59% token reduction on MATH-500 with 9-16 point accuracy gains on Qwen3 models via iterative self-distillation of concise reasoning behavior.
Measuring Social Bias in Vision-Language Models with Face-Only Counterfactuals from Real Photos cs.CV · 2026-01-11 · conditional · none · ref 2 · internal anchor
VLMs exhibit demographic biases in occupation and salary decisions even when only faces are altered in otherwise identical real photos.
FPBench: A Comprehensive Benchmark of Multimodal Large Language Models for Fingerprint Analysis cs.CV · 2025-12-19 · conditional · none · ref 8 · internal anchor
FPBench evaluates 20 MLLMs across 8 fingerprint tasks on 7 datasets and shows fine-tuning vision and language encoders improves performance by 7-39%.
Detector-Empowered Video Large Language Model for Efficient Spatio-Temporal Grounding cs.CV · 2025-12-07 · conditional · none · ref 12 · internal anchor
DEViL offloads spatial grounding to a detector via a distilled reference-semantic token and temporal consistency regularization, reaching 43.1% m_vIoU at 14.33 FPS on HC-STVG.
MedGRPO: Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding cs.CV · 2025-12-06 · conditional · none · ref 8 · internal anchor
MedGRPO applies cross-dataset reward normalization and a clinical LLM judge within multi-task RL to improve vision-language models on heterogeneous medical video understanding tasks using the new MedVidBench dataset.

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer