HKJudge is a new ~290k-sentence expert-annotated corpus of Hong Kong criminal judgments with 26 rhetorical roles and 3 sentencing elements, plus benchmarks on classification and extraction tasks.
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities
Citation behavior is mixed; the most common citation role is background (55%).
abstract
In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements and Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost. Taken together, the Gemini 2.X model generation spans the full Pareto frontier of model capability vs cost, allowing users to explore the boundaries of what is possible with complex agentic problem solving.
representative citing papers
Introduces the first longitudinal voice dataset for RRP with benchmarks across handcrafted features, deep networks, self-supervised models, and audio LLMs under patient-level validation.
VideoFDB is a new benchmark and LM-as-judge framework for evaluating full-duplex audio-visual-to-audio-visual conversational agents on nonverbal dynamics from real video calls.
EgoIntrospect provides the first egocentric dataset with self-annotations for internal state tasks and shows multimodal LLMs struggle to infer subjective states from combined signals.
Persona vectors form within the first 0.22% of LLM pretraining and remain effective for steering post-trained models, with continued refinement and transfer to other models.
Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.
M³Att poisons medical multimodal RAG by pairing covert textual misinformation with query-agnostic visual perturbations that increase retrieval of the bad content, causing LLMs to generate clinically plausible but incorrect responses.
Omni-DeepSearch is a 640-sample benchmark for audio-driven omni-modal search where the best model reaches only 43.44% accuracy, exposing bottlenecks in audio inference, tool use, and cross-modal reasoning.
TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.
S1-VL combines structured scientific reasoning with iterative image manipulation via code execution to reach state-of-the-art results on visual and scientific reasoning benchmarks.
MM-JudgeBench shows substantial cross-lingual performance variance in 22 LVLM judges, with model size and architecture as poor predictors of multilingual robustness.
HalluAudio is the first large-scale benchmark spanning speech, environmental sound, and music that uses human-verified QA pairs, adversarial prompts, and mixed-audio tests to measure hallucinations in large audio-language models.
VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-specific experts and adaptive routing.
Large language models display the identifiable victim effect at roughly twice the human baseline, strongly amplified by instruction tuning and chain-of-thought prompting but inverted by reasoning-specialized models.
MMRareBench provides 1,756 QA pairs and 7,958 images from PMC rare-disease cases to evaluate 23 MLLMs, revealing low treatment-planning scores and medical models underperforming general models on multi-image tasks due to capacity dilution.
HM-Bench is the first benchmark for MLLMs on hyperspectral images, showing models struggle with complex spatial-spectral reasoning and perform better with visual PCA images than textual reports.
DialBGM is a new benchmark dataset revealing that existing AI models fall far short of human performance when recommending fitting background music for open-domain conversations.
V2X-QA provides a view-decoupled benchmark showing infrastructure views aid macroscopic traffic understanding while cooperative reasoning requires explicit cross-view alignment, with V2X-MoE as a routing-based baseline that improves performance.
ScreenParse dataset and ScreenVLM model deliver dense screen parsing that outperforms larger VLMs on PageIoU and transfers to better UI grounding.
EgoSound is a new benchmark with 7315 QA pairs across seven tasks to evaluate egocentric sound understanding in multimodal large language models.
VLRS-Bench is the first benchmark dedicated to complex vision-language reasoning in remote sensing, with 2000 QA pairs across 14 tasks in cognition, decision, and prediction dimensions.
Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.
ConceptPose delivers state-of-the-art zero-shot relative pose estimation by matching open-vocabulary 3D concept vectors derived from VLM saliency maps, beating the strongest baseline by 62% in ADD(-S) without training.
ToG-Bench is the first benchmark for task-oriented spatio-temporal video grounding in egocentric videos, with explicit-implicit dual grounding and one-to-many object scenarios across 100 ScanNet clips and 2704 instructions.
citing papers explorer
-
KompeteAI: Accelerated Autonomous Multi-Agent System for End-to-End Pipeline Generation for Machine Learning Problems
KompeteAI accelerates AutoML pipeline evaluation 6.9 times and beats prior systems by 3% on MLE-Bench through candidate merging, external RAG, and predictive early scoring.
-
MathArena: Evaluating LLMs on Uncontaminated Math Competitions
MathArena evaluates over 50 LLMs on 162 fresh competition problems across seven contests, detects contamination in AIME 2024, and reports top models scoring below 40 percent on IMO 2025 proof tasks.
-
SpatialScore: Towards Comprehensive Evaluation for Spatial Intelligence
Presents SpatialScore benchmark for MLLM spatial reasoning, evaluates 49 models showing large human gap, and supplies SpatialCorpus plus SpatialAgent to improve performance.
-
In-depth Research Impact Summarization through Fine-Grained Temporal Citation Analysis
A framework for nuanced, time-aware research impact summarization using fine-grained temporal citation intents shows moderate to strong correlation with human judgments on insightfulness.
-
WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs
WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.
-
FastKV: Decoupling of Context Reduction and KV Cache Compression for Prefill-Decoding Acceleration
FastKV decouples prefill context reduction via Token-Selective Propagation from independent KV cache selection, delivering up to 1.82x prefill and 2.87x decoding speedups while matching decoding-only accuracy.
-
Leveraging ASIC AI Chips for Homomorphic Encryption
CROSS compiler maps HE workloads to TPU architecture via basis-aligned and memory-aligned transformations, reporting higher throughput-per-watt than prior GPU and ASIC libraries on NTT and HE operators.
-
LVBench: An Extreme Long Video Understanding Benchmark
LVBench is a new benchmark for extreme long video understanding that evaluates multimodal large language models on hour-scale videos using tasks designed to probe extended memory and comprehension.
-
FlowEval: Reference-based Evaluation of Generated User Interfaces
FlowEval evaluates generated UIs by measuring how closely their navigation flows match real websites via reference-based similarity metrics and shows strong correlation with human expert judgments.
-
Act2See: Emergent Active Visual Perception for Video Reasoning
Act2See trains VLMs via supervised fine-tuning on verified reasoning traces to interleave active frame calls within text CoTs, yielding SOTA results on video reasoning benchmarks.
-
TripVVT: A Large-Scale Triplet Dataset and a Coarse-Mask Baseline for In-the-Wild Video Virtual Try-On
A new large-scale triplet dataset and diffusion transformer model using coarse human masks deliver improved video virtual try-on quality and generalization in challenging real-world conditions.
-
ClarifyCodeBench: Evaluating LLMs on Clarifying Ambiguous Requirements for Code Generation
ClarifyCodeBench is a new benchmark with manual annotations and two metrics showing that LLMs strong at code generation are weak at clarifying ambiguous requirements, with performance worsening as ambiguity density rises.
-
StochasT: Learning with Stochastic Turn Depth for Visual Instruction Tuning
StochasT uses stochastic clustering of language tasks into varying turn depths for the same image to improve LVLMs on both single-turn and multi-turn scenarios without discarding data.
-
OnPoint: Offline-to-Online Multi-Level Distillation for Point-Supervised Online Temporal Action Localization
OnPoint enables point-supervised online temporal action localization by distilling pseudo-segments, class-activation sequences, and anticipatory windows from an offline teacher to an online student.
-
Surrogate Fidelity: When Can Open LLMs Explain Closed Ones?
Prediction agreement between open and closed LLMs substantially overstates agreement on attributions and causal reasons.
-
UniTac: A Unified Multimodal Model for Cross-Sensor Tactile Understanding and Generation
UniTac is the first unified multimodal model for cross-sensor tactile understanding and generation, using dual-level representations, two new understanding tasks, and a two-stage training paradigm with sensor-prior sampling to achieve SOTA understanding and realistic cross-sensor generation.
-
Benchmarking Large Language Models on Floating-Point Error Classification
Introduces InterFLOPBench benchmark and evaluates 14 LLMs on multi-label classification of six floating-point error categories in C code, with top models exceeding 0.88 overall F1 but lower scores on subtle errors like underflow.
-
Training Therapeutic Judges and Multi-Agent Systems for Human-Aligned Mental Health Support
TheraJudge, trained via preference optimization on human annotations, reaches high clinician agreement (ICC 0.87-0.95) and, when used by TheraAgent, raises human-rated therapeutic quality by 0.43 points on a 5-point scale with 94% recovery of low-quality responses.
-
Security–Fidelity Tradeoffs: The Hidden Cost of Prompt Injection Defense
Prompt injection defenses create a security-fidelity tradeoff with no model or defense achieving both high security and high fidelity on the SecFid benchmark across 1,168 examples.
-
Before Thinking, Learn to Decide: Proactive Routing for Efficient Visual Reasoning
PRP introduces proactive routing via Draft Rating Learning and Joint Rating Learning to route queries early between draft and target models for efficient multimodal reasoning.
-
Open Problems in Constitutional Preference Reconstruction
Empirical analysis across three datasets identifies three open problems in constitutional preference reconstruction and shows that principle refinement raises inter-executor agreement from 73% to 78%.
-
StrucTab: A Structured Optimization Framework for Table Parsing
StrucTab achieves SOTA table parsing performance by unifying structural subtasks through sequential reasoning and using decomposed RL rewards in Uni-TabRL, plus a new TableVerse-5K benchmark.
-
Making Multimodal LLMs Reliable Chart Data Extractors: A Benchmark and Training Framework
Introduces a benchmark for MLLM-based chart data extraction from unlabeled images and a human-centered training framework that reaches SOTA numerical accuracy with a 7B model.
-
AerialMetric: Benchmarking and Adapting UAV Monocular Metric Depth Estimation in the Real World
AerialMetric is a new benchmark dataset and evaluation suite for adapting monocular metric depth estimation models to real-world UAV aerial views.
-
MotionAtlas: Detailed Region Captioning for Motion-Centric Videos
MotionAtlas supplies a 2,073-question benchmark, a self-bootstrap pipeline yielding 159k captions, and fine-tuned Video-MLLMs that deliver 5.2-point gains over Qwen3-VL-4B on motion tasks.
-
PolicyGuard: A Dialogue-Grounded Sub-Agent Verifier for Policy Adherence in LLM Agents
PolicyGuard is a dialogue-grounded sub-agent verifier that raises PASS4 scores by 6-12 points on an airline benchmark while catching more violations with fewer blocks than argument-level guards.
-
The Heterogeneous Safety Impacts of Benign Multilingual Fine-Tuning
Benign multilingual fine-tuning causes language-specific safety drifts with adversarial compliance rates rising up to four-fold, decoupled from capability gains.
-
A Physics-Grounded Benchmark for Multi-Agent Dynamics in World Models
CrashTwin is a new benchmark framework that exposes physical violations in state-of-the-art world models during multi-agent collisions despite high visual quality.
-
LocalNav: Distilling Frontier VLMs and Embodied RL for On-Device Object Goal Navigation
Distillation from frontier VLMs plus E-RLVR regularization produces a 4B local model that achieves 34.5% SR on OVON while cutting inference latency by 82.8%.
-
Intuition-Guided Latent Reasoning for LLM-Based Recommendation
IntuRec anchors LLM latent reasoning for recommendation by deriving an intuition embedding from top-K candidates via self- and cross-attention to initialize more accurate trajectories.
-
Fine-tuning a multimodal large language model for clinician-grade autism behavioral scoring from short home videos
Fine-tuning Gemini 2.5 Pro with LoRA on 400 home videos improves per-feature agreement with clinicians by 40% and zero-shot ASD diagnosis F1 by 53% on held-out data, with classifier pipelines reaching 77% accuracy.
-
DiARC: Distinguishing Positive and Negative Samples Helps Improving ARC-like Reasoning Ability of Large Language Models
DiARC improves LLM performance on ARC-like benchmarks by constructing and training on preference pairs from three types of negative samples while keeping demonstrations fixed.
-
Text Over Image: Auditing Multimodal Robustness in Synthetic Medical Image Detection
VLMs for synthetic medical image detection overweight text metadata, flipping authenticity judgments on the same image and dropping accuracy on authentic images by 61.1% on average when an explicit AI-origin tag is present.
-
FoMoE: Breaking the Full-Replica Barrier with a Federation of MoEs
FoMoE partitions expert layers across workers in MoE LLMs, skips non-resident experts, and reports up to 1.42x lower communication than baselines plus 1.4x throughput gains while maintaining stable routing.
-
RoboPIN: Grounded Embodied Reasoning via Pinned Chain-of-Thought
Introduces PinCoT paradigm with visual reasoning anchors, builds PIN-170K dataset via automated pipeline, and trains 4B RoboPIN model via three-stage post-training to outperform 7B baselines by 12% on embodied reasoning benchmarks.
-
Multi-Faceted Interactivity Alignment in Full-Duplex Speech Models
A multi-axis RL alignment technique improves pause handling, turn-taking, backchanneling, and interruption response in full-duplex spoken dialogue models by optimizing axis-specific rewards derived from human audio segments.
-
SPA: A SQL-Plan-Aware Reinforcement Learning Framework for Query Rewriting with LLMs
SPA trains LLMs via plan-aware RL with adaptive reward shaping and self-improvement on slowdowns to produce faster query rewrites than rule-based or standard LLM methods on IID and OOD workloads.
-
Provably Efficient Personalized Multi-Objective Bandits with Proactive Conversational Queries
MO-PQUCB hybrid algorithm integrates proactive conversational queries with bandit feedback via shift-invariant regularization to achieve improved regret bounds in personalized multi-objective bandits.
-
Arabic Sentence Segmentation Across Genres and Punctuation Conditions
AraSEG is a genre-diverse Arabic sentence segmentation corpus showing lightweight encoders and dependency parsers outperform LLMs under challenging punctuation while improving downstream parsing.
-
3DMorph: Single-Image-Guided Local 3D Shape Editing and Morphing
3DMorph transfers local modifications from a single edited 2D image to the corresponding regions of a 3D mesh without training and supports shape morphing between original and edited versions.
-
MADE: Beyond Scoring via a Multilingual Agentic Diagnosing Engine for Fine-Grained Evaluation Insights
MADE is a new multilingual agentic diagnosing engine that produces higher-quality diagnostic reports (47% better than baseline) on a large-scale evaluation substrate covering 33 model families and 26 languages.
-
Don't Pause: Streaming Video-Language Synchrony for Online Video Understanding
LyraV uses FDTC and SToP for per-frame incremental decoding to reach 98.29% video synchrony at 3.89 FPS while preserving general understanding.
-
Contrastive Training with LLM-generated Near-Misses for Robust Code-Switching Speech Recognition
A POI-aware contrastive training framework using LLM-generated near-misses reduces both general and CS-aware error rates by over 2% on cmn-eng and vie-eng code-switching ASR datasets compared to standard LoRA fine-tuning.
-
Stream3D-VLM: Online 3D Spatial Understanding with Incremental Geometry Priors
Stream3D-VLM adds autoregressive streaming control, VSFI geometry integration, GAVC compression, and a 1M-pair benchmark to enable real-time 3D VLM performance that beats prior models on 29 online and offline tasks.
-
How Language Models Fail: Token-Level Signatures of Committed and Persistent Reasoning Failures
LLM reasoning failures split into committed (early lock-in) and persistent-uncertainty modes with distinct token-level signatures that hold across 23 model-dataset pairs in 20 of 23 falsifiable tests.
-
CogManip: Benchmarking Manipulative Behavior in Multi-Turn Interactions with Large Language Model
CogManip is a benchmark that tests 13 LLMs on 15 manipulation risks in 1,000 multi-turn dialogues, finding heterogeneous risks and prompt sensitivity in models like DeepSeek-V3.2.
-
Causal Scaffolding for Physical Reasoning: A Benchmark for Causally-Informed Physical World Understanding in VLMs
Introduces CausalPhys benchmark with causal graphs and CRFT fine-tuning to improve VLMs' causal physical reasoning accuracy and interpretability.
-
TAPO: Tool-Aware Policy Optimization via Credit Transfer for Multimodal Search Agents
TAPO corrects credit misassignment in RL for multimodal search agents by using tool parameter similarity to share advantages across equivalent actions.
-
Seeing Time: Benchmarking Chronological Reasoning and Shortcut Biases in Vision-Language Models
Introduces ChronoVision benchmark with three datasets showing VLMs rely on superficial cues such as color filters rather than genuine chronological reasoning.
-
ShotCrop$^3$: Cropping Human-Centric Images into Cinematic Triple-Shot Compositions
ShotCrop uses three-stage training (CoT SFT, pseudo-label semi-supervised, GRPO-S) to produce triple-shot compositions and reports 2.82x better shot localization than GPT-5 on a 1.2k expert benchmark.