HKJudge is a new ~290k-sentence expert-annotated corpus of Hong Kong criminal judgments with 26 rhetorical roles and 3 sentencing elements, plus benchmarks on classification and extraction tasks.
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities
Mixed citation behavior. Most common role is background (55%).
abstract
In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements and Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost. Taken together, the Gemini 2.X model generation spans the full Pareto frontier of model capability vs cost, allowing users to explore the boundaries of what is possible with complex agentic problem solving.
representative citing papers
Introduces the first longitudinal voice dataset for RRP with benchmarks across handcrafted features, deep networks, self-supervised models, and audio LLMs under patient-level validation.
VideoFDB is a new benchmark and LM-as-judge framework for evaluating full-duplex audio-visual-to-audio-visual conversational agents on nonverbal dynamics from real video calls.
EgoIntrospect provides the first egocentric dataset with self-annotations for internal state tasks and shows multimodal LLMs struggle to infer subjective states from combined signals.
Persona vectors form within the first 0.22% of LLM pretraining and remain effective for steering post-trained models, with continued refinement and transfer to other models.
Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.
M³Att poisons medical multimodal RAG by pairing covert textual misinformation with query-agnostic visual perturbations that increase retrieval of the bad content, causing LLMs to generate clinically plausible but incorrect responses.
Omni-DeepSearch is a 640-sample benchmark for audio-driven omni-modal search where the best model reaches only 43.44% accuracy, exposing bottlenecks in audio inference, tool use, and cross-modal reasoning.
TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.
S1-VL combines structured scientific reasoning with iterative image manipulation via code execution to reach state-of-the-art results on visual and scientific reasoning benchmarks.
MM-JudgeBench shows substantial cross-lingual performance variance in 22 LVLM judges, with model size and architecture as poor predictors of multilingual robustness.
HalluAudio is the first large-scale benchmark spanning speech, environmental sound, and music that uses human-verified QA pairs, adversarial prompts, and mixed-audio tests to measure hallucinations in large audio-language models.
VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-specific experts and adaptive routing.
Large language models display the identifiable victim effect at roughly twice the human baseline, strongly amplified by instruction tuning and chain-of-thought prompting but inverted by reasoning-specialized models.
MMRareBench provides 1,756 QA pairs and 7,958 images from PMC rare-disease cases to evaluate 23 MLLMs, revealing low treatment-planning scores and medical models underperforming general models on multi-image tasks due to capacity dilution.
HM-Bench is the first benchmark for MLLMs on hyperspectral images, showing models struggle with complex spatial-spectral reasoning and perform better with visual PCA images than textual reports.
DialBGM is a new benchmark dataset revealing that existing AI models fall far short of human performance when recommending fitting background music for open-domain conversations.
V2X-QA provides a view-decoupled benchmark showing infrastructure views aid macroscopic traffic understanding while cooperative reasoning requires explicit cross-view alignment, with V2X-MoE as a routing-based baseline that improves performance.
ScreenParse dataset and ScreenVLM model deliver dense screen parsing that outperforms larger VLMs on PageIoU and transfers to better UI grounding.
EgoSound is a new benchmark with 7315 QA pairs across seven tasks to evaluate egocentric sound understanding in multimodal large language models.
VLRS-Bench is the first benchmark dedicated to complex vision-language reasoning in remote sensing, with 2000 QA pairs across 14 tasks in cognition, decision, and prediction dimensions.
Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.
ConceptPose delivers state-of-the-art zero-shot relative pose estimation by matching open-vocabulary 3D concept vectors derived from VLM saliency maps, beating the strongest baseline by 62% in ADD(-S) without training.
ToG-Bench is the first benchmark for task-oriented spatio-temporal video grounding in egocentric videos, with explicit-implicit dual grounding and one-to-many object scenarios across 100 ScanNet clips and 2704 instructions.
citing papers explorer
- ProactiveMobile: A Comprehensive Benchmark for Boosting Proactive Intelligence on Mobile Devices
ProactiveMobile is a new benchmark for proactive mobile agents that tests latent intent inference from context and executable API generation, where a fine-tuned 7B model reaches 19.15% success versus 15.71% for o1 and 7.39% for GPT-5.
- Geometry-Aware Decoding with Wasserstein-Regularized Truncation and Mass Penalties for Large Language Models
Top-W applies Wasserstein-regularized truncation on token-embedding geometry to create a closed-form optimal crop for LLM sampling that outperforms prior methods by up to 33.7% on GSM8K, GPQA, AlpacaEval, and MT-Bench.
- Efficient Remote KV Cache Reuse with GPU-native Video Codec
KVCodec uses GPU-native video codecs and pipelined fetching to compress and transmit KV caches, delivering up to 3.51x faster TTFT than prior methods while preserving accuracy.
- Bayesian Preference Learning for Test-Time Steerable Reward Models
ICRM casts reward modeling as amortized variational inference over a latent preference probability with a Beta prior, enabling test-time adaptation to unseen preferences and improving benchmark performance.
- TEAM: Temporal-Spatial Consistency Guided Expert Activation for MoE Diffusion Language Model Acceleration
TEAM accelerates MoE dLLMs up to 2.2x by exploiting temporal-spatial consistency in expert routing to accept more tokens with fewer activations.
- ST-BiBench: Benchmarking Multi-Stream Multimodal Coordination in Bimanual Embodied Tasks for MLLMs
ST-BiBench reveals a coordination paradox in which MLLMs show strong high-level strategic reasoning yet fail at fine-grained 16-dimensional bimanual action synthesis and multi-stream fusion.
- DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos
DreamDojo is a foundation world model pretrained on the largest human video dataset to date that uses continuous latent actions to transfer interaction knowledge and achieves controllable physics simulation after robot post-training.
- BenchMarker: An Education-Inspired Toolkit for Highlighting Flaws in Multiple-Choice Benchmarks
BenchMarker toolkit audits 12 MCQA benchmarks for contamination, shortcuts, and writing errors using LLM judges, finding widespread flaws that inflate or deflate accuracy and alter rankings.
- AGILE: Hand-Object Interaction Reconstruction from Video via Agentic Generation
AGILE generates complete object meshes via VLM-guided synthesis and tracks poses with anchor-and-track plus contact-aware optimization to achieve robust hand-object reconstruction from video.
- CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding
Multimodal LLMs process code as images to achieve up to 8x token compression, with visual cues like syntax highlighting aiding tasks and clone detection remaining resilient or even improving under compression.
- CamReasoner: Reinforcing Camera Movement Understanding via Structured Spatial Reasoning
CamReasoner uses structured O-T-A reasoning and RL on 56k samples to lift camera movement classification from 73.8% to 78.4% and VQA from 60.9% to 74.5% on Qwen2.5-VL-7B.
- LLM-Assisted Op-Amp Behavioral-Level Design via Agentic Human-Mimicking Reasoning
White-Op uses LLM agents for interpretable op-amp behavioral design via formalized symbolic reasoning, pole-zero handling, and iterative simulation-based refinement, succeeding on all 9 tested topologies with 8.52% average error where black-box methods fail.
- VideoThinker: Building Agentic VideoLLMs with LLM-Guided Tool Reasoning
VideoThinker uses LLM-generated synthetic tool trajectories in caption space grounded to video frames to train agentic VideoLLMs that outperform baselines on long-video benchmarks.
- When Agents Fail: A Comprehensive Study of Bugs in LLM Agents with Automated Labeling
A large-scale empirical study categorizes bugs in LLM agents and demonstrates that a specialized LLM agent can annotate them accurately at very low cost.
- Q-Probe: Scaling Image Quality Assessment to High Resolution via Context-Aware Agentic Probing
Q-Probe introduces the first agentic IQA framework that scales to high resolutions using context-aware probing, a new Vista-Bench benchmark, and three-stage training to reach state-of-the-art performance across scales.
- AQUA-Bench: Beyond Finding Answers to Knowing When There Are None in Audio Question Answering
AQUA-Bench evaluates audio QA models on three unanswerability scenarios: missing correct answers, mismatched choice sets, and questions irrelevant to the audio.
- Beyond Literal Mapping: Benchmarking and Improving Non-Literal Translation Evaluation
MENT benchmark plus RATE agentic evaluator raise combined system- and segment-level correlation with human judgments by at least 3.2 points over prior MT metrics and LLM judges.
- OPT-Engine: Benchmarking the Limits of LLMs in Optimization Modeling via Complexity Scaling
OPT-Engine shows pure-text chain-of-thought reasoning in LLMs loses robustness as optimization complexity grows, external tools fix only local arithmetic, and solver-integrated methods are bottlenecked by automated constraint formulation.
- GAMBIT: A Gamified Jailbreak Framework for Multimodal Large Language Models
GAMBIT constructs gamified instructional traps that decompose harmful visuals and drive MLLMs to reconstruct and answer malicious queries as part of winning a game, achieving over 85% attack success on models including GPT-4o and Gemini 2.5 Flash.
- IBISAgent: Reinforcing Pixel-Level Visual Reasoning in MLLMs for Universal Biomedical Object Referring and Segmentation
IBISAgent enables MLLMs to perform iterative pixel-level visual reasoning for biomedical object referring and segmentation via text-based clicks and agentic RL, outperforming prior SOTA methods without model modifications.
- Style Amnesia: Investigating Speaking Style Degradation and Mitigation in Multi-Turn Spoken Language Models
Spoken language models exhibit style amnesia and fail to maintain instructed paralinguistic styles across multi-turn conversations, with explicit recall offering partial mitigation.
- Schoenfeld's Anatomy of Mathematical Reasoning by Language Models
ThinkARM abstracts LLM reasoning traces into Schoenfeld episodes and shows that exploration steps correlate with correctness while efficiency methods selectively suppress evaluative feedback.
- CORE: Concept-Oriented Reinforcement for Bridging the Definition-Application Gap in Mathematical Reasoning
CORE is a concept-oriented RL method that synthesizes quizzes, injects concept snippets into rollouts, and reinforces conceptual trajectories to close the gap between restating definitions and applying them in math problems.
- 4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation
4D-RGPT uses perceptual 4D distillation to boost region-level 4D perception in multimodal LLMs and reports gains on existing and new video QA benchmarks.
- Large Video Planner Enables Generalizable Robot Control
A video foundation model trained on human demonstrations generates zero-shot plans that convert to executable robot actions on novel scenes and tasks.
- ART: Articulated Reconstruction Transformer
ART is a category-agnostic transformer that maps sparse multi-state RGB images to per-part 3D geometry, texture, and articulation parameters via learnable part slots.
- ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body
ViBES introduces a speech-language-behavior model using modality-specific transformer experts that jointly generates dialogue and 3D body actions, showing gains over separate co-speech and text-to-motion baselines on multi-turn metrics.
- VideoASMR-Bench: Can AI-Generated ASMR Videos Fool VLMs and Humans?
VideoASMR-Bench shows that state-of-the-art VLMs fail to reliably distinguish AI-generated ASMR videos from real ones, though humans can still identify the fakes relatively easily.
- Setting the Stage: Text-Driven Scene-Consistent Image Generation
A new data pipeline using real photos, entity removal, and image-to-video models plus a cross-view attention loss enables text-driven generation of actors in reference scenes with improved alignment.
- Agile Deliberation: Concept Deliberation for Subjective Visual Classification
Agile Deliberation improves F1 scores by 7.5% over automated baselines and 3% over manual deliberation in 18 user sessions by supporting iterative refinement of subjective visual concepts.
- Preserving Source Video Realism: High-Fidelity Face Swapping for Cinematic Quality
LivingSwap is the first video reference-guided face swapping model that uses keyframe conditioning and temporal stitching to preserve source video realism with high fidelity across long sequences.
- Asking like Socrates: Socrates helps VLMs understand remote sensing images
RS-EoT uses a SocraticAgent self-play system and two-stage RL to train VLMs for genuine iterative reasoning and visual inspection on remote sensing VQA and grounding tasks, achieving SOTA results.
- Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance?
Introduces the first dedicated benchmark for live multi-modal LLM task guidance with mistake detection and a streaming baseline model.
- TRANSPORTER: Transferring Visual Semantics from VLM Manifolds
TRANSPORTER generates videos from VLM logits using optimal transport to interpret model predictions on object attributes, actions, and scenes.
- SAM 3: Segment Anything with Concepts
SAM 3 introduces promptable concept segmentation that doubles the accuracy of prior systems on images and videos while also improving standard SAM segmentation performance.
- MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning
MemSearcher trains LLMs to manage compact memory in multi-turn searches via multi-context GRPO for end-to-end RL, outperforming ReAct-style baselines with stable token counts.
- MMTutorBench: The First Multimodal Benchmark for AI Math Tutoring
MMTutorBench is the first multimodal benchmark for AI math tutoring with 685 problems, problem-specific rubrics across six dimensions, and evaluations of 12 MLLMs revealing performance gaps versus humans.
- CodeRL+: Improving Code Generation via Reinforcement with Execution Semantics Alignment
CodeRL+ integrates variable-level execution trajectory inference into RLVR training to align textual code representations with execution semantics, delivering 4.6% relative pass@1 gains and generalization to code-reasoning and test-output tasks.
- XModBench: Benchmarking Cross-Modal Capabilities and Consistency in Omni-Language Models
XModBench is a tri-modal benchmark that systematically measures cross-modal consistency, modality disparities, and directional imbalances in omni-language models across five task families and all modality combinations.
- SecureWebArena: A Holistic Security Evaluation Benchmark for LVLM-based Web Agents
SecureWebArena is a new benchmark suite for holistic security evaluation of LVLM-based web agents using diverse simulated environments, attack taxonomies, and multi-layered failure analysis across reasoning, behavior, and outcomes.
- When Thoughts Meet Facts: Reusable Reasoning for Long-Context LMs
Thought templates derived from training traces and refined via natural-language feedback improve multi-hop reasoning performance in long-context LMs across benchmarks and can be distilled into smaller models.
- Efficient numeracy in language models through single-token number embeddings
BitTokens represent numbers as single tokens via IEEE 754 binary format, allowing small language models to learn basic arithmetic algorithms nearly perfectly.
- Turning Drift into Constraint: Robust Reasoning Alignment in Non-Stationary Multi-Stream Environments
The APO framework aligns multi-source MLLM reasoning under concept drift by using inter-model divergences as negative constraints, via supervised bootstrapping and multi-negative Plackett-Luce optimization. A 7B model trained this way outperforms proprietary sources on chest X-ray tasks and on a new CXR-MAX benchmark.
- Coevolutionary Continuous Discrete Diffusion: Make Your Diffusion Language Model a Latent Reasoner
CCDD defines a joint multimodal diffusion on continuous representation space and discrete token space to combine expressivity with explicit token supervision for diffusion language models.
- Game-Time: Evaluating Temporal Dynamics in Spoken Language Models
Game-Time Benchmark shows spoken language models handle basic tasks but degrade sharply under temporal constraints like tempo adherence and synchronized responses.
- Rethinking Parameter Sharing for LLM Fine-Tuning with Multiple LoRAs
By sharing the B matrix across adapters instead of the A matrix, ALoRA and Fed-ALoRA deliver more balanced performance in multi-task and federated LLM fine-tuning.
- Beyond Classification Accuracy: Neural-MedBench and the Need for Deeper Reasoning Benchmarks
Neural-MedBench reveals sharp performance drops in state-of-the-art VLMs on reasoning-intensive neurology tasks compared to conventional classification benchmarks, with reasoning failures dominating errors.
- StereoFoley: Object-Aware Stereo Audio Generation from Video
StereoFoley is an end-to-end video-to-stereo-audio framework that uses a base generative model fine-tuned on synthetic object-tracked data with panning and distance controls to achieve object-aware spatial sound.
- Token Buncher: Shielding LLMs from Harmful Reinforcement Learning Fine-Tuning
TokenBuncher constrains response entropy via entropy-as-reward RL and a Token Noiser to stop harmful RL fine-tuning while keeping benign performance intact.
- KompeteAI: Accelerated Autonomous Multi-Agent System for End-to-End Pipeline Generation for Machine Learning Problems
KompeteAI accelerates AutoML pipeline evaluation 6.9 times and beats prior systems by 3% on MLE-Bench through candidate merging, external RAG, and predictive early scoring.