super hub Mixed citations

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Eric Bieber, Gheorghe Comanici, Ice Pasupat, Inderjit Dhillon, Mike Schaekermann, Noveen Sachdeva · 2025 · cs.CL · arXiv 2507.06261

Mixed citation behavior. Most common role is background (55%).

972 Pith papers citing it

Background 55% of classified citations

open full Pith review browse 972 citing papers more from Eric Bieber arXiv PDF

abstract

In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements and Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost. Taken together, the Gemini 2.X model generation spans the full Pareto frontier of model capability vs cost, allowing users to explore the boundaries of what is possible with complex agentic problem solving.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 122 baseline 46 method 28 other 8 dataset 3

citation-polarity summary

background 114 baseline 47 use method 28 unclear 12 support 3 use dataset 3

claims ledger

abstract In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. G

authors

Eric Bieber Gheorghe Comanici Ice Pasupat Inderjit Dhillon Mike Schaekermann Noveen Sachdeva

co-cited works

representative citing papers

EHRNote-ChatQA: A Benchmark for Evidence-Grounded Multi-Turn Clinical Question Answering over Longitudinal Discharge Summaries

cs.CL · 2026-06-14 · unverdicted · novelty 8.0

EHRNote-ChatQA is the first benchmark for evidence-grounded multi-turn clinical QA over longitudinal discharge summaries, containing 16,072 medical-expert-verified pairs across eight categories and revealing LLM weaknesses in evidence grounding and multi-turn consistency.

HKJudge: A Legal Discourse-Annotated Corpus for Interpreting What Courts Find, How They Reason, and What They Rule

cs.CL · 2026-06-04 · accept · novelty 8.0

HKJudge is a new ~290k-sentence expert-annotated corpus of Hong Kong criminal judgments with 26 rhetorical roles and 3 sentencing elements, plus benchmarks on classification and extraction tasks.

RRP-Voice: A Longitudinal Dataset and Benchmark for Recurrent Respiratory Papillomatosis Detection

eess.AS · 2026-06-01 · unverdicted · novelty 8.0

Introduces the first longitudinal voice dataset for RRP with benchmarks across handcrafted features, deep networks, self-supervised models, and audio LLMs under patient-level validation.

VideoFDB: Evaluating Full-Duplex Vision-Speech Capabilities in Conversational Agents

cs.CV · 2026-05-28 · unverdicted · novelty 8.0

VideoFDB is a new benchmark and LM-as-judge framework for evaluating full-duplex audio-visual-to-audio-visual conversational agents on nonverbal dynamics from real video calls.

EgoIntrospect: An Egocentric Dataset and Benchmark for User-Centric Internal State Reasoning

cs.CV · 2026-05-17 · unverdicted · novelty 8.0

EgoIntrospect provides the first egocentric dataset with self-annotations for internal state tasks and shows multimodal LLMs struggle to infer subjective states from combined signals.

Tracing Persona Vectors Through LLM Pretraining

cs.CL · 2026-05-13 · unverdicted · novelty 8.0

Persona vectors form within the first 0.22% of LLM pretraining and remain effective for steering post-trained models, with continued refinement and transfer to other models.

Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models

cs.AR · 2026-05-11 · conditional · novelty 8.0

Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.

Knowledge Poisoning Attacks on Medical Multi-Modal Retrieval-Augmented Generation

cs.CR · 2026-05-11 · unverdicted · novelty 8.0

M³Att poisons medical multimodal RAG by pairing covert textual misinformation with query-agnostic visual perturbations that increase retrieval of the bad content, causing LLMs to generate clinically plausible but incorrect responses.

Omni-DeepSearch: A Benchmark for Audio-Driven Omni-Modal Deep Search

cs.SD · 2026-05-09 · unverdicted · novelty 8.0

Omni-DeepSearch is a 640-sample benchmark for audio-driven omni-modal search where the best model reaches only 43.44% accuracy, exposing bottlenecks in audio inference, tool use, and cross-modal reasoning.

TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

cs.CV · 2026-05-08 · unverdicted · novelty 8.0

TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images

cs.CV · 2026-04-23 · unverdicted · novelty 8.0

S1-VL combines structured scientific reasoning with iterative image manipulation via code execution to reach state-of-the-art results on visual and scientific reasoning benchmarks.

Lost in Translation: Do LVLM Judges Generalize Across Languages?

cs.CL · 2026-04-21 · unverdicted · novelty 8.0

MM-JudgeBench shows substantial cross-lingual performance variance in 22 LVLM judges, with model size and architecture as poor predictors of multilingual robustness.

HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models

cs.SD · 2026-04-21 · unverdicted · novelty 8.0

HalluAudio is the first large-scale benchmark spanning speech, environmental sound, and music that uses human-verified QA pairs, adversarial prompts, and mixed-audio tests to measure hallucinations in large audio-language models.

When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models

cs.CV · 2026-04-19 · unverdicted · novelty 8.0

VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-specific experts and adaptive routing.

Narrative over Numbers: The Identifiable Victim Effect and its Amplification Under Alignment and Reasoning in Large Language Models

cs.CL · 2026-04-13 · conditional · novelty 8.0

Large language models display the identifiable victim effect at roughly twice the human baseline, strongly amplified by instruction tuning and chain-of-thought prompting but inverted by reasoning-specialized models.

MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark

cs.CV · 2026-04-12 · unverdicted · novelty 8.0 · 2 refs

MMRareBench provides 1,756 QA pairs and 7,958 images from PMC rare-disease cases to evaluate 23 MLLMs, revealing low treatment-planning scores and medical models underperforming general models on multi-image tasks due to capacity dilution.

HM-Bench: A Comprehensive Benchmark for Multimodal Large Language Models in Hyperspectral Remote Sensing

cs.CV · 2026-04-10 · accept · novelty 8.0

HM-Bench is the first benchmark for MLLMs on hyperspectral images, showing models struggle with complex spatial-spectral reasoning and perform better with visual PCA images than textual reports.

DialBGM: A Benchmark for Background Music Recommendation from Everyday Multi-Turn Dialogues

cs.AI · 2026-04-09 · unverdicted · novelty 8.0

DialBGM is a new benchmark dataset revealing that existing AI models fall far short of human performance when recommending fitting background music for open-domain conversations.

V2X-QA: A Comprehensive Reasoning Dataset and Benchmark for Multimodal Large Language Models in Autonomous Driving Across Ego, Infrastructure, and Cooperative Views

cs.RO · 2026-04-03 · conditional · novelty 8.0

V2X-QA provides a view-decoupled benchmark showing infrastructure views aid macroscopic traffic understanding while cooperative reasoning requires explicit cross-view alignment, with V2X-MoE as a routing-based baseline that improves performance.

ScreenParse: Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision

cs.CV · 2026-02-15 · conditional · novelty 8.0

ScreenParse dataset and ScreenVLM model deliver dense screen parsing that outperforms larger VLMs on PageIoU and transfers to better UI grounding.

EgoSound: Benchmarking Sound Understanding in Egocentric Videos

cs.CV · 2026-02-15 · unverdicted · novelty 8.0

EgoSound is a new benchmark with 7315 QA pairs across seven tasks to evaluate egocentric sound understanding in multimodal large language models.

VLRS-Bench: A Vision-Language Reasoning Benchmark for Remote Sensing

cs.CV · 2026-02-04 · unverdicted · novelty 8.0

VLRS-Bench is the first benchmark dedicated to complex vision-language reasoning in remote sensing, with 2000 QA pairs across 14 tasks in cognition, decision, and prediction dimensions.

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

cs.CV · 2026-01-15 · unverdicted · novelty 8.0

Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.

ConceptPose: Training-Free Zero-Shot Object Pose Estimation using Concept Vectors

cs.CV · 2025-12-09 · unverdicted · novelty 8.0

ConceptPose delivers state-of-the-art zero-shot relative pose estimation by matching open-vocabulary 3D concept vectors derived from VLM saliency maps, beating the strongest baseline by 62% in ADD(-S) without training.

citing papers explorer

Showing 50 of 972 citing papers.

PersonaVLM: Long-Term Personalized Multimodal LLMs cs.CL · 2026-03-20 · unverdicted · none · ref 8 · internal anchor
PersonaVLM adds memory extraction, multi-turn retrieval-based reasoning, and personality inference to multimodal LLMs, yielding 22.4% gains on a new long-term personalization benchmark and outperforming GPT-4o.
Cognitive Mismatch in Multimodal Large Language Models for Discrete Symbol Understanding cs.AI · 2026-03-19 · unverdicted · none · ref 125 · internal anchor
MLLMs exhibit a consistent recognition-reasoning inversion on discrete visual symbols across domains, underperforming on elementary perception while appearing competent on higher-level reasoning via linguistic compensation.
Towards Safer Large Reasoning Models by Promoting Safety Decision-Making before Chain-of-Thought Generation cs.AI · 2026-03-18 · unverdicted · none · ref 2 · internal anchor
Safety degradation in large reasoning models occurs only after chain-of-thought is enabled; adding pre-CoT safety signals from a BERT classifier on safe models improves safety while preserving reasoning ability.
AdaQE-CG: Adaptive Query Expansion for Web-Scale Generative AI Model and Data Card Generation cs.AI · 2026-03-16 · unverdicted · none · ref 10 · internal anchor
AdaQE-CG uses context-aware adaptive query expansion and inter-card knowledge transfer from a MetaGAI Pool to generate higher-quality model and data cards than prior methods, validated on the new expert-annotated MetaGAI-Bench.
Logics-Parsing-Omni Technical Report cs.AI · 2026-03-10 · unverdicted · none · ref 6 · internal anchor
Omni Parsing framework converts complex multimodal signals into locatable, enumerable, and traceable structured knowledge via hierarchical detection, recognition, and interpreting with strict evidence alignment.
MICA: Multi-granularity Intertemporal Credit Assignment for Long-Horizon Emotional Support Dialogue cs.CL · 2026-03-06 · unverdicted · none · ref 63 · internal anchor
MICA combines incremental per-turn distance rewards and Monte Carlo returns from a shared potential function over user support states to create a mixed advantage signal that enables stable multi-turn RL optimization for emotional support dialogues.
CRISP: Compressed Reasoning via Iterative Self-Policy Distillation cs.LG · 2026-03-05 · conditional · none · ref 3 · internal anchor
CRISP achieves 57-59% token reduction on MATH-500 with 9-16 point accuracy gains on Qwen3 models via iterative self-distillation of concise reasoning behavior.
TW-Sound580K: A Regional Audio-Text Dataset with Verification-Guided Curation for Localized Audio-Language Modeling cs.SD · 2026-03-05 · unverdicted · none · ref 42 · internal anchor
TW-Sound580K dataset plus Tai-LALM model with dynamic Dual-ASR arbitration lifts localized Taiwanese audio-language accuracy to 49.1% on the TAU benchmark.
Nano-EmoX: Unifying Multimodal Emotional Intelligence from Perception to Empathy cs.AI · 2026-03-02 · unverdicted · none · ref 16 · internal anchor
Nano-EmoX is a compact 2.2B multimodal model that unifies six core affective tasks across perception, understanding, and interaction levels via a curriculum framework, achieving competitive benchmark performance.
How RL Unlocks the Aha Moment in Geometric Interleaved Reasoning cs.CL · 2026-03-01 · unverdicted · none · ref 56 · internal anchor
Reinforcement learning with three causal constraints enables multimodal models to internalize diagram-reasoning links in geometry, unlike SFT which only mimics surface format and harms performance.
GroupGPT: A Token-efficient and Privacy-preserving Agentic Framework for Multi-User Chat Assistant cs.CL · 2026-03-01 · unverdicted · none · ref 15 · internal anchor
GroupGPT decouples intervention timing from response generation via edge-cloud collaboration for multi-user chats, scoring 4.72/5 on the new MUIR benchmark of 2500 segments while cutting token use by up to 3x and adding privacy sanitization.
A Minimal Agent for Automated Theorem Proving cs.AI · 2026-02-27 · unverdicted · none · ref 58 · internal anchor
A minimal agentic system achieves competitive performance in automated theorem proving with a simpler design and lower cost than state-of-the-art methods.
GeoWorld: Geometric World Models cs.CV · 2026-02-26 · unverdicted · none · ref 20 · internal anchor
GeoWorld applies hyperbolic geometry to JEPA world models and introduces geometric reinforcement learning, reporting modest success-rate gains of ~3% and ~2% on 3- and 4-step planning tasks versus V-JEPA 2.
Universal Pose Pretraining for Generalizable Vision-Language-Action Policies cs.CV · 2026-02-23 · unverdicted · none · ref 13 · internal anchor
Pose-VLA uses a decoupled two-stage pre-training with discrete pose tokens to extract universal 3D spatial priors from 3D datasets and robotic trajectories, achieving 79.5% success on RoboTwin 2.0 and 96.0% on LIBERO.
Multi-Modal Multi-Agent Reinforcement Learning for Radiology Report Generation cs.CV · 2026-02-17 · unverdicted · none · ref 31 · internal anchor
MARL-Rad trains region-specific and global agents with reinforcement learning on clinical rewards to produce more accurate radiology reports than prior methods on MIMIC-CXR and IU X-ray datasets.
LLaMo: Scaling Pretrained Language Models for Unified Motion Understanding and Generation with Continuous Autoregressive Tokens cs.CV · 2026-02-12 · unverdicted · none · ref 8 · internal anchor
LLaMo scales pretrained LLMs for unified motion-language tasks by encoding motion into continuous causal latents and adding a flow-matching head for real-time autoregressive generation and captioning.
Thinking with Drafting: Optical Decompression via Logical Reconstruction cs.CL · 2026-02-12 · unverdicted · none · ref 12 · internal anchor
Thinking with Drafting reconceptualizes visual reasoning as optical decompression by forcing models to draft mental models into executable DSL code for deterministic self-verification on the VisAlg benchmark.
Learning to Configure Agentic AI Systems cs.AI · 2026-02-12 · unverdicted · none · ref 5 · 2 links · internal anchor
ARC learns per-query agent configurations via a lightweight hierarchical SMDP policy, delivering 31.3% higher reasoning accuracy, 13.95% higher tool-use accuracy, and doubled success on an agent benchmark compared to budget-matched baselines.
VisPhyWorld: Probing Physical Reasoning via Code-Driven Video Reconstruction cs.CV · 2026-02-09 · unverdicted · none · ref 16 · internal anchor
VisPhyWorld evaluates MLLMs' physical reasoning via executable code generation for video reconstruction, with VisPhyBench showing strong semantics but weak parameter inference and dynamics simulation.
Dual Tuning for Reasoning Efficacy-Driven Data Curation in Multimodal LLM Training cs.CL · 2026-02-04 · unverdicted · none · ref 28 · internal anchor
Dual Tuning is a data curation method that jointly scores training examples for benefit and for reasoning-gain to choose between reasoning and direct-answer post-training modes for multimodal LLMs.
Understanding and Exploiting Weight Update Sparsity for Communication-Efficient Distributed RL cs.LG · 2026-02-03 · unverdicted · none · ref 11 · internal anchor
PULSE exploits BF16-invisible sparsity in weight updates to enable over 100x lower communication in distributed RL post-training via compute-visible sparsification.
HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding cs.CV · 2026-01-21 · unverdicted · none · ref 13 · internal anchor
HERMES organizes the KV cache into a hierarchical memory to enable real-time streaming video understanding in MLLMs, achieving 10x faster TTFT and up to 11.4% accuracy gains on streaming benchmarks with 68% fewer tokens.
EGM: Efficient Visual Grounding Language Models cs.CV · 2026-01-20 · unverdicted · none · ref 8 · internal anchor
EGM enables 8B VLMs to reach 91.4 IoU on RefCOCO at 737 ms latency, outperforming a 235B model at 4320 ms, by substituting volume of mid-quality tokens for model scale.
Measuring Social Bias in Vision-Language Models with Face-Only Counterfactuals from Real Photos cs.CV · 2026-01-11 · conditional · none · ref 2 · internal anchor
VLMs exhibit demographic biases in occupation and salary decisions even when only faces are altered in otherwise identical real photos.
No More Stale Feedback: Co-Evolving Critics for Open-World Agent Learning cs.AI · 2026-01-11 · unverdicted · none · ref 3 · internal anchor
ECHO jointly optimizes policy and critic via co-evolution, cascaded rollouts, and saturation-aware shaping to deliver non-stale feedback and higher success in open-world LLM agent RL.
CycleChart: A Unified Consistency-Based Learning Framework for Bidirectional Chart Understanding and Generation cs.CL · 2025-12-22 · unverdicted · none · ref 3 · internal anchor
CycleChart is a consistency-based framework that organizes chart generation, schema parsing, data parsing, and QA around single data instances to enforce bidirectional semantic alignment and improve cross-task generalization.
FPBench: A Comprehensive Benchmark of Multimodal Large Language Models for Fingerprint Analysis cs.CV · 2025-12-19 · conditional · none · ref 8 · internal anchor
FPBench evaluates 20 MLLMs across 8 fingerprint tasks on 7 datasets and shows fine-tuning vision and language encoders improves performance by 7-39%.
BLASST: Dynamic BLocked Attention Sparsity via Softmax Thresholding cs.CL · 2025-12-12 · unverdicted · none · ref 6 · internal anchor
BLASST dynamically sparsifies attention by thresholding softmax scores to skip blocks, delivering 1.5x speedups at 70%+ sparsity while preserving benchmark accuracy.
Detector-Empowered Video Large Language Model for Efficient Spatio-Temporal Grounding cs.CV · 2025-12-07 · conditional · none · ref 12 · internal anchor
DEViL offloads spatial grounding to a detector via a distilled reference-semantic token and temporal consistency regularization, reaching 43.1% m_vIoU at 14.33 FPS on HC-STVG.
MedGRPO: Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding cs.CV · 2025-12-06 · conditional · none · ref 8 · internal anchor
MedGRPO applies cross-dataset reward normalization and a clinical LLM judge within multi-task RL to improve vision-language models on heterogeneous medical video understanding tasks using the new MedVidBench dataset.
TRINITY: An Evolved LLM Coordinator cs.LG · 2025-12-04 · unverdicted · none · ref 3 · internal anchor
A compact 0.6B-parameter coordinator with a 10K-parameter head uses evolutionary strategy to dynamically delegate roles to LLMs, achieving SOTA results such as 86.2% on LiveCodeBench.
Agentic Learner with Grow-and-Refine Multimodal Semantic Memory cs.AI · 2025-11-26 · unverdicted · none · ref 10 · internal anchor
ViLoMem is a dual-stream grow-and-refine memory system that separates visual and logical error patterns in MLLMs to improve pass@1 accuracy and reduce repeated mistakes across six multimodal benchmarks.
Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory cs.CL · 2025-11-25 · unverdicted · none · ref 43 · internal anchor
Evo-Memory is a new streaming benchmark and evaluation framework for self-evolving memory in LLM agents, unifying over ten memory modules and introducing the ReMem pipeline for continual improvement on multi-turn and reasoning datasets.
SPHINX: A Synthetic Environment for Visual Perception and Reasoning cs.CV · 2025-11-25 · unverdicted · none · ref 13 · internal anchor
SPHINX generates synthetic visual puzzles for benchmarking LVLMs, where GPT-5 scores 51.1% and RLVR training improves both in-domain and external visual reasoning performance.
LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling cs.CV · 2025-11-25 · unverdicted · none · ref 5 · internal anchor
LongVT adds native video-cropping tool calling to LMMs for interleaved multimodal chain-of-tool-thought reasoning on long videos and releases VideoSIAH data for training and evaluation.
SPEAR-1: Scaling Beyond Robot Demonstrations via 3D Understanding cs.RO · 2025-11-21 · unverdicted · none · ref 9 · internal anchor
SPEAR-1 combines a 3D-enriched VLM with embodied control to match or exceed existing robotic foundation models using 20 times fewer robot demonstrations.
Progress-Think: Semantic Progress Reasoning for Vision-Language Navigation cs.RO · 2025-11-21 · unverdicted · none · ref 9 · internal anchor
Semantic progress reasoning predicts instruction-style advancement from visual history to guide policies, yielding state-of-the-art success and efficiency on R2R-CE and RxR-CE.
Insert In Style: A Zero-Shot Generative Framework for Harmonious Cross-Domain Object Composition cs.CV · 2025-11-19 · unverdicted · none · ref 48 · internal anchor
Insert In Style is a zero-shot framework that disentangles identity, style, and composition via multi-stage training, masked attention, and prior preservation to enable harmonious cross-domain object insertion in images.
UniSER: A Foundation Model for Unified Soft Effects Removal cs.CV · 2025-11-18 · unverdicted · none · ref 16 · internal anchor
UniSER is a unified diffusion transformer foundation model that removes diverse soft image degradations by training on a large curated dataset of semi-transparent occlusions with fine-grained controls.
Frontier Large Language Models Rival State-of-the-Art Planners cs.AI · 2025-11-12 · conditional · none · ref 5 · internal anchor
Frontier LLMs now match or exceed state-of-the-art classical planners on IPC planning tasks, with Gemini 3.1 Pro solving 245 of 360 tasks versus 234 for the best baseline.
AlphaCast: A Human Wisdom-LLM Intelligence Co-Reasoning Framework for Interactive Time Series Forecasting cs.AI · 2025-11-12 · conditional · none · ref 5 · internal anchor
AlphaCast is a training-free LLM framework that performs interactive multi-stage reasoning for time series forecasting by integrating feature extraction, knowledge bases, case libraries, and contextual pools.
Cambrian-S: Towards Spatial Supersensing in Video cs.CV · 2025-11-06 · unverdicted · none · ref 30 · internal anchor
Cambrian-S introduces VSI-SUPER benchmarks for long-horizon spatial recall and counting, shows data scaling yields 30% gains on existing tests, and demonstrates a self-supervised next-latent predictor using surprise outperforms baselines on the new spatial supersensing tasks.
Mathematical exploration and discovery at scale cs.NE · 2025-11-03 · unverdicted · none · ref 71 · internal anchor
AlphaEvolve rediscovered best-known solutions for most of 67 tested math problems and found improved solutions in several cases using LLM-guided evolutionary search.
FlexiCache: Leveraging Temporal Stability of Attention Heads for Efficient KV Cache Management cs.LG · 2025-11-02 · unverdicted · none · ref 4 · internal anchor
FlexiCache reduces GPU memory for long-context LLM requests by up to 70% and boosts throughput 1.38-1.55x and latency 1.6-2.1x by exploiting per-head differences in temporal stability of critical tokens.
Emu3.5: Native Multimodal Models are World Learners cs.CV · 2025-10-30 · unverdicted · none · ref 92 · internal anchor
Emu3.5 is a native multimodal world model pre-trained on over 10 trillion vision-language tokens with next-token prediction, post-trained via reinforcement learning, and accelerated by Discrete Diffusion Adaptation for efficient interleaved generation and world exploration.
Alpamayo-R1: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail cs.RO · 2025-10-30 · conditional · none · ref 11 · internal anchor
Alpamayo-R1 introduces a VLA model with a Chain of Causation dataset and multi-stage SFT-plus-RL training that reports 12% better planning accuracy and 35% fewer close encounters versus trajectory-only baselines in driving tasks.
SemanticOpt: Towards LLM-Based Semantic Black-Box Optimization cs.LG · 2025-10-29 · unverdicted · none · ref 4 · internal anchor
SemanticOpt fine-tunes LLMs on structured Bayesian optimization trajectories augmented with natural-language context to jointly use numerical and semantic evidence for black-box optimization.
From Tokens to Layers: Redefining Stall-Free Scheduling for MoE Serving with Layered Prefill cs.LG · 2025-10-09 · unverdicted · none · ref 4 · internal anchor
Layered prefill replaces token-chunked prefill with layer-group interleaving in MoE models, cutting TTFT by up to 70%, end-to-end latency by 41%, and per-token energy by 22% while preserving stall-free TBT.
Test-Time Matching: Unlocking Compositional Reasoning in Multimodal Models cs.AI · 2025-10-09 · unverdicted · none · ref 6 · internal anchor
Introduces group matching score for better evaluation of compositional reasoning and Test-Time Matching (TTM) algorithm for unsupervised self-improvement in multimodal models, achieving SOTA gains including surpassing GPT-4.1 and estimated human performance.
Inferring Dynamic Physical Properties from Video Foundation Models cs.CV · 2025-10-02 · unverdicted · none · ref 5 · internal anchor
Video foundation models infer dynamic physical properties such as elasticity, viscosity, and friction from videos at levels close to classical oracles while outperforming current MLLMs with suitable prompting.

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer