Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Eric Bieber, Gheorghe Comanici, Ice Pasupat, Inderjit Dhillon, Mike Schaekermann, Noveen Sachdeva · 2025 · cs.CL · arXiv 2507.06261

Mixed citation behavior. Most common role is background (55%).

813 Pith papers citing it

Background 55% of classified citations

abstract

In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements and Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost. Taken together, the Gemini 2.X model generation spans the full Pareto frontier of model capability vs cost, allowing users to explore the boundaries of what is possible with complex agentic problem solving.

citation-role summary

background 122 · baseline 46 · method 28 · other 8 · dataset 3

citation-polarity summary

background 114 · baseline 47 · use method 28 · unclear 12 · support 3 · use dataset 3

authors

Eric Bieber, Gheorghe Comanici, Ice Pasupat, Inderjit Dhillon, Mike Schaekermann, Noveen Sachdeva

representative citing papers

VideoFDB: Evaluating Full-Duplex Vision-Speech Capabilities in Conversational Agents

cs.CV · 2026-05-28 · unverdicted · novelty 8.0

VideoFDB is a new benchmark and LM-as-judge framework for evaluating full-duplex audio-visual-to-audio-visual conversational agents on nonverbal dynamics from real video calls.

EgoIntrospect: An Egocentric Dataset and Benchmark for User-Centric Internal State Reasoning

cs.CV · 2026-05-17 · unverdicted · novelty 8.0

EgoIntrospect provides the first egocentric dataset with self-annotations for internal state tasks and shows multimodal LLMs struggle to infer subjective states from combined signals.

Tracing Persona Vectors Through LLM Pretraining

cs.CL · 2026-05-13 · unverdicted · novelty 8.0

Persona vectors form within the first 0.22% of LLM pretraining and remain effective for steering post-trained models, with continued refinement and transfer to other models.

Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models

cs.AR · 2026-05-11 · conditional · novelty 8.0

Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.

Knowledge Poisoning Attacks on Medical Multi-Modal Retrieval-Augmented Generation

cs.CR · 2026-05-11 · unverdicted · novelty 8.0

M³Att poisons medical multimodal RAG by pairing covert textual misinformation with query-agnostic visual perturbations that increase retrieval of the bad content, causing LLMs to generate clinically plausible but incorrect responses.

Omni-DeepSearch: A Benchmark for Audio-Driven Omni-Modal Deep Search

cs.SD · 2026-05-09 · unverdicted · novelty 8.0

Omni-DeepSearch is a 640-sample benchmark for audio-driven omni-modal search where the best model reaches only 43.44% accuracy, exposing bottlenecks in audio inference, tool use, and cross-modal reasoning.

TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

cs.CV · 2026-05-08 · unverdicted · novelty 8.0

TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images

cs.CV · 2026-04-23 · unverdicted · novelty 8.0

S1-VL combines structured scientific reasoning with iterative image manipulation via code execution to reach state-of-the-art results on visual and scientific reasoning benchmarks.

Lost in Translation: Do LVLM Judges Generalize Across Languages?

cs.CL · 2026-04-21 · unverdicted · novelty 8.0

MM-JudgeBench shows substantial cross-lingual performance variance in 22 LVLM judges, with model size and architecture as poor predictors of multilingual robustness.

HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models

cs.SD · 2026-04-21 · unverdicted · novelty 8.0

HalluAudio is the first large-scale benchmark spanning speech, environmental sound, and music that uses human-verified QA pairs, adversarial prompts, and mixed-audio tests to measure hallucinations in large audio-language models.

When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models

cs.CV · 2026-04-19 · unverdicted · novelty 8.0

VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-specific experts and adaptive routing.

Narrative over Numbers: The Identifiable Victim Effect and its Amplification Under Alignment and Reasoning in Large Language Models

cs.CL · 2026-04-13 · conditional · novelty 8.0

Large language models display the identifiable victim effect at roughly twice the human baseline, strongly amplified by instruction tuning and chain-of-thought prompting but inverted by reasoning-specialized models.

MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark

cs.CV · 2026-04-12 · unverdicted · novelty 8.0 · 2 refs

MMRareBench provides 1,756 QA pairs and 7,958 images from PMC rare-disease cases to evaluate 23 MLLMs, revealing low treatment-planning scores and medical models underperforming general models on multi-image tasks due to capacity dilution.

HM-Bench: A Comprehensive Benchmark for Multimodal Large Language Models in Hyperspectral Remote Sensing

cs.CV · 2026-04-10 · accept · novelty 8.0

HM-Bench is the first benchmark for MLLMs on hyperspectral images, showing models struggle with complex spatial-spectral reasoning and perform better with visual PCA images than textual reports.

DialBGM: A Benchmark for Background Music Recommendation from Everyday Multi-Turn Dialogues

cs.AI · 2026-04-09 · unverdicted · novelty 8.0

DialBGM is a new benchmark dataset revealing that existing AI models fall far short of human performance when recommending fitting background music for open-domain conversations.

V2X-QA: A Comprehensive Reasoning Dataset and Benchmark for Multimodal Large Language Models in Autonomous Driving Across Ego, Infrastructure, and Cooperative Views

cs.RO · 2026-04-03 · conditional · novelty 8.0

V2X-QA provides a view-decoupled benchmark showing infrastructure views aid macroscopic traffic understanding while cooperative reasoning requires explicit cross-view alignment, with V2X-MoE as a routing-based baseline that improves performance.

ScreenParse: Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision

cs.CV · 2026-02-15 · conditional · novelty 8.0

ScreenParse dataset and ScreenVLM model deliver dense screen parsing that outperforms larger VLMs on PageIoU and transfers to better UI grounding.

EgoSound: Benchmarking Sound Understanding in Egocentric Videos

cs.CV · 2026-02-15 · unverdicted · novelty 8.0

EgoSound is a new benchmark with 7315 QA pairs across seven tasks to evaluate egocentric sound understanding in multimodal large language models.

VLRS-Bench: A Vision-Language Reasoning Benchmark for Remote Sensing

cs.CV · 2026-02-04 · unverdicted · novelty 8.0

VLRS-Bench is the first benchmark dedicated to complex vision-language reasoning in remote sensing, with 2000 QA pairs across 14 tasks in cognition, decision, and prediction dimensions.

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

cs.CV · 2026-01-15 · unverdicted · novelty 8.0

Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.

ConceptPose: Training-Free Zero-Shot Object Pose Estimation using Concept Vectors

cs.CV · 2025-12-09 · unverdicted · novelty 8.0

ConceptPose delivers state-of-the-art zero-shot relative pose estimation by matching open-vocabulary 3D concept vectors derived from VLM saliency maps, beating the strongest baseline by 62% in ADD(-S) without training.

ToG-Bench: Task-Oriented Spatio-Temporal Grounding in Egocentric Videos

cs.CV · 2025-12-03 · accept · novelty 8.0

ToG-Bench is the first benchmark for task-oriented spatio-temporal video grounding in egocentric videos, with explicit-implicit dual grounding and one-to-many object scenarios across 100 ScanNet clips and 2704 instructions.

Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark

cs.AI · 2025-09-30 · unverdicted · novelty 8.0

CritPt benchmark shows state-of-the-art LLMs reach only 5.7% average accuracy on full-scale unpublished physics research tasks, rising to about 10% with coding tools.

Towards in-the-wild Egocentric 3D Hand-Object Pose Estimation

cs.CV · 2026-06-29 · conditional · novelty 7.0

Introduces EPIC-Contact dataset and HOPformer transformer for in-the-wild egocentric 3D hand-object pose estimation, reporting 82.4% success on ARCTIC and doubled success with 75% lower contact error on the new dataset.

citing papers explorer

Showing 50 of 71 citing papers after filters.

OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification cs.LG · 2026-05-31 · unverdicted · none · ref 5 · internal anchor
OmniOPD replaces token-level logit matching in on-policy distillation with Monte Carlo chunk-level semantic verification and a peak-entropy scheduler.
Bastion: Budget-Aware Speculative Decoding with Tree-structured Block Diffusion Drafting cs.LG · 2026-05-28 · unverdicted · none · ref 15 · internal anchor
BASTION is a budget-aware speculative decoding framework with adaptive tree-structured block diffusion drafting that reports up to 6.61x speedup and 39% improvement over block-diffusion baselines.
K-FinHallu: A Hallucination Detection Benchmark for Multi-Turn RAG in Korean Finance cs.LG · 2026-05-28 · unverdicted · none · ref 4 · internal anchor
K-FinHallu is the first multi-turn Korean financial RAG hallucination benchmark; frontier LLMs struggle especially on justified abstention while an 8B fine-tuned model reaches competitive performance.
Aperiodic and Low-Frequency Spectral Bias in Reconstruction based EEG Foundation Models cs.LG · 2026-05-26 · unverdicted · none · ref 6 · internal anchor
Reconstruction-based EEG foundation models preferentially encode aperiodic and low-frequency components over oscillatory structure, with embeddings capturing subject identity more than task-relevant information.
Not only where, But when: Temporal Scheduling for RLVR cs.LG · 2026-05-25 · unverdicted · none · ref 1 · internal anchor
Temporal scheduling of credit allocation criteria over RLVR training, using trajectory percentiles to target heterogeneous behaviors, yields more stable policy entropy and better reasoning benchmark results than static allocation.
Detecting and Mitigating the Correct-Answer Extinction Window in Test-Time Reinforcement Learning with Majority Voting cs.LG · 2026-05-19 · unverdicted · none · ref 15 · internal anchor
TTRL gains are reinterpreted as mostly sharpening rather than learning, with an identified extinction window causing net corruption; TTRL-Guard mitigates via FRS, MPS, and RCSU for improved pass@1.
Automated Kernel Discovery Towards Understanding High-dimensional Bayesian Optimization cs.LG · 2026-05-18 · unverdicted · none · ref 29 · internal anchor
An LLM-based evolutionary search discovers novel kernels for high-dimensional Bayesian optimization, achieving an average rank of 1.2 out of 17 on five benchmarks via two-stage proposal and LOO-CRPS selection.
RxEval: A Prescription-Level Benchmark for Evaluating LLM Medication Recommendation cs.LG · 2026-05-14 · unverdicted · none · ref 9 · internal anchor
RxEval benchmark shows frontier LLMs reach at most 46.10% exact match on prescription-level medication, dose, and route selection from real patient trajectories.
MulTaBench: Benchmarking Multimodal Tabular Learning with Text and Image cs.LG · 2026-05-11 · unverdicted · none · ref 14 · internal anchor
MulTaBench is a new collection of 40 image-tabular and text-tabular datasets designed to test target-aware representation tuning in multimodal tabular models.
The Extrapolation Cliff in On-Policy Distillation of Near-Deterministic Structured Outputs cs.LG · 2026-05-09 · unverdicted · none · ref 8 · internal anchor
On-policy distillation has an extrapolation cliff at closed-form lambda*(p,b,c) set by teacher modal probability, warm-start mass, and clip strength, past which training shifts from format-preserving to format-collapsing.
MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI cs.LG · 2026-05-09 · unverdicted · none · ref 21 · internal anchor
MLS-Bench is a benchmark with 140 tasks that evaluates AI agents on inventing generalizable and scalable ML methods, finding they lag human performance especially in insight-driven invention rather than tuning.
LLM Wardens: Mitigating Adversarial Persuasion with Third-Party Conversational Oversight cs.LG · 2026-05-08 · unverdicted · none · ref 43 · internal anchor
A secondary warden LLM halves the success rate of hidden-goal adversarial LLMs in steering user decisions while causing only minor interference with genuine interactions.
ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning cs.LG · 2026-05-01 · unverdicted · none · ref 32 · 2 links · internal anchor
ResRL decouples shared semantics between positive and negative responses in LLM reinforcement learning via SVD-based projection residuals, outperforming baselines including NSR by up to 9.4% on math reasoning benchmarks.
Stream-CQSA: Avoiding Out-of-Memory in Attention Computation via Flexible Workload Scheduling cs.LG · 2026-04-22 · unverdicted · none · ref 3 · internal anchor
Stream-CQSA uses CQS-based decomposition to stream exact attention computations for billion-token sequences on limited-memory hardware.
Character Beyond Speech: Leveraging Role-Playing Evaluation in Audio Large Language Models via Reinforcement Learning cs.LG · 2026-04-15 · unverdicted · none · ref 9 · internal anchor
RoleJudge is a multidimensional evaluation framework for speech-character alignment in audio LLMs, backed by the RoleChat dataset and multi-stage RL training with standard alignment to reduce reward issues.
Hierarchical SVG Tokenization: Learning Compact Visual Programs for Scalable Vector Graphics Modeling cs.LG · 2026-04-06 · unverdicted · none · ref 4 · internal anchor
HiVG introduces hierarchical SVG tokenization with atomic and segment tokens plus HMN initialization to enable more efficient and stable autoregressive generation of vector graphics programs.
Bayesian Preference Learning for Test-Time Steerable Reward Models cs.LG · 2026-02-09 · unverdicted · none · ref 2 · internal anchor
ICRM casts reward modeling as amortized variational inference over a latent preference probability with a Beta prior, enabling test-time adaptation to unseen preferences and improving benchmark performance.
Efficient numeracy in language models through single-token number embeddings cs.LG · 2025-10-08 · unverdicted · none · ref 8 · internal anchor
BitTokens represent numbers as single tokens via IEEE 754 binary format, allowing small language models to learn basic arithmetic algorithms nearly perfectly.
Rethinking Parameter Sharing for LLM Fine-Tuning with Multiple LoRAs cs.LG · 2025-09-29 · conditional · none · ref 9 · internal anchor
By sharing the B matrix across adapters instead of the A matrix, ALoRA and Fed-ALoRA deliver more balanced performance in multi-task and federated LLM fine-tuning.
Token Buncher: Shielding LLMs from Harmful Reinforcement Learning Fine-Tuning cs.LG · 2025-08-28 · unverdicted · none · ref 12 · internal anchor
TokenBuncher constrains response entropy via entropy-as-reward RL and a Token Noiser to stop harmful RL fine-tuning while keeping benign performance intact.
FastKV: Decoupling of Context Reduction and KV Cache Compression for Prefill-Decoding Acceleration cs.LG · 2025-02-03 · unverdicted · none · ref 9 · internal anchor
FastKV decouples prefill context reduction via Token-Selective Propagation from independent KV cache selection, delivering up to 1.82x prefill and 2.87x decoding speedups while matching decoding-only accuracy.
FoMoE: Breaking the Full-Replica Barrier with a Federation of MoEs cs.LG · 2026-06-17 · unverdicted · none · ref 92 · internal anchor
FoMoE partitions expert layers across workers in MoE LLMs, skips non-resident experts, and reports up to 1.42x lower communication than baselines plus 1.4x throughput gains while maintaining stable routing.
InfoAtlas: A Foundation Model for Zero-Shot Statistical Dependence Estimate cs.LG · 2026-05-29 · unverdicted · none · ref 209 · internal anchor
InfoAtlas is a pretrained neural model for zero-shot mutual information estimation that matches state-of-the-art accuracy with 100x speedup and handles varying dimensions via a single model.
DRIFT: Decoupled Rollouts and Importance-Weighted Fine-Tuning for Efficient Multi-Turn Optimization cs.LG · 2026-05-29 · unverdicted · none · ref 3 · internal anchor
DRIFT achieves multi-turn RL performance via offline importance-weighted SFT by leveraging the equivalence of KL-regularized RL to weighted supervised learning.
Diversity Matters: Revisiting Test-Time Compute in Vision-Language Models cs.LG · 2026-05-29 · unverdicted · none · ref 2 · internal anchor
Entropy-based test-time compute (ETTC) in VLM ensembles outperforms majority voting by prioritizing high-confidence predictions from stronger models.
LPDS: Evaluating LLM Robustness Through Logic-Preserving Difficulty Scaling cs.LG · 2026-05-14 · conditional · none · ref 9 · internal anchor
LPDS quantifies difficulty of logic-preserving problem variations and searches for the hardest ones, producing up to 5x larger performance drops than random sampling and better robustness gains from fine-tuning on difficult examples.
Revisiting DAgger in the Era of LLM-Agents cs.LG · 2026-05-13 · conditional · none · ref 11 · internal anchor
DAgger-style training with turn-level policy interpolation raises 4B and 8B LLM agents to 27.3% and 29.8% on SWE-bench Verified, beating several larger published systems.
Emergent and Subliminal Misalignment Through the Lens of Data-Mediated Transfer cs.LG · 2026-05-12 · unverdicted · none · ref 5 · internal anchor
Emergent and subliminal misalignment in LLMs arise from data structure interactions and transfer via benign distillation data, with stronger effects under shared functional structure and on-policy settings.
Holder Policy Optimisation cs.LG · 2026-05-12 · unverdicted · none · ref 21 · 2 links · internal anchor
HölderPO unifies token-level aggregation in GRPO via the Hölder mean with a tunable p parameter and annealing schedule, delivering 54.9% average accuracy on math benchmarks and 93.8% success on ALFWorld.
Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction cs.LG · 2026-05-10 · unverdicted · none · ref 6 · internal anchor
A unified learnable KV eviction policy with cross-layer calibration reduces memory and matches or exceeds full-cache performance on long-context tasks by retaining useful tokens and limiting attention dilution.
Why Does Agentic Safety Fail to Generalize Across Tasks? cs.LG · 2026-05-07 · conditional · none · ref 25 · internal anchor
Agentic safety fails to generalize across tasks because the task-to-safe-controller mapping has a higher Lipschitz constant than the task-to-controller mapping alone, as proven in linear-quadratic control and demonstrated in quadcopter and LLM experiments.
Optimal Transport for LLM Reward Modeling from Noisy Preference cs.LG · 2026-05-07 · unverdicted · none · ref 79 · internal anchor
SelectiveRM applies optimal transport with a joint consistency discrepancy and partial mass relaxation to produce reward models that optimize a tighter upper bound on clean risk while autonomously dropping noisy preference samples.
Decouple before Integration: Test-time Synthesis of SFT and RLVR Task Vectors cs.LG · 2026-05-01 · conditional · none · ref 9 · internal anchor
DoTS decouples SFT and RLVR training then synthesizes their task vectors at inference time to match integrated training results at ~3% compute cost.
The Power of Order: Fooling LLMs with Adversarial Table Permutations cs.LG · 2026-05-01 · unverdicted · none · ref 10 · 2 links · internal anchor
Semantically invariant row and column permutations in tables can cause LLMs to output incorrect answers, and a gradient-based attack called ATP efficiently finds such permutations that degrade performance across many models.
Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning cs.LG · 2026-05-01 · unverdicted · none · ref 29 · internal anchor
Odysseus adapts PPO with a turn-level critic and leverages pretrained VLM action priors to train agents achieving at least 3x average game progress over frontier models in long-horizon Super Mario Land.
Temporally Extended Mixture-of-Experts Models cs.LG · 2026-04-22 · unverdicted · none · ref 13 · internal anchor
Temporally extended MoE layers using the option-critic framework with deliberation costs cut switching rates below 5% while retaining most capability on MATH, MMLU, and MMMLU.
SAW-INT4: System-Aware 4-Bit KV-Cache Quantization for Real-World LLM Serving cs.LG · 2026-04-21 · unverdicted · none · ref 4 · internal anchor
Token-wise INT4 KV-cache quantization plus block-diagonal Hadamard rotation recovers nearly all accuracy lost by naive INT4 while adding zero end-to-end overhead under paged serving constraints.
LoReC: Rethinking Large Language Models for Graph Data Analysis cs.LG · 2026-04-20 · unverdicted · none · ref 8 · internal anchor
LoReC enhances LLMs for graph tasks via attention redistribution, graph re-injection into FFN, and logit rectification, yielding improvements over GraphLLM and GNN baselines on diverse datasets.
TOPCELL: Topology Optimization of Standard Cell via LLMs cs.LG · 2026-04-15 · unverdicted · none · ref 8 · internal anchor
TOPCELL reformulates standard cell topology optimization as an LLM generative task with GRPO fine-tuning, outperforming base models and matching exhaustive solvers with 85.91x speedup in 2nm/7nm industrial flows.
MoE Routing Testbed: Studying Expert Specialization and Routing Behavior at Small Scale cs.LG · 2026-04-08 · unverdicted · none · ref 1 · internal anchor
A new testbed for MoE models uses domain-based ideal routing as benchmark to quantify expert specialization and compare routing methods at small scale.
Rethinking Language Model Scaling under Transferable Hypersphere Optimization cs.LG · 2026-03-30 · conditional · none · ref 9 · internal anchor
HyperP transfers optimal learning rates across model width, depth, tokens, and MoE granularity under Frobenius-sphere constraints, delivering stable scaling and 1.58x efficiency gains.
CRISP: Compressed Reasoning via Iterative Self-Policy Distillation cs.LG · 2026-03-05 · conditional · none · ref 3 · internal anchor
CRISP achieves 57-59% token reduction on MATH-500 with 9-16 point accuracy gains on Qwen3 models via iterative self-distillation of concise reasoning behavior.
Understanding and Exploiting Weight Update Sparsity for Communication-Efficient Distributed RL cs.LG · 2026-02-03 · unverdicted · none · ref 11 · internal anchor
PULSE exploits BF16-invisible sparsity in weight updates to enable over 100x lower communication in distributed RL post-training via compute-visible sparsification.
TRINITY: An Evolved LLM Coordinator cs.LG · 2025-12-04 · unverdicted · none · ref 3 · internal anchor
A compact 0.6B-parameter coordinator with a 10K-parameter head uses evolutionary strategy to dynamically delegate roles to LLMs, achieving SOTA results such as 86.2% on LiveCodeBench.
FlexiCache: Leveraging Temporal Stability of Attention Heads for Efficient KV Cache Management cs.LG · 2025-11-02 · unverdicted · none · ref 4 · internal anchor
FlexiCache reduces GPU memory for long-context LLM requests by up to 70% and boosts throughput 1.38-1.55x and latency 1.6-2.1x by exploiting per-head differences in temporal stability of critical tokens.
SemanticOpt: Towards LLM-Based Semantic Black-Box Optimization cs.LG · 2025-10-29 · unverdicted · none · ref 4 · internal anchor
SemanticOpt fine-tunes LLMs on structured Bayesian optimization trajectories augmented with natural-language context to jointly use numerical and semantic evidence for black-box optimization.
From Tokens to Layers: Redefining Stall-Free Scheduling for MoE Serving with Layered Prefill cs.LG · 2025-10-09 · unverdicted · none · ref 4 · internal anchor
Layered prefill replaces token-chunked prefill with layer-group interleaving in MoE models, cutting TTFT by up to 70%, end-to-end latency by 41%, and per-token energy by 22% while preserving stall-free TBT.
Video models are zero-shot learners and reasoners cs.LG · 2025-09-24 · unverdicted · none · ref 2 · internal anchor
Generative video models exhibit emergent zero-shot capabilities across perception, manipulation, and basic reasoning tasks.
Experience Augmented Policy Optimization for LLM Reasoning cs.LG · 2026-06-29 · unverdicted · none · ref 3 · internal anchor
EAPO reuses prior RL policy experience adaptively at decision points in LLM rollouts with adapted importance sampling and reports gains over prior RLVR methods on math benchmarks.
FlowAWR: Online Adaptive Flow Reinforcement via Advantage-Weighted Rectification cs.LG · 2026-06-29 · unverdicted · none · ref 4 · internal anchor
FlowAWR derives an advantage-weighted rectification for optimal velocity fields in flow models, claiming 2-5x faster convergence than DiffusionNFT on SD3.5-Medium.
