EHRNote-ChatQA is the first benchmark for evidence-grounded multi-turn clinical QA over longitudinal discharge summaries, containing 16,072 medical-expert-verified pairs across eight categories and revealing LLM weaknesses in evidence grounding and multi-turn consistency.
super hub Mixed citations
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities
Mixed citation behavior. Most common role is background (55%).
abstract
In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements and Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost. Taken together, the Gemini 2.X model generation spans the full Pareto frontier of model capability vs cost, allowing users to explore the boundaries of what is possible with complex agentic problem solving.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. G
authors
co-cited works
representative citing papers
HKJudge is a new ~290k-sentence expert-annotated corpus of Hong Kong criminal judgments with 26 rhetorical roles and 3 sentencing elements, plus benchmarks on classification and extraction tasks.
Introduces the first longitudinal voice dataset for RRP with benchmarks across handcrafted features, deep networks, self-supervised models, and audio LLMs under patient-level validation.
VideoFDB is a new benchmark and LM-as-judge framework for evaluating full-duplex audio-visual-to-audio-visual conversational agents on nonverbal dynamics from real video calls.
EgoIntrospect provides the first egocentric dataset with self-annotations for internal state tasks and shows multimodal LLMs struggle to infer subjective states from combined signals.
Persona vectors form within the first 0.22% of LLM pretraining and remain effective for steering post-trained models, with continued refinement and transfer to other models.
Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.
M³Att poisons medical multimodal RAG by pairing covert textual misinformation with query-agnostic visual perturbations that increase retrieval of the bad content, causing LLMs to generate clinically plausible but incorrect responses.
Omni-DeepSearch is a 640-sample benchmark for audio-driven omni-modal search where the best model reaches only 43.44% accuracy, exposing bottlenecks in audio inference, tool use, and cross-modal reasoning.
TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.
S1-VL combines structured scientific reasoning with iterative image manipulation via code execution to reach state-of-the-art results on visual and scientific reasoning benchmarks.
MM-JudgeBench shows substantial cross-lingual performance variance in 22 LVLM judges, with model size and architecture as poor predictors of multilingual robustness.
HalluAudio is the first large-scale benchmark spanning speech, environmental sound, and music that uses human-verified QA pairs, adversarial prompts, and mixed-audio tests to measure hallucinations in large audio-language models.
VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-specific experts and adaptive routing.
Large language models display the identifiable victim effect at roughly twice the human baseline, strongly amplified by instruction tuning and chain-of-thought prompting but inverted by reasoning-specialized models.
MMRareBench provides 1,756 QA pairs and 7,958 images from PMC rare-disease cases to evaluate 23 MLLMs, revealing low treatment-planning scores and medical models underperforming general models on multi-image tasks due to capacity dilution.
HM-Bench is the first benchmark for MLLMs on hyperspectral images, showing models struggle with complex spatial-spectral reasoning and perform better with visual PCA images than textual reports.
DialBGM is a new benchmark dataset revealing that existing AI models fall far short of human performance when recommending fitting background music for open-domain conversations.
V2X-QA provides a view-decoupled benchmark showing infrastructure views aid macroscopic traffic understanding while cooperative reasoning requires explicit cross-view alignment, with V2X-MoE as a routing-based baseline that improves performance.
ScreenParse dataset and ScreenVLM model deliver dense screen parsing that outperforms larger VLMs on PageIoU and transfers to better UI grounding.
EgoSound is a new benchmark with 7315 QA pairs across seven tasks to evaluate egocentric sound understanding in multimodal large language models.
VLRS-Bench is the first benchmark dedicated to complex vision-language reasoning in remote sensing, with 2000 QA pairs across 14 tasks in cognition, decision, and prediction dimensions.
Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.
ConceptPose delivers state-of-the-art zero-shot relative pose estimation by matching open-vocabulary 3D concept vectors derived from VLM saliency maps, beating the strongest baseline by 62% in ADD(-S) without training.
citing papers explorer
-
SemanticOpt: Towards LLM-Based Semantic Black-Box Optimization
SemanticOpt fine-tunes LLMs on structured Bayesian optimization trajectories augmented with natural-language context to jointly use numerical and semantic evidence for black-box optimization.
-
From Tokens to Layers: Redefining Stall-Free Scheduling for MoE Serving with Layered Prefill
Layered prefill replaces token-chunked prefill with layer-group interleaving in MoE models, cutting TTFT by up to 70%, end-to-end latency by 41%, and per-token energy by 22% while preserving stall-free TBT.
-
Test-Time Matching: Unlocking Compositional Reasoning in Multimodal Models
Introduces group matching score for better evaluation of compositional reasoning and Test-Time Matching (TTM) algorithm for unsupervised self-improvement in multimodal models, achieving SOTA gains including surpassing GPT-4.1 and estimated human performance.
-
Inferring Dynamic Physical Properties from Video Foundation Models
Video foundation models infer dynamic physical properties such as elasticity, viscosity, and friction from videos at levels close to classical oracles while outperforming current MLLMs with suitable prompting.
-
Self-Forcing++: Towards Minute-Scale High-Quality Video Generation
Self-Forcing++ scales autoregressive video diffusion to over 4 minutes by using self-generated segments for guidance, reducing error accumulation and outperforming baselines in fidelity and consistency.
-
AgenticEval: Toward Agentic and Self-Evolving Safety Evaluation of Large Language Models
AgenticEval is a multi-agent framework that ingests unstructured policies to generate and self-evolve comprehensive safety benchmarks for LLMs, with experiments showing declining safety rates as tests harden.
-
Structured In-context Environment Scaling for Large Language Model Reasoning
SIE framework automatically constructs scalable, verifiable reasoning environments from structured data, improving in-domain performance and enabling generalization to out-of-domain math and logic tasks.
-
ChatInject: Abusing Chat Templates for Prompt Injection in LLM Agents
ChatInject exploits LLM chat template structures to boost indirect prompt injection success rates on agents from ~5-15% to 32-52% across benchmarks, with multi-turn persuasion variants performing best.
-
MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing
MinerU2.5 uses a two-stage decoupled vision-language architecture to achieve state-of-the-art document parsing accuracy with lower computational overhead than existing general and domain-specific models.
-
Retrieval-of-Thought: Efficient Reasoning via Reusing Thoughts
Retrieval-of-Thought organizes prior reasoning into a thought graph for retrieval and reward-guided recombination, reducing output tokens by up to 40% and latency by 82% while preserving accuracy on reasoning benchmarks.
-
Tiny but Mighty: A Software-Hardware Co-Design Approach for Efficient Multimodal Inference on Battery-Powered Small Devices
Nanomind decomposes LMMs into modular bricks mapped to heterogeneous accelerators with TABM zero-copy transfers, fused low-bit kernels, and a battery-aware scheduler, cutting energy 42.3% and enabling 18.8-hour runtime on a 2000 mAh battery for LLaVA-OneVision-Qwen2-0.5B.
-
Video models are zero-shot learners and reasoners
Generative video models exhibit emergent zero-shot capabilities across perception, manipulation, and basic reasoning tasks.
-
Qwen3-Omni Technical Report
Qwen3-Omni is a unified multimodal model that achieves open-source SOTA on 32 of 36 audio and audio-visual benchmarks and overall SOTA on 22 without degrading performance on text, image, or video relative to single-modal Qwen counterparts.
-
VideoAgent: Personalized Synthesis of Scientific Videos
VideoAgent is a modular framework that redefines scientific video synthesis as an intent-driven planning problem and introduces the SciVidEval benchmark for multimodal quality and pedagogical utility.
-
AU-Harness: An Open-Source Toolkit for Holistic Evaluation of Audio LLMs
AU-Harness introduces an efficient unified evaluation framework for audio LLMs featuring batch optimizations, multi-turn dialogue support, and standardized protocols for fair comparisons.
-
Progressive Multimodal Search and Reasoning for Knowledge-Intensive Visual Question Answering
PMSR progressively constructs structured reasoning trajectories with dual-scope queries and compositional reasoning to improve knowledge acquisition and answer accuracy in knowledge-intensive VQA.
-
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and agentic tasks.
-
GoViG: Goal-Conditioned Visual Navigation Instruction Generation via Multimodal Reasoning
GoViG decomposes goal-conditioned navigation instruction generation into visual state prediction and instruction synthesis using an autoregressive multimodal LLM with one-pass and interleaved reasoning, showing gains on a new R2R-Goal dataset.
-
PrinciplismQA: A Philosophy-Grounded Approach to Assessing LLM-Human Clinical Medical Ethics Alignment
PrinciplismQA benchmark reveals significant gaps in LLMs' clinical ethical reasoning despite high knowledge accuracy.
-
Can Large Language Models Really Recognize Your Name?
LLMs exhibit 20-40% lower recall on ambiguous human names for PII detection, worsening under prompt injections, as shown via the new AmBench benchmark.
-
Phonetic Perturbations Reveal Tokenizer-Rooted Safety Gaps in LLMs
Phonetic perturbations fragment safety-critical tokens in LLMs, suppressing attribution scores while preserving input understanding and causing safety mechanisms to fail despite good comprehension.
-
Uno-Orchestra: Parsimonious Agent Routing via Selective Delegation
A learned orchestration policy for LLM agents that jointly optimizes task decomposition and selective routing to (model, primitive) pairs, delivering 77% macro pass@1 at 10x lower cost than strong baselines across 13 benchmarks.
-
DiffCap-Bench: A Comprehensive, Challenging, Robust Benchmark for Image Difference Captioning
DiffCap-Bench supplies a diverse IDC benchmark with ten categories and LLM judging grounded in human difference lists to evaluate MLLMs more robustly than prior lexical metrics.
-
The Anatomy of Silent Data Corruption: GPU Error Pattern Study and Modeling Guidance
Large-scale GPU fault injection shows NaN/inf outcomes are only 1% of SDC, single-bit flips under 40%, and corruption addresses are periodic, supporting distribution-aware modeling.
-
AlbumFill: Album-Guided Reasoning and Retrieval for Personalized Image Completion
AlbumFill retrieves identity-consistent references from personal albums via VLM-inferred semantic cues to support personalized image completion.
-
Change-Robust Online Spatial-Semantic Topological Mapping
CROSS replaces globally consistent metric maps with a pose-aware topological graph of RGB-D keyframes and maintains a bounded Gaussian-mixture belief over poses via sequential hypothesis testing in SE(3) to achieve change-robust spatial-semantic mapping and navigation.
-
T$^2$PO: Uncertainty-Guided Exploration Control for Stable Multi-Turn Agentic Reinforcement Learning
T²PO improves stability and performance in multi-turn agentic RL by using uncertainty dynamics at token and turn levels to guide exploration and avoid wasted rollouts.
-
LASER: A Corrective Lens for LVLMs via Visual Attention Preservation and Sink Suppression
LASER uses Visual Grounding Reward and Sink Suppression Reward to preserve visual attention trajectories and suppress sink tokens, reducing visual forgetting in LVLMs.
-
Agent4cs: A Multi-agent System for Code Summarization in Large Hierarchical Codebases
Agent4cs deploys summarization, keyword-extraction, and quality-assurance agents in a bottom-up pipeline that raises semantic consistency by 8% and normalized keyword coverage by up to 38% over structured prompting baselines on seven frontier models.
-
How Ethos and Pathos Appeals Resonate in Reader Interpretations of Social Media Messages
Analyses of labeled social media sentences and interpretations show 30% divergence in ethos and pathos, greater variability for charged content, and predictive power for audience attitudes toward the author.
-
Personalization as Inverse Planning: Learning Latent Design Intents for Agentic Slide Generation via Structural Denoising
SPIRE approximates page-level slide personalization by training agents to denoise corrupted slide structures via collaborative RL, claiming a proof of consistency as a surrogate for inverse planning.
-
Beyond Binary Instrument QA: Probing Instrument Grounding in Music Audio-Language Models
Introduces an OpenMIC-derived multi-axis benchmark sequence showing that high binary instrument QA accuracy fails to predict robust grounding, with models showing position bias, confusable errors, and temporal bias.
-
On the Faithfulness of Post-Hoc Concept Bottleneck Models
Post-hoc CBMs produce unfaithful concept projections due to covariate shifts and systematic label noise; new metrics are introduced to measure faithfulness separately from accuracy.
-
Experience Augmented Policy Optimization for LLM Reasoning
EAPO reuses prior RL policy experience adaptively at decision points in LLM rollouts with adapted importance sampling and reports gains over prior RLVR methods on math benchmarks.
-
FlowAWR: Online Adaptive Flow Reinforcement via Advantage-Weighted Rectification
FlowAWR derives an advantage-weighted rectification for optimal velocity fields in flow models, claiming 2-5x faster convergence than DiffusionNFT on SD3.5-Medium.
-
REAR: Test-time Preference Realignment through Reward Decomposition
REAR decomposes the reward into question and preference components, rescales their balance, and expresses the result as a linear combination of token log-probabilities for efficient integration with best-of-N and tree search.
-
Concept Removal Guidance: Evidence-Calibrated Negative Guidance for Safe Diffusion Sampling
CRG adaptively calibrates negative guidance in diffusion models by estimating concept presence from noise predictions at each step to suppress unwanted content while preserving fidelity.
-
The Body as Status: Muscularity, Engagement, and Body Image Risk on #GymTok
Content analysis of #GymTok videos finds positive associations between muscularity, perceived harm, and engagement metrics, implying TikTok algorithms may amplify muscular ideals and risky behaviors.
-
ALM2Vec: Learning Audio Embeddings for Universal Audio Retrieval with Large Audio-Language Models
ALM2Vec learns unified audio embeddings from large audio-language models for text-audio retrieval, instruction-aware retrieval, and other tasks across domains.
-
Mitigating Batch Effects in Histopathology via Language-Mediated Robust Embedding Generation
GLMP generates robust pathology embeddings by routing histology images through an intermediate textual representation produced by general-purpose MLLMs to mitigate batch effects.
-
CogniRoute: Learning to Route Social Evidence in Omni-Modal Models
CogniRoute adds a cognitive schema and route-aware RL to an omni-modal MoE, reaching 59.38% accuracy on a new 118K-example social video QA benchmark and beating prior baselines by 15-27 points.
-
PromptMark: A Prompt-Guided Iterative-Feedback Framework for Source Code Watermarking
PromptMark is a black-box prompt-guided iterative-feedback framework that embeds statistically detectable watermarks in LLM-generated source code via naming patterns while preserving functional correctness.
-
Multi-View Decompilation for LLM-Based Malware Classification
Multi-decompiler prompting improves LLM malware classification F1 by supplying complementary views of the same binary.
-
Qiskit Code Migration with LLMs
A taxonomy-guided RAG system with LLMs reduces hallucinations and improves migration suggestions for Qiskit code compared to unconstrained retrieval.
-
LLMs Struggle to Measure What Distinguishes Students of Different Proficiency Levels: A Study of Item Discrimination in Reading Comprehension Assessment
LLMs achieve maximum Spearman correlations of 0.152 (direct) and 0.241 (response-based) with human item discrimination values, showing non-random but unreliable signal for distinguishing student proficiency.
-
WEQA: Wearable hEalth Question Answering with Query-Adaptive Agentic Reasoning
WEQA proposes a query-adaptive agent framework combining LLMs with wearable data tools, achieving 24% higher accuracy than baselines on a benchmark from four open datasets, with gains in expert-rated usefulness.
-
From Drift to Coherence: Stabilizing Beliefs in LLMs
In multiple-choice QA, LLM beliefs drift early under repeated sampling but self-stabilize; seed-answer prompting and a self-consistency loss reduce drift while preserving accuracy.
-
EnvRL: Learn from Environment Dynamics in Agentic Reinforcement Learning
EnvRL incorporates environment dynamics learning via state prediction and inverse dynamics auxiliary objectives into agentic RL, reporting higher success rates than RL-only baselines on ALFWorld and WebShop.
-
Evaluating Pluralism in LLMs through Latent Perspectives
A domain-agnostic framework extracts perspectives from book reviews showing LLMs underrepresent rarer viewpoints relative to human text.
-
Constrained Semantic Decompression in LLMs through Persian Proverb-Conditioned Story Generation
Introduces PAND dataset for Persian proverbs and reports a persistent decompression gap in LLMs that explicit reasoning partially reduces.