EHRNote-ChatQA is the first benchmark for evidence-grounded multi-turn clinical QA over longitudinal discharge summaries, containing 16,072 medical-expert-verified pairs across eight categories and revealing LLM weaknesses in evidence grounding and multi-turn consistency.
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities
Mixed citation behavior. Most common role is background (55%).
abstract
In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements and Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost. Taken together, the Gemini 2.X model generation spans the full Pareto frontier of model capability vs cost, allowing users to explore the boundaries of what is possible with complex agentic problem solving.
representative citing papers
HKJudge is a new ~290k-sentence expert-annotated corpus of Hong Kong criminal judgments with 26 rhetorical roles and 3 sentencing elements, plus benchmarks on classification and extraction tasks.
Introduces the first longitudinal voice dataset for RRP with benchmarks across handcrafted features, deep networks, self-supervised models, and audio LLMs under patient-level validation.
VideoFDB is a new benchmark and LM-as-judge framework for evaluating full-duplex audio-visual-to-audio-visual conversational agents on nonverbal dynamics from real video calls.
EgoIntrospect provides the first egocentric dataset with self-annotations for internal state tasks and shows multimodal LLMs struggle to infer subjective states from combined signals.
Persona vectors form within the first 0.22% of LLM pretraining and remain effective for steering post-trained models, with continued refinement and transfer to other models.
Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.
M³Att poisons medical multimodal RAG by pairing covert textual misinformation with query-agnostic visual perturbations that increase retrieval of the bad content, causing LLMs to generate clinically plausible but incorrect responses.
Omni-DeepSearch is a 640-sample benchmark for audio-driven omni-modal search where the best model reaches only 43.44% accuracy, exposing bottlenecks in audio inference, tool use, and cross-modal reasoning.
TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.
S1-VL combines structured scientific reasoning with iterative image manipulation via code execution to reach state-of-the-art results on visual and scientific reasoning benchmarks.
MM-JudgeBench shows substantial cross-lingual performance variance in 22 LVLM judges, with model size and architecture as poor predictors of multilingual robustness.
HalluAudio is the first large-scale benchmark spanning speech, environmental sound, and music that uses human-verified QA pairs, adversarial prompts, and mixed-audio tests to measure hallucinations in large audio-language models.
VLMs hallucinate by prioritizing contradictory on-screen text over visual content; the paper addresses this with the VisualTextTrap benchmark (6,057 human-validated samples) and the VTHM-MoE dual-encoder framework, which uses dimension-specific experts and adaptive routing.
Large language models display the identifiable victim effect at roughly twice the human baseline, strongly amplified by instruction tuning and chain-of-thought prompting but inverted by reasoning-specialized models.
MMRareBench provides 1,756 QA pairs and 7,958 images from PMC rare-disease cases to evaluate 23 MLLMs, revealing low treatment-planning scores and medical models underperforming general models on multi-image tasks due to capacity dilution.
HM-Bench is the first benchmark for MLLMs on hyperspectral images, showing models struggle with complex spatial-spectral reasoning and perform better with visual PCA images than textual reports.
DialBGM is a new benchmark dataset revealing that existing AI models fall far short of human performance when recommending fitting background music for open-domain conversations.
V2X-QA provides a view-decoupled benchmark showing infrastructure views aid macroscopic traffic understanding while cooperative reasoning requires explicit cross-view alignment, with V2X-MoE as a routing-based baseline that improves performance.
ScreenParse dataset and ScreenVLM model deliver dense screen parsing that outperforms larger VLMs on PageIoU and transfers to better UI grounding.
EgoSound is a new benchmark with 7,315 QA pairs across seven tasks to evaluate egocentric sound understanding in multimodal large language models.
VLRS-Bench is the first benchmark dedicated to complex vision-language reasoning in remote sensing, with 2,000 QA pairs across 14 tasks in cognition, decision, and prediction dimensions.
Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.
ConceptPose delivers state-of-the-art zero-shot relative pose estimation by matching open-vocabulary 3D concept vectors derived from VLM saliency maps, beating the strongest baseline by 62% in ADD(-S) without training.
citing papers explorer
-
UniTranslator: A Unified Multi-modal Framework for End-to-end In-Image Machine Translation
UniTranslator adds an Understand-Generation Alignment Module and Spatial Mask Decoder to a unified multimodal model to fix translation inconsistency and spatial misalignment in in-image machine translation, reporting SOTA results on multiple benchmarks.
-
Token-to-Token Alignment of Text Embeddings for Semantic Blending
Token-to-Token alignment rephrases prompts into shared structure then matches token embeddings by semantic similarity, making linear interpolation a meaningful operation for blending in text-to-image models.
-
Music Playlist Captioning at Scale with Large Language Models
Deezer deployed an LLM-driven playlist captioning system in 2025 for its Daily Mix recommendations, claiming significant gains in user engagement from the added natural-language descriptions.
-
Multi-LCB: Extending LiveCodeBench to Multiple Programming Languages
Multi-LCB extends LiveCodeBench to 12 languages by translating Python tasks, revealing Python overfitting and performance disparities when evaluating 24 LLMs.
-
ASTRA: A Scalable Next-Generation ATCO Training Simulator with Autonomous Simpilots
ASTRA automates simpilot roles in ATCO training with a fine-tuned ASR pipeline that cuts WER to 23.45% on Singaporean aviation speech and an AI evaluator scoring 86.9-91.7% on accuracy, brevity, and completeness.
-
Ling and Ring 2.6 Technical Report: Efficient and Instant Agentic Intelligence at Trillion-Parameter Scale
Technical report announcing Ling-2.6 and Ring-2.6 models with hybrid linear attention, evolutionary CoT, and KPop RL for efficient agentic intelligence at scale.
-
InternVideo3: Agentify Foundation Models with Multimodal Contextual Reasoning
InternVideo3 introduces Multimodal Contextual Reasoning and M^2LA attention to enable closed-loop evidence accumulation in long-video understanding and agentic tool use, reporting strong benchmark results.
-
MSUE: Multi-Modal Soccer Understanding Expert
MSUE routes questions via LLM to text/image/video experts and an external KB after VLM-driven data synthesis, achieving 0.95 accuracy on SoccerNet VQA.
-
Language-Driven Cost Optimization for Autonomous Driving
An LLM interprets user language to set the parameters of a risk-aware MPPI controller, with human-in-the-loop validation, for adaptive autonomous driving behavior.
-
Task Robustness via Re-Labelling Vision-Action Robot Data
TREAD augments robotics datasets via VLM-based sub-task generation, video segmentation, and linguistic diversity to improve policy generalization on novel tasks in LIBERO benchmarks.
-
Building Customer Support AI Agents at 100M-User Scale: An Evaluation-Driven Framework
An evaluation-driven framework for customer support AI agents at Nubank integrates context engineering, LLM judges, and A/B testing to deliver up to 37pp NPS gains and strong offline-online correlation across five production domains.
-
IEA: Amateur-Friendly Conversational Image Editing Agent via Three Stages of Multitask Alignment
IEA is a tool-calling VLM for conversational image editing trained in three multitask stages that reports lower pixel distance, higher ROUGE-L, and top user-study rankings versus baselines.
-
Neutrality Bites: Gender Representation in AI-Generated Animal Stories
LLMs exhibit masculine bias when assigning gender to animal characters in generated stories, with neutrality often resulting in erasure of feminine perspectives.
-
Beyond Semantic Dominance: Cognitive Affective Reasoning and Empathetic Response Alignment in Audio Language Models
CogAudio-LLM introduces LIME-440K dataset, EIPS chain-of-thought reasoning, and DR-SAPO optimization to address semantic dominance and improve affective responses in audio language models.
-
Breaking the Lock-in: Diversifying Text-to-Image Generation via Representation Modulation
Early DC component convergence in text-to-image Transformer features causes output homogeneity; selective early attenuation via DAVE improves diversity without retraining or extra cost.
-
UNIVID: Unified Vision-Language Model for Video Moderation
UNIVID generates policy-aware captions for video moderation, reducing violation leakage by 42.7% and overkill rate by 37.0% while replacing over 1,000 policy-specific models with a single backbone.
-
Libra: Efficient Resource Management for Agentic RL Post-Training
Libra optimizes GPU allocation across rollout and training in agentic RL via an elastic hybrid pool and C-MLFQ scheduler based on tool-return causal signals, claiming up to 3.0x throughput and 2.5x faster reward convergence on 48 A800 GPUs.
-
Self-Distilled Policy Gradient
SDPG combines group-relative verifier advantages, normalized standard deviation, full-vocabulary on-policy self-distillation, and reference-policy KL regularization to improve stability and performance over RLVR and self-distillation baselines in language model RL.
-
Training-Free Composed Video Retrieval via Visual Representation-Guided Video-LLM Reasoning
Training-free composed video retrieval pipeline using DINOv3 for candidate selection and video-LLM reasoning achieves 48.78 Recall@1 and 51.48 Recall@5 on the CVPR 2026 challenge test set.
-
WALL-WM: Carving World Action Modeling at the Event Joints
WALL-WM introduces event-grounded Vision-Language-Action pretraining that uses semantic events as the atomic unit to address granularity mismatch in world action models and reports state-of-the-art generalization.
-
Effects of Varying LLM Access on Essay Writing Behavior
Pilot experiment shows limited LLM access maintains higher student ownership and strategic use than unlimited access, with no difference in essay quality.
-
Towards Streaming Synchronized Spatial Audio Generation via Autoregressive Diffusion Transformer
SwanSphere introduces a causal autoregressive diffusion transformer architecture with SVAC contrastive learning and ODPO optimization for streaming spatial audio generation from video and text.
-
Geometric Erasure by Contrastive Velocity Matching in Rectified Flows
GEM bridges trajectory-based unlearning and teacher-guided erasure to create a geometric guidance objective for targeted concept suppression in Rectified Flow models.
-
Label Over Logic? How Source Cues Bias Human Fallacy Judgments More Than LLMs
Humans exhibit greater source-label bias in logical fallacy judgments than LLMs, which maintain more consistent evaluations regardless of source cues.
-
OmniVerifier-M1: Multimodal Meta-Verifier with Explicit Structured Recalibration
OmniVerifier-M1 is a generalist visual verifier using symbolic outputs for meta-verification and decoupled RL to outperform joint optimization for robust verification and agentic self-correction.
-
Audio-Mind: An Auditable Agentic Framework for Audio Understanding
Audio-Mind introduces a conditional, auditable agentic framework for audio understanding that preserves frontend judgment and acquires bounded external evidence only when needed, reporting 80.4% on MMAR and 82.8% on MSU-Bench.
-
Measuring Progress Toward AGI: A Cognitive Framework
The paper introduces a 10-faculty Cognitive Taxonomy and a held-out task protocol to generate cognitive profiles for measuring AI progress toward AGI.
-
Gemini Embedding 2: A Native Multimodal Embedding Model from Gemini
A native multimodal embedding model from Gemini achieves reported state-of-the-art results on retrieval benchmarks across modalities via large-scale contrastive learning.
-
MuChator: Enabling Active Music Discovery via Conversational Music LLMs in Douyin Music
MuChator introduces a three-component MusicLLM system (staged knowledge pre-training, automated triplet instruction tuning, hybrid RM with GRPO) that outperforms Gemini-3-Pro on internal datasets and yields 46.49% higher user active days after deployment on Douyin Music.
-
VEN-VL: A Visual Ensemble MoE Framework for Effective and Efficient Multi-Modal Understanding
VEN-VL introduces an enrich-then-compact visual ensemble MoE approach claiming superior performance-efficiency trade-off in multimodal tasks using fewer condensed visual tokens.
-
VectorArk: Learning Practical Image Vectorization with Rounded Polygon Representation
A vision-language model for robust image vectorization via rounded polygon primitives and input degradation simulation.
-
Tracing the ongoing emergence of human-like reasoning in Large Language Models
LLMs function as accurate semantic processors for conditionals but do not replicate the pragmatic inferences that define human reasoning.
-
Operating Within the Operational Design Domain: Zero-Shot Perception with Vision-Language Models
Vision-language models can serve as zero-shot ODD sensors for autonomous driving when using definition-anchored chain-of-thought prompting with persona decomposition.
-
JoyAI-Image: Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation
JoyAI-Image unifies visual understanding and generation via an MLLM-MMDiT architecture with spatial training signals to reach competitive benchmark performance and stronger spatial intelligence.
-
HealthNLP_Retrievers at ArchEHR-QA 2026: Cascaded LLM Pipeline for Grounded Clinical Question Answering
A cascaded LLM pipeline for grounded question answering over electronic health records achieved competitive rankings in the ArchEHR-QA 2026 shared task.
-
LARA: Validation-Driven Agentic Supercomputer Workflows for Atomistic Modeling
LARA-HPC introduces a validation-first agentic system with dry-run verification and multi-phase refinement that improves robustness of AI-generated DFT workflows on HPC systems.
-
From Handwriting to Structured Data: Benchmarking AI Digitisation of Handwritten Forms
Frontier multimodal LLMs achieve ~85% accuracy and ~90% weighted F1 on digitizing complex handwritten medical forms, with Gemini 3.1 strongest overall and prompt optimization lifting macro metrics over 60%.
-
LLM-Based Automated Diagnosis Of Integration Test Failures At Google
Auto-Diagnose applies LLMs to summarize and diagnose root causes of integration test failures, reporting 90.14% accuracy on 71 manual cases and positive adoption after Google-wide rollout.
-
ClinQueryAgent: A Conversational Agent for Population Health Management
The paper introduces ClinQueryAgent, a conversational agent that converts natural language queries into database queries for population health management while keeping patient data secure, and reports its use by 128 staff across 15 NHS practices covering 148,319 patients.
-
Retrieve Only Relevant Tables Whether Few or Many: Adaptive Table Retrieval Method
An adaptive thresholding mechanism combined with sliding-window reranking retrieves a query-dependent number of tables from large corpora, improving retrieval and downstream text-to-SQL performance on Spider, BIRD, and Spider 2.0.
-
DietDelta: A Vision-Language Approach for Dietary Assessment via Before-and-After Images
DietDelta uses vision-language prompts on paired before-and-after RGB images to localize food items, estimate their weights, and compute consumption differences, reporting better results than prior single-image methods on three public datasets.
-
OmniFysics: Towards Physical Intelligence Evolution via Omni-Modal Signal Processing and Network Optimization
OmniFysics is an omni-modal network using a dynamic physical data engine and evolutive tuning to improve performance on multimodal benchmarks and physics-oriented tasks.
-
Multi-Dimensional Knowledge Profiling with Large-Scale Literature Database and Hierarchical Retrieval
Large-scale profiling of recent AI literature shows growth in safety, multimodal reasoning, and agent studies alongside stabilization in neural machine translation and graph methods.
-
AI for Mathematics: Progress, Challenges, and Prospects
AI for math combines task-specific architectures and general foundation models to support research and advance AI reasoning capabilities.
-
Agentic Reasoning for Large Language Models
The survey structures agentic reasoning for LLMs into foundational, self-evolving, and collective multi-agent layers while distinguishing in-context orchestration from post-training optimization and reviewing applications across domains.
-
MUSEKG: A Knowledge Graph Over Museum Collections
MuseKG builds a typed knowledge graph over museum collections to support natural-language queries and relation-aware exploration of objects, people, images, and extracted entities.
-
Quantifying the Climate Risk of Generative AI: Region-Aware Carbon Accounting with G-TRACE and the AI Sustainability Pyramid
G-TRACE provides region-aware estimates of GenAI carbon emissions including 4309 MWh and 2068 tCO2 for a 2024-2025 image generation trend, paired with a seven-level AI Sustainability Pyramid for policy guidance.
-
Evaluating Reasoning Models for Queries with Presuppositions
Reasoning models achieve only 2-11% higher accuracy than non-reasoning models when handling queries with false presuppositions, failing to challenge 26-42% of them and remaining sensitive to presupposition strength.
-
Audio Editing in the Era of Foundation Models: A Survey
A survey that presents a unified taxonomy of audio editing tasks, summarizes training-based and training-free foundation model approaches, reviews datasets and evaluation protocols, and identifies future challenges.
-
An Analysis of the Effectiveness of Synthetic Speech Data for ASR Fine-tuning in Selected Indic Languages
Empirical study measuring ASR performance gains from synthetic speech augmentation in three Indic languages, varying script sources, synthesis models, and cloned voice counts.