EHRNote-ChatQA is the first benchmark for evidence-grounded multi-turn clinical QA over longitudinal discharge summaries, containing 16,072 medical-expert-verified pairs across eight categories and revealing LLM weaknesses in evidence grounding and multi-turn consistency.
mega hub Mixed citations
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities
Mixed citation behavior. Most common role is background (55%).
abstract
In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements and Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost. Taken together, the Gemini 2.X model generation spans the full Pareto frontier of model capability vs cost, allowing users to explore the boundaries of what is possible with complex agentic problem solving.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. G
authors
mega hub controls
Recognition alignment
counterfactual ablation
co-cited works
representative citing papers
HKJudge is a new ~290k-sentence expert-annotated corpus of Hong Kong criminal judgments with 26 rhetorical roles and 3 sentencing elements, plus benchmarks on classification and extraction tasks.
Introduces the first longitudinal voice dataset for RRP with benchmarks across handcrafted features, deep networks, self-supervised models, and audio LLMs under patient-level validation.
VideoFDB is a new benchmark and LM-as-judge framework for evaluating full-duplex audio-visual-to-audio-visual conversational agents on nonverbal dynamics from real video calls.
EgoIntrospect provides the first egocentric dataset with self-annotations for internal state tasks and shows multimodal LLMs struggle to infer subjective states from combined signals.
Persona vectors form within the first 0.22% of LLM pretraining and remain effective for steering post-trained models, with continued refinement and transfer to other models.
Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.
M³Att poisons medical multimodal RAG by pairing covert textual misinformation with query-agnostic visual perturbations that increase retrieval of the bad content, causing LLMs to generate clinically plausible but incorrect responses.
Omni-DeepSearch is a 640-sample benchmark for audio-driven omni-modal search where the best model reaches only 43.44% accuracy, exposing bottlenecks in audio inference, tool use, and cross-modal reasoning.
TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.
S1-VL combines structured scientific reasoning with iterative image manipulation via code execution to reach state-of-the-art results on visual and scientific reasoning benchmarks.
MM-JudgeBench shows substantial cross-lingual performance variance in 22 LVLM judges, with model size and architecture as poor predictors of multilingual robustness.
HalluAudio is the first large-scale benchmark spanning speech, environmental sound, and music that uses human-verified QA pairs, adversarial prompts, and mixed-audio tests to measure hallucinations in large audio-language models.
VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-specific experts and adaptive routing.
Large language models display the identifiable victim effect at roughly twice the human baseline, strongly amplified by instruction tuning and chain-of-thought prompting but inverted by reasoning-specialized models.
MMRareBench provides 1,756 QA pairs and 7,958 images from PMC rare-disease cases to evaluate 23 MLLMs, revealing low treatment-planning scores and medical models underperforming general models on multi-image tasks due to capacity dilution.
HM-Bench is the first benchmark for MLLMs on hyperspectral images, showing models struggle with complex spatial-spectral reasoning and perform better with visual PCA images than textual reports.
DialBGM is a new benchmark dataset revealing that existing AI models fall far short of human performance when recommending fitting background music for open-domain conversations.
V2X-QA provides a view-decoupled benchmark showing infrastructure views aid macroscopic traffic understanding while cooperative reasoning requires explicit cross-view alignment, with V2X-MoE as a routing-based baseline that improves performance.
ScreenParse dataset and ScreenVLM model deliver dense screen parsing that outperforms larger VLMs on PageIoU and transfers to better UI grounding.
EgoSound is a new benchmark with 7315 QA pairs across seven tasks to evaluate egocentric sound understanding in multimodal large language models.
VLRS-Bench is the first benchmark dedicated to complex vision-language reasoning in remote sensing, with 2000 QA pairs across 14 tasks in cognition, decision, and prediction dimensions.
Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.
ConceptPose delivers state-of-the-art zero-shot relative pose estimation by matching open-vocabulary 3D concept vectors derived from VLM saliency maps, beating the strongest baseline by 62% in ADD(-S) without training.
citing papers explorer
-
LVSum: A Benchmark for Timestamp-Aware Long Video Summarization
LVSum is a new benchmark for timestamp-aware long video summarization that exposes systematic temporal gaps in existing multimodal large language models.
-
FREE-Switch: Frequency-based Dynamic LoRA Switch for Style Transfer
FREE-Switch dynamically switches LoRA adapters using frequency importance per diffusion step and adds semantic alignment to reduce content drift when merging specialized image generators.
-
Demographic and Linguistic Bias Evaluation in Omnimodal Language Models
Omnimodal models show reduced demographic bias in image and video tasks compared to substantial biases and lower performance in audio tasks.
-
MedLVR: Latent Visual Reasoning for Reliable Medical Visual Question Answering
MedLVR interleaves latent visual reasoning segments in autoregressive decoding and uses two-stage training to raise average medical VQA accuracy from 48.3% to 53.4% over a Qwen2.5-VL-7B backbone on OmniMedVQA and five other benchmarks.
-
Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models
HDPO reframes tool efficiency as a conditional objective within accurate trajectories, enabling Metis to reduce tool invocations by orders of magnitude while raising reasoning accuracy.
-
OpenVLThinkerV2: A Generalist Multimodal Reasoning Model for Multi-domain Visual Tasks
OpenVLThinkerV2 applies a new Gaussian GRPO training objective with response and entropy shaping to outperform prior open-source and proprietary models on 18 visual reasoning benchmarks.
-
BLaDA: Bridging Language to Functional Dexterous Actions within 3DGS Fields
BLaDA grounds open-vocabulary language into functional dexterous manipulation via knowledge-guided parsing, triangular localization in 3DGS fields, and keypoint grasp execution.
-
A Decomposition Perspective to Long-context Reasoning for LLMs
Decomposing long-context reasoning into atomic skills, synthesizing targeted pseudo-datasets, and applying RL improves LLM performance on long-context benchmarks by an average of 7.7%.
-
OpenSpatial: A Principled Data Engine for Empowering Spatial Intelligence
OpenSpatial supplies a principled open-source data engine and 3-million-sample dataset that raises spatial-reasoning model performance by an average of 19 percent on benchmarks.
-
AudioKV: KV Cache Eviction in Efficient Large Audio Language Models
AudioKV prioritizes audio-critical attention heads identified via ASR analysis and applies spectral score smoothing to evict KV cache tokens, achieving high compression with minimal accuracy loss in LALMs.
-
Understanding Performance Gap Between Parallel and Sequential Sampling in Large Reasoning Models
Lack of exploration from conditioning on prior answers is the primary reason parallel sampling outperforms sequential sampling in large reasoning models.
-
SALLIE: Safeguarding Against Latent Language & Image Exploits
SALLIE detects jailbreaks in text and vision-language models by extracting residual stream activations, scoring maliciousness per layer with k-NN, and ensembling predictions, outperforming baselines on multiple datasets.
-
Yet Even Less Is Even Better For Agentic, Reasoning, and Coding LLMs
STITCH trains superior agentic coding and reasoning LLMs by using fewer high-quality trajectories filtered to keep only critical decision tokens, delivering up to 63% relative gains on SWE-bench Verified.
-
Large Language Models for Math Education in Low-Resource Languages: A Study in Sinhala and Tamil
LLMs handle basic arithmetic reliably in Sinhala and Tamil but show clear performance drops on complex math tasks compared to English.
-
Towards Explainable Industrial Anomaly Detection via Knowledge-Guided Latent Reasoning
Reason-IAD improves explainable industrial anomaly detection by combining retrieval-augmented category knowledge with entropy-guided latent reasoning and dynamic visual patch injection in MLLMs.
-
Embodied Task Planning via Graph-Informed Action Generation with Large Language Models
GiG uses a Graph-in-Graph architecture with GNN-encoded states, experience memory retrieval, and bounded symbolic lookahead to improve LLM planning on embodied benchmarks with gains up to 37%.
-
CG-MLLM: Captioning and Generating 3D content via Multi-modal Large Language Models
CG-MLLM is a multimodal LLM using a Mixture-of-Transformer architecture with separate TokenAR and BlockAR components integrated with a pre-trained vision-language backbone and 3D VAE to enable 3D captioning and high-fidelity generation.
-
Concise Geometric Description as a Bridge: Unleashing the Potential of LLM for Plane Geometry Problem Solving
An MLLM interpreter generates concise CDL descriptions from diagrams, enabling an off-the-shelf LLM to solve plane geometry problems competitively after training on only 5.5k examples.
-
Reward-Forcing: Autoregressive Video Generation with Reward Feedback
Reward-Forcing guides autoregressive video generation with reward feedback to achieve performance comparable to teacher-dependent methods on benchmarks like VBench without relying on distillation.
-
LLM-based Multimodal Feedback Produces Equivalent Learning and Better Student Perceptions than Educator Feedback
LLM-based multimodal feedback matches educator feedback in learning outcomes but exceeds it in student perceptions of quality, engagement, and reduced cognitive load.
-
Enhancing Large Language Model-Based Systems for End-to-End Circuit Analysis Problem Solving
Hybrid pipeline using YOLO vision and ngspice verification raises circuit analysis accuracy from Gemini's 79.52% baseline to 97.59%, with similar gains on hand-drawn diagrams.
-
OneThinker: All-in-one Reasoning Model for Image and Video
OneThinker unifies image and video reasoning in one model across 10 tasks via a 600k corpus, CoT-annotated SFT, and EMA-GRPO reinforcement learning, reporting strong results on 31 benchmarks plus some cross-task transfer.
-
AgroCoT: A Chain-of-Thought Benchmark for Evaluating Reasoning in Vision-Language Models for Agriculture
AgroCoT is a new Chain-of-Thought VQA benchmark with 4759 samples to evaluate reasoning capabilities of vision-language models in agriculture.
-
DisCEdge: Distributed Context Management for Large Language Models at the Edge
DisCEdge manages LLM context in tokenized form replicated on edge nodes, delivering up to 14.46% faster median responses, 15% lower sync overhead, and 90% smaller client requests versus baselines while ensuring consistency.
-
AVATAAR: Agentic Video Answering via Temporal Adaptive Alignment and Reasoning
AVATAAR reports relative gains of 5-8% over baseline on CinePile benchmark categories through agentic feedback for long video QA.
-
Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm
Video generation models demonstrate competitive multimodal reasoning on a new benchmark, matching or exceeding VLMs on visual puzzles and achieving 92% on MATH and 69.2% on MMMU.
-
PRISM: Probabilistic Runtime Insights and Scalable Performance Modeling for Large-Scale Distributed Training
PRISM introduces a probabilistic performance modeling framework that quantifies guarantees on training time for large-scale distributed systems under runtime variability.
-
SongFormer: Scaling Music Structure Analysis with Heterogeneous Supervision
SongFormer achieves state-of-the-art strict boundary detection and functional label accuracy in music structure analysis by fusing SSL representations and using learned source embeddings on a new 14k-song corpus and expert benchmark.
-
When Silence Matters: The Impact of Irrelevant Audio on Text Reasoning in Large Audio-Language Models
Irrelevant audio including silence reduces accuracy and increases volatility in text reasoning for large audio-language models, with effects worsening at longer durations, higher amplitudes, and higher temperatures.
-
What Is The Political Content in LLMs' Pre- and Post-Training Data?
Training data for open LLMs is systematically left-leaning, with pre-training corpora containing more political material than post-training data and model stances aligning with data distributions.
-
Video Parallel Scaling: Aggregating Diverse Frame Subsets for VideoLLMs
Video Parallel Scaling improves VideoLLM performance by aggregating outputs from parallel inferences on complementary disjoint frame subsets, effectively contracting the Chinchilla scaling law via uncorrelated visual evidence.
-
UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
UI-TARS-2 reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld while attaining 59.8 mean normalized score on a 15-game suite through multi-turn RL and scalable data generation.
-
Self-Rewarding Vision-Language Model via Reasoning Decomposition
Vision SR1 decomposes VLM reasoning into visual and language components and uses internal self-rewards to improve visual reasoning and reduce hallucinations more efficiently than external-supervision methods.
-
DiscussLLM: Teaching Large Language Models When to Speak
DiscussLLM introduces a two-stage synthetic data pipeline to annotate multi-turn discussions with five intervention types and trains LLMs to time contributions via a silent token or proactive responses.
-
ChemDFM-R: A Chemical Reasoning LLM Enhanced with Atomized Chemical Knowledge
ChemDFM-R is a chemical reasoning LLM trained via a four-stage pipeline on the ChemFG dataset of functional-group annotations for molecules and reactions, reaching performance comparable to or better than commercial models on chemical benchmarks.
-
CoRe: Combined Rewards with Vision-Language Model Feedback for Preference-Aligned Reinforcement Learning
CoRe combines VLM-designed formal rewards with VLM-labeled residual rewards to produce preference-aligned policies on robotic manipulation tasks.
-
SimpleSearch-VL: A Simple Recipe for Multimodal Agentic Deep Search
SimpleSearch-VL improves Qwen3-VL multimodal agent baselines by 15.8-16 points on average using 7K total training examples and reaches parity with Gemini-3-Pro on the 30B variant.
-
Uncertainty-Aware Generation and Decision-Making Under Ambiguity
Uncertainty-aware algorithms based on Bayesian decision theory improve generation utility on tutoring and reviewing tasks while risk-averse methods can degrade performance under high ambiguity, with conformal prediction providing guarantees.
-
Illuminating Unified Multimodal Model for Free-form Interleaved Text-Image Generation
ILLUME-X is a unified multimodal model that generates free-form interleaved text-image sequences via an expanded data pipeline, progressive self-adaptive training, and ILScore evaluation, claiming outperformance over prior unified models on style transfer, image decomposition, and storytelling.
-
MAVIN: Multi-Shot Audio-Visual Generation with Narrative Control
MAVIN proposes boundary-aware attention, ID-aware propagation, a multi-agent scripting pipeline, and the MAVINSet dataset as the first framework for multi-shot audio-visual generation with narrative control, claiming SOTA results.
-
Reward-Free Code Alignment from Pretrained or Fine-Tuned LLM: Unpacking the Trade-offs for Code Generation
Empirical study on five LLMs finds pretrained-to-aligned paths yield bigger gains over baseline than finetuned-to-aligned paths, though absolute accuracy remains lower for pretrained starts.
-
GROVE: Grounded Pedestrian Simulation via Natural Language for Interactive Social Robot Navigation
GROVE combines multiple state-of-the-art modules tuned by user prompts to produce realistic long-, medium-, and short-horizon pedestrian behaviors integrated into robot simulators.
-
S1-Omni-Image: A Unified Model for Scientific Image Understanding, Generation, and Editing
S1-Omni-Image unifies scientific image understanding, generation and editing via a think-before-generate paradigm on top of S1-VL-32B, trained on a 314K-sample SciGenEdit dataset, and reports SOTA results on multiple generation and editing benchmarks.
-
UniTranslator: A Unified Multi-modal Framework for End-to-end In-Image Machine Translation
UniTranslator adds an Understand-Generation Alignment Module and Spatial Mask Decoder to a unified multimodal model to fix translation inconsistency and spatial misalignment in in-image machine translation, reporting SOTA results on multiple benchmarks.
-
Token-to-Token Alignment of Text Embeddings for Semantic Blending
Token-to-Token alignment rephrases prompts into shared structure then matches token embeddings by semantic similarity, making linear interpolation a meaningful operation for blending in text-to-image models.
-
Music Playlist Captioning at Scale with Large Language Models
Deezer deployed an LLM-driven playlist captioning system in 2025 for its Daily Mix recommendations, claiming significant gains in user engagement from the added natural-language descriptions.
-
Multi-LCB: Extending LiveCodeBench to Multiple Programming Languages
Multi-LCB extends LiveCodeBench to 12 languages by translating Python tasks, revealing Python overfitting and performance disparities when evaluating 24 LLMs.
-
ASTRA: A Scalable Next-Generation ATCO Training Simulator with Autonomous Simpilots
ASTRA automates simpilot roles in ATCO training with a fine-tuned ASR pipeline that cuts WER to 23.45% on Singaporean aviation speech and an AI evaluator scoring 86.9-91.7% on accuracy, brevity, and completeness.
-
Ling and Ring 2.6 Technical Report: Efficient and Instant Agentic Intelligence at Trillion-Parameter Scale
Technical report announcing Ling-2.6 and Ring-2.6 models with hybrid linear attention, evolutionary CoT, and KPop RL for efficient agentic intelligence at scale.
-
InternVideo3: Agentify Foundation Models with Multimodal Contextual Reasoning
InternVideo3 introduces Multimodal Contextual Reasoning and M^2LA attention to enable closed-loop evidence accumulation in long-video understanding and agentic tool use, reporting strong benchmark results.