super hub Mixed citations

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Eric Bieber, Gheorghe Comanici, Ice Pasupat, Inderjit Dhillon, Mike Schaekermann, Noveen Sachdeva · 2025 · cs.CL · arXiv 2507.06261

Mixed citation behavior. Most common role is background (55%).

941 Pith papers citing it

Background 55% of classified citations

open full Pith review browse 941 citing papers more from Eric Bieber arXiv PDF

abstract

In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements and Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost. Taken together, the Gemini 2.X model generation spans the full Pareto frontier of model capability vs cost, allowing users to explore the boundaries of what is possible with complex agentic problem solving.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 122 baseline 46 method 28 other 8 dataset 3

citation-polarity summary

background 114 baseline 47 use method 28 unclear 12 support 3 use dataset 3

claims ledger

abstract In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. G

authors

Eric Bieber Gheorghe Comanici Ice Pasupat Inderjit Dhillon Mike Schaekermann Noveen Sachdeva

co-cited works

representative citing papers

HKJudge: A Legal Discourse-Annotated Corpus for Interpreting What Courts Find, How They Reason, and What They Rule

cs.CL · 2026-06-04 · accept · novelty 8.0

HKJudge is a new ~290k-sentence expert-annotated corpus of Hong Kong criminal judgments with 26 rhetorical roles and 3 sentencing elements, plus benchmarks on classification and extraction tasks.

RRP-Voice: A Longitudinal Dataset and Benchmark for Recurrent Respiratory Papillomatosis Detection

eess.AS · 2026-06-01 · unverdicted · novelty 8.0

Introduces the first longitudinal voice dataset for RRP with benchmarks across handcrafted features, deep networks, self-supervised models, and audio LLMs under patient-level validation.

VideoFDB: Evaluating Full-Duplex Vision-Speech Capabilities in Conversational Agents

cs.CV · 2026-05-28 · unverdicted · novelty 8.0

VideoFDB is a new benchmark and LM-as-judge framework for evaluating full-duplex audio-visual-to-audio-visual conversational agents on nonverbal dynamics from real video calls.

EgoIntrospect: An Egocentric Dataset and Benchmark for User-Centric Internal State Reasoning

cs.CV · 2026-05-17 · unverdicted · novelty 8.0

EgoIntrospect provides the first egocentric dataset with self-annotations for internal state tasks and shows multimodal LLMs struggle to infer subjective states from combined signals.

Tracing Persona Vectors Through LLM Pretraining

cs.CL · 2026-05-13 · unverdicted · novelty 8.0

Persona vectors form within the first 0.22% of LLM pretraining and remain effective for steering post-trained models, with continued refinement and transfer to other models.

Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models

cs.AR · 2026-05-11 · conditional · novelty 8.0

Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.

Knowledge Poisoning Attacks on Medical Multi-Modal Retrieval-Augmented Generation

cs.CR · 2026-05-11 · unverdicted · novelty 8.0

M³Att poisons medical multimodal RAG by pairing covert textual misinformation with query-agnostic visual perturbations that increase retrieval of the bad content, causing LLMs to generate clinically plausible but incorrect responses.

Omni-DeepSearch: A Benchmark for Audio-Driven Omni-Modal Deep Search

cs.SD · 2026-05-09 · unverdicted · novelty 8.0

Omni-DeepSearch is a 640-sample benchmark for audio-driven omni-modal search where the best model reaches only 43.44% accuracy, exposing bottlenecks in audio inference, tool use, and cross-modal reasoning.

TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

cs.CV · 2026-05-08 · unverdicted · novelty 8.0

TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images

cs.CV · 2026-04-23 · unverdicted · novelty 8.0

S1-VL combines structured scientific reasoning with iterative image manipulation via code execution to reach state-of-the-art results on visual and scientific reasoning benchmarks.

Lost in Translation: Do LVLM Judges Generalize Across Languages?

cs.CL · 2026-04-21 · unverdicted · novelty 8.0

MM-JudgeBench shows substantial cross-lingual performance variance in 22 LVLM judges, with model size and architecture as poor predictors of multilingual robustness.

HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models

cs.SD · 2026-04-21 · unverdicted · novelty 8.0

HalluAudio is the first large-scale benchmark spanning speech, environmental sound, and music that uses human-verified QA pairs, adversarial prompts, and mixed-audio tests to measure hallucinations in large audio-language models.

When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models

cs.CV · 2026-04-19 · unverdicted · novelty 8.0

VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-specific experts and adaptive routing.

Narrative over Numbers: The Identifiable Victim Effect and its Amplification Under Alignment and Reasoning in Large Language Models

cs.CL · 2026-04-13 · conditional · novelty 8.0

Large language models display the identifiable victim effect at roughly twice the human baseline, strongly amplified by instruction tuning and chain-of-thought prompting but inverted by reasoning-specialized models.

MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark

cs.CV · 2026-04-12 · unverdicted · novelty 8.0 · 2 refs

MMRareBench provides 1,756 QA pairs and 7,958 images from PMC rare-disease cases to evaluate 23 MLLMs, revealing low treatment-planning scores and medical models underperforming general models on multi-image tasks due to capacity dilution.

HM-Bench: A Comprehensive Benchmark for Multimodal Large Language Models in Hyperspectral Remote Sensing

cs.CV · 2026-04-10 · accept · novelty 8.0

HM-Bench is the first benchmark for MLLMs on hyperspectral images, showing models struggle with complex spatial-spectral reasoning and perform better with visual PCA images than textual reports.

DialBGM: A Benchmark for Background Music Recommendation from Everyday Multi-Turn Dialogues

cs.AI · 2026-04-09 · unverdicted · novelty 8.0

DialBGM is a new benchmark dataset revealing that existing AI models fall far short of human performance when recommending fitting background music for open-domain conversations.

V2X-QA: A Comprehensive Reasoning Dataset and Benchmark for Multimodal Large Language Models in Autonomous Driving Across Ego, Infrastructure, and Cooperative Views

cs.RO · 2026-04-03 · conditional · novelty 8.0

V2X-QA provides a view-decoupled benchmark showing infrastructure views aid macroscopic traffic understanding while cooperative reasoning requires explicit cross-view alignment, with V2X-MoE as a routing-based baseline that improves performance.

ScreenParse: Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision

cs.CV · 2026-02-15 · conditional · novelty 8.0

ScreenParse dataset and ScreenVLM model deliver dense screen parsing that outperforms larger VLMs on PageIoU and transfers to better UI grounding.

EgoSound: Benchmarking Sound Understanding in Egocentric Videos

cs.CV · 2026-02-15 · unverdicted · novelty 8.0

EgoSound is a new benchmark with 7315 QA pairs across seven tasks to evaluate egocentric sound understanding in multimodal large language models.

VLRS-Bench: A Vision-Language Reasoning Benchmark for Remote Sensing

cs.CV · 2026-02-04 · unverdicted · novelty 8.0

VLRS-Bench is the first benchmark dedicated to complex vision-language reasoning in remote sensing, with 2000 QA pairs across 14 tasks in cognition, decision, and prediction dimensions.

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

cs.CV · 2026-01-15 · unverdicted · novelty 8.0

Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.

ConceptPose: Training-Free Zero-Shot Object Pose Estimation using Concept Vectors

cs.CV · 2025-12-09 · unverdicted · novelty 8.0

ConceptPose delivers state-of-the-art zero-shot relative pose estimation by matching open-vocabulary 3D concept vectors derived from VLM saliency maps, beating the strongest baseline by 62% in ADD(-S) without training.

ToG-Bench: Task-Oriented Spatio-Temporal Grounding in Egocentric Videos

cs.CV · 2025-12-03 · accept · novelty 8.0

ToG-Bench is the first benchmark for task-oriented spatio-temporal video grounding in egocentric videos, with explicit-implicit dual grounding and one-to-many object scenarios across 100 ScanNet clips and 2704 instructions.

citing papers explorer

Showing 50 of 139 citing papers after filters.

DialBGM: A Benchmark for Background Music Recommendation from Everyday Multi-Turn Dialogues cs.AI · 2026-04-09 · unverdicted · none · ref 6 · internal anchor
DialBGM is a new benchmark dataset revealing that existing AI models fall far short of human performance when recommending fitting background music for open-domain conversations.
Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark cs.AI · 2025-09-30 · unverdicted · none · ref 22 · internal anchor
CritPt benchmark shows state-of-the-art LLMs reach only 5.7% average accuracy on full-scale unpublished physics research tasks, rising to about 10% with coding tools.
Self-Supervised Theorem Discovery in a Formal Axiomatic System cs.AI · 2026-06-27 · unverdicted · none · ref 33 · internal anchor
A self-supervised agent alternates proof search and theorem extraction in a formal system, discovers tens of thousands of theorems, solves human benchmarks, and boosts LLM proof performance when used as lemmas.
Forecasting Future Behavior as a Learning Task cs.AI · 2026-06-09 · unverdicted · none · ref 45 · internal anchor
Behavior Forecasters trained on LRM trajectories outperform larger models in predicting repeatability and input sensitivity at low cost.
SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks cs.AI · 2026-06-08 · unverdicted · none · ref 61 · internal anchor
SpatialWorld is a new multi-simulator benchmark showing top multimodal agents achieve under 18% success on interactive spatial tasks requiring active exploration and long-horizon planning.
Optical Reasoning: Rethinking Images as an Expressive Reasoning Medium Beyond Text cs.AI · 2026-06-08 · unverdicted · none · ref 17 · internal anchor
Optical reasoning encodes rationales in images rather than text, matching or exceeding text-based performance on math, science, and multimodal benchmarks while cutting tokens by 28.57% on language tasks and 16% on multimodal tasks.
When No Answer Is Correct: Diagnosing Absent Answer Detection for MLLMs in Video Understanding cs.AI · 2026-06-06 · accept · none · ref 20 · internal anchor
MLLMs fail to detect absent correct answers in video QA tasks across three evaluation settings, defaulting to distractors even with chain-of-thought prompting.
FAM-Bench: A Multimodal Benchmark for Condition-Aware Food-as-Medicine Reasoning cs.AI · 2026-05-29 · unverdicted · none · ref 2 · internal anchor
FAM-Bench introduces 2500 nutrition-expert-verified multimodal instances across 13 conditions for dish suitability assessment and comparative ranking tasks.
OmniMatBench: A Human-Calibrated Multimodal Reasoning Benchmark Across 19 Materials Science Subfields cs.AI · 2026-05-28 · unverdicted · none · ref 8 · internal anchor
OmniMatBench is a new human-calibrated benchmark for multimodal materials-science reasoning that reveals the best evaluated MLLM scores only 0.372 overall.
VeriTrip: A Verifiable Benchmark for Travel Planning Agents over Unstructured Web Corpora cs.AI · 2026-05-27 · unverdicted · none · ref 12 · internal anchor
VeriTrip is a new benchmark using a Multimodal Retrieval Base and Verifiable Knowledge Base to evaluate evidence-grounded reasoning and factual reliability in travel planning agents over unstructured multimodal web data.
VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions cs.AI · 2026-05-26 · unverdicted · none · ref 65 · internal anchor
VitaBench 2.0 introduces a benchmark for long-term personalized and proactive agent behavior, with results indicating substantial gaps in current frontier LLMs.
Evaluating Interactive Reasoning in Large Language Models: A Hierarchical Benchmark with Executable Games cs.AI · 2026-05-26 · unverdicted · none · ref 20 · internal anchor
The paper introduces a multi-turn interactive benchmark using 474 executable games to evaluate LLMs on evidence acquisition, belief updating, contextual robustness, and metacognitive adaptation, revealing large performance gaps and sensitivity to perturbations.
Behind EvoMap: Characterizing a Self-Evolving Agent-to-Agent Collaboration Network cs.AI · 2026-05-25 · unverdicted · none · ref 5 · internal anchor
Empirical study of EvoMap shows 98% of assets never reused, scores driven by self-reported metadata, and 84% of assets using vacuous validation tests.
Measuring Cross-Modal Synergy: A Benchmark for VLM Explainability cs.AI · 2026-05-21 · unverdicted · none · ref 2 · internal anchor
Introduces Synergistic Faithfulness metric based on Shapley Interaction Index to evaluate cross-modal synergy in VLM explainers, revealing over-reliance on visual salience in existing methods.
Perception or Prejudice: Can MLLMs Go Beyond First Impressions of Personality? cs.AI · 2026-05-21 · unverdicted · none · ref 11 · internal anchor
Introduces the Grounded Personality Reasoning task and MM-OCEAN dataset to show that MLLMs frequently produce correct Big Five personality ratings without grounding them in observable video evidence.
PRISM: A Benchmark for Programmatic Spatial-Temporal Reasoning cs.AI · 2026-05-19 · unverdicted · none · ref 12 · internal anchor
PRISM benchmark of over 10k pairs shows LLMs have a 41% average drop from code execution success to spatial correctness in programmatic video generation.
SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science cs.AI · 2026-05-18 · unverdicted · none · ref 27 · internal anchor
SCICONVBENCH is a new benchmark evaluating LLMs on multi-turn disambiguation and inconsistency resolution for task formulation in computational science, with frontier models reaching only 52.7% success on fluid mechanics disambiguation cases.
CAM-Bench: A Benchmark for Computational and Applied Mathematics in Lean cs.AI · 2026-05-17 · accept · none · ref 6 · internal anchor
CAM-Bench is a new Lean 4 theorem-proving benchmark of 1,000 problems in computational and applied mathematics, built from textbook exercises using a dependency-recovery pipeline to reconstruct local context.
ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents cs.AI · 2026-05-13 · unverdicted · none · ref 71 · 2 links · internal anchor
ClawForge is a generator framework that creates reproducible executable benchmarks for command-line agents under state conflict, with ClawForge-Bench showing frontier models reach at most 45.3% strict accuracy and that state inspection drives most performance gaps.
Formalize, Don't Optimize: The Heuristic Trap in LLM-Generated Combinatorial Solvers cs.AI · 2026-05-12 · unverdicted · none · ref 13 · internal anchor
LLM-generated combinatorial solvers achieve highest correctness when the model formalizes problems for verified backends rather than attempting to optimize search, which often causes regressions.
Towards Automated Air Traffic Safety Assessment Around Non-Towered Airports Using Large Language Models cs.AI · 2026-05-12 · unverdicted · none · ref 30 · internal anchor
Large language models achieve macro F1 scores above 0.85 on binary nominal-versus-danger classification from CTAF radio transcripts and METAR weather data using a new synthetic dataset with a 12-category hazard taxonomy.
Breaking $\textit{Winner-Takes-All}$: Cooperative Policy Optimization Improves Diverse LLM Reasoning cs.AI · 2026-05-12 · unverdicted · none · ref 7 · 2 links · internal anchor
GCPO uses team-level credit assignment via determinant volume over reward-weighted semantic embeddings to promote non-redundant correct reasoning paths, improving both accuracy and diversity in LLM training.
Do Vision-Language-Models show human-like logical problem-solving capability in point and click puzzle games? cs.AI · 2026-05-11 · unverdicted · none · ref 26 · internal anchor
VLATIM benchmark reveals large VLMs excel at high-level planning in physics puzzles but struggle with precise visual grounding and mouse control, so they lack human-like problem-solving capabilities.
EpiGraph: Building Generalists for Evidence-Intensive Epilepsy Reasoning in the Wild cs.AI · 2026-05-10 · conditional · none · ref 46 · 2 links · internal anchor
EpiGraph creates a heterogeneous epilepsy knowledge graph that boosts LLM performance on clinical reasoning tasks by 30-41% in pharmacogenomics when used with Graph-RAG.
MolRecBench-Wild: A Real-World Benchmark for Optical Chemical Structure Recognition cs.AI · 2026-05-07 · unverdicted · none · ref 6 · internal anchor
MolRecBench-Wild reveals that 18 existing OCSR models suffer severe performance drops on complex real-world academic molecular images compared with prior patent benchmarks.
CP-SynC: Multi-Agent Zero-Shot Constraint Modeling in MiniZinc with Synthesized Checkers cs.AI · 2026-05-03 · unverdicted · none · ref 9 · internal anchor
CP-SynC uses coordinated LLM agents to generate, validate via synthesized checkers, and select MiniZinc models from natural language, substantially outperforming baselines on a 100-problem benchmark.
EO-Gym: A Multimodal, Interactive Environment for Earth Observation Agents cs.AI · 2026-05-02 · unverdicted · none · ref 63 · internal anchor
EO-Gym supplies an executable multimodal environment and 9k-trajectory benchmark that turns Earth Observation into a tool-using, multi-step reasoning task, revealing that current VLMs struggle on temporal and cross-sensor workflows while fine-tuning lifts Pass@3 from 0.49 to 0.74.
SciEval: A Benchmark for Automatic Evaluation of K-12 Science Instructional Materials cs.AI · 2026-04-28 · unverdicted · none · ref 8 · internal anchor
SciEval is a new benchmark of expert-annotated K-12 science lessons for LLM-based automatic evaluation, where zero-shot models perform poorly but fine-tuning yields up to 11% gains.
CT-FineBench: A Diagnostic Fidelity Benchmark for Fine-Grained Evaluation of CT Report Generation cs.AI · 2026-04-27 · unverdicted · none · ref 1 · internal anchor
CT-FineBench is a QA-based benchmark that evaluates fine-grained factual consistency of generated CT reports by probing specific clinical attributes such as location, size, and margin.
Using large language models for embodied planning introduces systematic safety risks cs.AI · 2026-04-20 · unverdicted · none · ref 84 · internal anchor
LLM planners for robots often produce dangerous plans even when planning succeeds, with safety awareness staying flat as model scale improves planning ability.
From Reactive to Proactive: Assessing the Proactivity of Voice Agents via ProVoice-Bench cs.AI · 2026-04-16 · unverdicted · none · ref 12 · internal anchor
ProVoice-Bench is the first framework to evaluate proactive voice agents, revealing that state-of-the-art multimodal LLMs struggle with over-triggering and context-aware reasoning.
MirrorBench: Evaluating Self-centric Intelligence in MLLMs by Introducing a Mirror cs.AI · 2026-04-16 · unverdicted · none · ref 27 · internal anchor
MirrorBench reveals that leading MLLMs perform far below humans on tasks requiring self-referential perception and representation, even at the simplest level.
[Emerging Ideas] Artificial Tripartite Intelligence: A Bio-Inspired, Sensor-First Architecture for Physical AI cs.AI · 2026-04-15 · unverdicted · none · ref 14 · internal anchor
ATI is a tripartite bio-inspired architecture for physical AI that co-designs sensing and inference, shown in a camera prototype to raise accuracy from 53.8% to 88% and cut remote invocations by 43.3%.
GeoAgentBench: A Dynamic Execution Benchmark for Tool-Augmented Agents in Spatial Analysis cs.AI · 2026-04-15 · unverdicted · none · ref 3 · internal anchor
GeoAgentBench supplies a live execution environment and Plan-and-React architecture that lets tool-using AI agents handle multi-step GIS tasks more robustly than prior static evaluation methods.
Tracing the Roots: A Multi-Agent Framework for Uncovering Data Lineage in Post-Training LLMs cs.AI · 2026-04-12 · unverdicted · none · ref 11 · internal anchor
A multi-agent framework reconstructs the evolutionary graph of post-training LLM datasets, revealing domain patterns like vertical refinement in math data and systemic issues like redundancy and benchmark contamination, then applies it to create a more diverse lineage-aware dataset.
TimeSeriesExamAgent: Creating Time Series Reasoning Benchmarks at Scale cs.AI · 2026-04-11 · conditional · none · ref 10 · internal anchor
TimeSeriesExamAgent combines templates and LLM agents to generate scalable time series reasoning benchmarks, demonstrating that current LLMs have limited performance on both abstract and domain-specific tasks.
AI Achieves a Perfect LSAT Score cs.AI · 2026-04-11 · unverdicted · none · ref 9 · internal anchor
Language models achieve a perfect LSAT score, with experiments showing that internal thinking phases and a fine-tuned process reward model are key to high performance on logical reasoning questions.
DRBENCHER: Can Your Agent Identify the Entity, Retrieve Its Properties and Do the Math? cs.AI · 2026-04-10 · unverdicted · none · ref 4 · internal anchor
DRBENCHER generates multi-hop questions across biochemistry, finance, geophysics, security, and history that test interleaved browsing and computation, where the strongest models reach only 20% accuracy and human validation finds 76% validity.
IoT-Brain: Grounding LLMs for Semantic-Spatial Sensor Scheduling cs.AI · 2026-04-09 · unverdicted · none · ref 26 · internal anchor
IoT-Brain uses a neuro-symbolic Spatial Trajectory Graph to ground LLMs for verifiable semantic-spatial sensor scheduling, achieving 37.6% higher task success with lower resource use on a campus-scale benchmark.
PERMA: Benchmarking Personalized Memory Agents via Event-Driven Preference and Realistic Task Environments cs.AI · 2026-03-24 · unverdicted · none · ref 7 · internal anchor
PERMA is a new benchmark using temporally ordered events, text variability, and linguistic alignment to evaluate LLM memory agents on persona consistency beyond simple retrieval.
Evaluating the Search Agent in a Parallel World cs.AI · 2026-03-05 · unverdicted · none · ref 5 · internal anchor
Mind-ParaWorld creates parallel worlds with atomic facts to evaluate search agents on future scenarios, showing they synthesize evidence well but struggle with collection, coverage, sufficiency judgment, and stopping decisions.
ProactiveMobile: A Comprehensive Benchmark for Boosting Proactive Intelligence on Mobile Devices cs.AI · 2026-02-25 · conditional · none · ref 7 · internal anchor
ProactiveMobile is a new benchmark for proactive mobile agents that tests latent intent inference from context and executable API generation, where a fine-tuned 7B model reaches 19.15% success versus 15.71% for o1 and 7.39% for GPT-5.
LLM-Assisted Op-Amp Behavioral-Level Design via Agentic Human-Mimicking Reasoning cs.AI · 2026-01-29 · unverdicted · none · ref 14 · internal anchor
White-Op uses LLM agents for interpretable op-amp behavioral design via formalized symbolic reasoning, pole-zero handling, and iterative simulation-based refinement, succeeding on all 9 tested topologies with 8.52% average error where black-box methods fail.
CORE: Concept-Oriented Reinforcement for Bridging the Definition-Application Gap in Mathematical Reasoning cs.AI · 2025-12-21 · unverdicted · none · ref 7 · internal anchor
CORE is a concept-oriented RL method that synthesizes quizzes, injects concept snippets into rollouts, and reinforces conceptual trajectories to close the gap between restating definitions and applying them in math problems.
Agile Deliberation: Concept Deliberation for Subjective Visual Classification cs.AI · 2025-12-11 · conditional · none · ref 4 · internal anchor
Agile Deliberation improves F1 scores by 7.5% over automated baselines and 3% over manual deliberation in 18 user sessions by supporting iterative refinement of subjective visual concepts.
Coevolutionary Continuous Discrete Diffusion: Make Your Diffusion Language Model a Latent Reasoner cs.AI · 2025-10-03 · unverdicted · none · ref 8 · internal anchor
CCDD defines a joint multimodal diffusion on continuous representation space and discrete token space to combine expressivity with explicit token supervision for diffusion language models.
KompeteAI: Accelerated Autonomous Multi-Agent System for End-to-End Pipeline Generation for Machine Learning Problems cs.AI · 2025-08-13 · unverdicted · none · ref 3 · internal anchor
KompeteAI accelerates AutoML pipeline evaluation 6.9 times and beats prior systems by 3% on MLE-Bench through candidate merging, external RAG, and predictive early scoring.
MathArena: Evaluating LLMs on Uncontaminated Math Competitions cs.AI · 2025-05-29 · unverdicted · none · ref 31 · internal anchor
MathArena evaluates over 50 LLMs on 162 fresh competition problems across seven contests, detects contamination in AIME 2024, and reports top models scoring below 40 percent on IMO 2025 proof tasks.
Benchmarking Large Language Models on Floating-Point Error Classification cs.AI · 2026-06-30 · unverdicted · none · ref 8 · internal anchor
Introduces InterFLOPBench benchmark and evaluates 14 LLMs on multi-label classification of six floating-point error categories in C code, with top models exceeding 0.88 overall F1 but lower scores on subtle errors like underflow.
Open Problems in Constitutional Preference Reconstruction cs.AI · 2026-06-29 · unverdicted · none · ref 26 · internal anchor
Empirical analysis across three datasets identifies three open problems in constitutional preference reconstruction and shows that principle refinement raises inter-executor agreement from 73% to 78%.

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer