OpenAI GPT-5 System Card

2025 · cs.CL · arXiv 2601.03267

Mixed citation behavior. The most common role is background (51% of classified citations).

358 Pith papers citing it

abstract

This is the system card published alongside the OpenAI GPT-5 launch, August 2025. GPT-5 is a unified system with a smart and fast model that answers most questions, a deeper reasoning model for harder problems, and a real-time router that quickly decides which model to use based on conversation type, complexity, tool needs, and explicit intent (for example, if you say 'think hard about this' in the prompt). The router is continuously trained on real signals, including when users switch models, preference rates for responses, and measured correctness, improving over time. Once usage limits are reached, a mini version of each model handles remaining queries. This system card focuses primarily on gpt-5-thinking and gpt-5-main, while evaluations for other models are available in the appendix. The GPT-5 system not only outperforms previous models on benchmarks and answers questions more quickly, but -- more importantly -- is more useful for real-world queries. We've made significant advances in reducing hallucinations, improving instruction following, and minimizing sycophancy, and have leveled up GPT-5's performance in three of ChatGPT's most common uses: writing, coding, and health. All of the GPT-5 models additionally feature safe-completions, our latest approach to safety training to prevent disallowed content. Similarly to ChatGPT agent, we have decided to treat gpt-5-thinking as High capability in the Biological and Chemical domain under our Preparedness Framework, activating the associated safeguards. While we do not have definitive evidence that this model could meaningfully help a novice to create severe biological harm -- our defined threshold for High capability -- we have chosen to take a precautionary approach.

citation-role summary

background 43 · baseline 23 · method 7 · dataset 3 · other 3

citation-polarity summary

background 40 · baseline 23 · use method 7 · unclear 5 · use dataset 3 · support 1

representative citing papers

AI-Assisted Peer Review at Scale: The AAAI-26 AI Review Pilot

cs.AI · 2026-04-15 · conditional · novelty 9.0

AI reviews for all 22,977 AAAI-26 papers were preferred by authors and PC members over human reviews on accuracy and suggestions and outperformed baselines at spotting weaknesses.

Where to Look: Can Foundation Models Reach a Target Viewpoint Through Active Exploration?

cs.CV · 2026-05-31 · accept · novelty 8.0

Introduces the TVR active viewpoint-matching task and TVRBench indoor simulation benchmark, where foundation models start at low single-digit success rates but reach 51.4% after visual-action SFT and multi-turn GRPO post-training.

AMNESIA: A Large Scale Medical Unlearning Benchmark Suite with Disease-Informed Analysis

cs.LG · 2026-05-28 · unverdicted · novelty 8.0

AMNESIA is a benchmark suite of 70,560 medical QA pairs that evaluates unlearning methods and shows that patient-level unlearning erodes disease-shared knowledge.

FlowCompile: An Optimizing Compiler for Structured LLM Workflows

cs.CL · 2026-05-13 · unverdicted · novelty 8.0

FlowCompile performs compile-time design space exploration on structured LLM workflows to produce reusable high-quality configuration sets that outperform routing baselines with up to 6.4x speedup.

Knowledge Poisoning Attacks on Medical Multi-Modal Retrieval-Augmented Generation

cs.CR · 2026-05-11 · unverdicted · novelty 8.0

M³Att poisons medical multimodal RAG by pairing covert textual misinformation with query-agnostic visual perturbations that increase retrieval of the bad content, causing LLMs to generate clinically plausible but incorrect responses.

Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs

cs.CL · 2026-05-09 · unverdicted · novelty 8.0 · 2 refs

Soohak is a 439-problem mathematician-curated benchmark where frontier LLMs reach at most 30.4% on research math challenges and no model exceeds 50% on refusal for ill-posed problems.

TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

cs.CV · 2026-05-08 · unverdicted · novelty 8.0

TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations

cs.CV · 2026-04-20 · unverdicted · novelty 8.0

EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.

When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models

cs.CV · 2026-04-19 · unverdicted · novelty 8.0

VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-specific experts and adaptive routing.

OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation

cs.CL · 2026-04-13 · unverdicted · novelty 8.0

OccuBench is a new benchmark for AI agents on real-world occupational tasks via LLM-driven simulators, showing no model dominates all industries, implicit faults are hardest, and larger models with more reasoning perform better.

MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark

cs.CV · 2026-04-12 · unverdicted · novelty 8.0 · 2 refs

MMRareBench provides 1,756 QA pairs and 7,958 images from PMC rare-disease cases to evaluate 23 MLLMs, revealing low treatment-planning scores and medical models underperforming general models on multi-image tasks due to capacity dilution.

DragOn: A Benchmark and Dataset for Drag-Based GUI Interactions

cs.AI · 2026-06-04 · unverdicted · novelty 7.0

DragOn provides a new drag-grounding benchmark and training dataset for GUI agents, with evaluations suggesting potential improvements on computer-use tasks.

OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification

cs.LG · 2026-05-31 · unverdicted · novelty 7.0

OmniOPD replaces token-level logit matching in on-policy distillation with Monte Carlo chunk-level semantic verification and a peak-entropy scheduler.

MM-Snowball: Evaluating and Mitigating Hallucination Snowballing in Multimodal Multi-Turn Dialogue

cs.CV · 2026-05-30 · unverdicted · novelty 7.0

MM-Snowball benchmark diagnoses hallucination snowballing in multi-turn MLLM dialogues; CAVR mitigates it via dual visual rectification at representation and logit levels.

Every Act Has Its Price: Compressed Moral Composition in Frontier LLMs

cs.CL · 2026-05-29 · unverdicted · novelty 7.0

Moral Trolley Arena shows frontier LLMs produce composite moral preferences that are compressed rather than additive functions of calibrated component act strengths across Moral Foundations Theory.

Seeing Isn't Knowing: Do VLMs Know When Not to Answer Spatial Questions (and Why)?

cs.CV · 2026-05-28 · unverdicted · novelty 7.0

Frontier VLMs overconfidently answer spatial questions under occlusion (~30% accuracy) and perspective ambiguity (<10% accuracy) instead of abstaining, and often fail to select helpful additional views.

Why Far Looks Up: Probing Spatial Representation in Vision-Language Models

cs.CV · 2026-05-28 · conditional · novelty 7.0

VLMs exhibit consistent vertical-distance entanglement in embeddings from perspective bias in natural images, producing accuracy gaps that a new synthetic benchmark SpatialTunnel exposes as model-intrinsic.

EvoRepair: Enhancing Vulnerability Repair Agents Through Experience-Based Self-Evolution

cs.SE · 2026-05-28 · unverdicted · novelty 7.0

EvoRepair is the first experience-based self-evolving agent framework for automated vulnerability repair, reporting 90.46% overall success on PATCHEVAL and SEC-bench benchmarks.

CardioLens: Revealing the Clinical Reality Gap of MLLMs via Multi-Sequence Cardiac MRI Evaluations

cs.CV · 2026-05-28 · unverdicted · novelty 7.0

CardioLens is a leakage-resistant CMR testbed of 473k slices and 13k QA pairs showing current MLLMs exhibit a large clinical reality gap with category-collapse failures on real workflows.

Can LLMs Use Linguistic Uncertainty Markers to Reliably Reflect Intrinsic Confidence?

cs.CL · 2026-05-27 · unverdicted · novelty 7.0

LLMs struggle to associate epistemic markers with stable internal confidence levels across distributions, even under model-centric interpretations, while maintaining somewhat consistent marker rankings.

Beyond One Path: Evaluating and Enhancing Divergent Thinking in Interactive LLM Agents

cs.CL · 2026-05-27 · unverdicted · novelty 7.0

Introduces MUTATE benchmark for path-level and action-level divergent thinking in LLM agents and ReDNA method that decouples divergent generation from convergent selection to improve performance.

Black-box Membership Inference Attacks on the Pre-training Data of Image-generation Models

cs.CV · 2026-05-26 · unverdicted · novelty 7.0

SD-MIA is a black-box membership inference attack that detects pre-training data in diffusion models via cross-modal perturbations on images and textual instructions.

Towards Error-Free EHRs: Reasoning-Intensive Consistency Verification Between Clinical Notes and Structured Tables in Electronic Health Records

cs.CL · 2026-05-26 · unverdicted · novelty 7.0

Introduces EHR-ReasonCon benchmark with expert annotations and EHR-Inspector LLM framework for reasoning-intensive verification of consistency between clinical notes and structured tables in EHRs.

JobBench: Aligning Agent Work With Human Will

cs.AI · 2026-05-25 · unverdicted · novelty 7.0

JobBench is a new benchmark with 130 occupational tasks where the best of 36 tested AI models achieves only 45.9% success.

citing papers explorer

Showing 50 of 358 citing papers.

Only Say What You Know: Calibration-Aware Generation for Long-Form Factuality cs.CL · 2026-05-03 · unverdicted · none · ref 2 · internal anchor
Exploration-Commitment Decoupling instantiated as Calibration-Aware Generation improves long-form factuality by up to 13% and reduces decoding time by up to 37% on five benchmarks.
Heterogeneous Scientific Foundation Model Collaboration cs.AI · 2026-04-30 · unverdicted · none · ref 56 · internal anchor
Eywa enables language-based agentic AI systems to collaborate with specialized scientific foundation models for improved performance on structured data tasks.
UniEP: Unified Expert-Parallel MoE MegaKernel for LLM Training cs.DC · 2026-04-21 · unverdicted · none · ref 37 · internal anchor
UniEP fuses MoE communication and computation into unified MegaKernels with deterministic token ordering, delivering 1.03x-1.38x speedups over prior work while preserving training accuracy.
Do Emotions Influence Moral Judgment in Large Language Models? cs.CL · 2026-04-21 · unverdicted · none · ref 7 · internal anchor
Inducing emotions shifts LLM moral judgments in a valence-dependent manner that reverses decisions in up to 20% of cases and does not appear in humans.
On the Reliability of Computer Use Agents cs.AI · 2026-04-20 · unverdicted · none · ref 1 · internal anchor
Reliability of computer-use agents depends on task specification clarity and consistency of agent behavior across repeated executions.
Logic-Based Verification of Task Allocation for LLM-Enabled Multi-Agent Manufacturing Systems cs.MA · 2026-04-18 · unverdicted · none · ref 21 · internal anchor
A verification layer based on temporal logic and discrete event systems ensures that LLM-generated task allocations in multi-robot manufacturing remain safe.
Beyond Distribution Sharpening: The Importance of Task Rewards cs.LG · 2026-04-17 · unverdicted · none · ref 37 · internal anchor
Task-reward reinforcement learning yields robust gains on math benchmarks for models like Llama-3.2-3B while distribution sharpening alone delivers only limited and unstable improvements.
Bridging the Gap between User Intent and LLM: A Requirement Alignment Approach for Code Generation cs.SE · 2026-04-17 · unverdicted · none · ref 40 · internal anchor
REA-Coder improves LLM code generation by iteratively aligning requirements with model understanding and verifying outputs against the aligned spec.
FD-NL2SQL: Feedback-Driven Clinical NL2SQL that Improves with Use cs.CL · 2026-04-17 · unverdicted · none · ref 1 · internal anchor
FD-NL2SQL is a feedback-driven clinical NL2SQL system that decomposes questions, retrieves exemplars via embeddings, synthesizes SQL, and expands its example bank from user edits plus logic-based mutations to improve without new annotations.
Do BERT Embeddings Encode Narrative Dimensions? A Token-Level Probing Analysis of Time, Space, Causality, and Character in Fiction cs.CL · 2026-04-12 · unverdicted · none · ref 7 · internal anchor
BERT embeddings encode narrative dimensions of time, space, causality, and character at the token level, as a linear probe achieves 94% accuracy versus 47% on variance-matched random embeddings, though unsupervised clusters do not align with these categories.
HOG-Layout: Hierarchical 3D Scene Generation, Optimization and Editing via Vision-Language Models cs.CV · 2026-04-12 · unverdicted · none · ref 41 · internal anchor
HOG-Layout enables text-driven hierarchical 3D scene generation, optimization, and real-time editing using LLMs, VLMs, RAG for semantic consistency, and an optimization module for physical plausibility.
Beyond Imperfect Alternatives with Rulemapping: A Neuro-Symbolic Case Study on Online Hate Speech cs.CY · 2026-04-10 · unverdicted · none · ref 48 · internal anchor
Rulemapping uses expert symbolic scaffolds to constrain LLMs, raising precision on §130(1) German hate-speech classification from 0.34-0.49 to 0.80-0.86 while preserving recall of 0.82-0.89.
OpenVLThinkerV2: A Generalist Multimodal Reasoning Model for Multi-domain Visual Tasks cs.CV · 2026-04-09 · unverdicted · none · ref 30 · internal anchor
OpenVLThinkerV2 applies a new Gaussian GRPO training objective with response and entropy shaping to outperform prior open-source and proprietary models on 18 visual reasoning benchmarks.
CoEnv: Driving Embodied Multi-Agent Collaboration via Compositional Environment cs.RO · 2026-04-07 · unverdicted · none · ref 48 · internal anchor
CoEnv introduces a compositional environment that integrates real and simulated spaces for multi-agent robotic collaboration, using real-to-sim reconstruction, VLM action synthesis, and validated sim-to-real transfer to achieve high success rates on multi-arm manipulation tasks.
ECHO: Elastic Speculative Decoding with Sparse Gating for High-Concurrency Scenarios cs.DC · 2026-03-10 · unverdicted · none · ref 19 · internal anchor
ECHO uses sparse gating and elastic budget pivoting in a super-tree structure to achieve up to 5.35x speedup for LLM inference under high concurrency.
Towards Explainable Industrial Anomaly Detection via Knowledge-Guided Latent Reasoning cs.CV · 2026-02-10 · unverdicted · none · ref 17 · internal anchor
Reason-IAD improves explainable industrial anomaly detection by combining retrieval-augmented category knowledge with entropy-guided latent reasoning and dynamic visual patch injection in MLLMs.
MAVEN: Improving Generalization in Agentic Tool Calling cs.AI · 2026-05-29 · unverdicted · none · ref 12 · internal anchor
MAVEN is a modular verification scaffold that lifts an open 120b model's tool-calling accuracy from 48% to 71% on MAVEN-Bench without retraining.
Label Over Logic? How Source Cues Bias Human Fallacy Judgments More Than LLMs cs.HC · 2026-05-28 · unverdicted · none · ref 54 · internal anchor
Humans exhibit greater source-label bias in logical fallacy judgments than LLMs, which maintain more consistent evaluations regardless of source cues.
PowLU: An Activation Function for Stable Pre-Training of LLMs cs.CL · 2026-05-25 · unverdicted · none · ref 18 · internal anchor
PowLU replaces SwiGLU with a rational-power activation to reduce outlier amplification and numerical instability during large-scale LLM pre-training while matching performance.
A Simple Plug-in for Improving Eviction-Based KV Cache Compression cs.LG · 2026-05-22 · unverdicted · none · ref 3 · internal anchor
VECTOR augments eviction-based KV cache compression with three-way token routing that combines importance scoring and offline regression-based reconstructability estimation to improve quality at high compression ratios.
Tracing the ongoing emergence of human-like reasoning in Large Language Models cs.CL · 2026-05-20 · unverdicted · none · ref 74 · internal anchor
LLMs function as accurate semantic processors for conditionals but do not replicate the pragmatic inferences that define human reasoning.
Automated Grading of Handwritten Mathematics Using Vision-Capable LLMs cs.CY · 2026-05-18 · unverdicted · none · ref 9 · internal anchor
Vision LLMs achieve high rubric-level accuracy on handwritten math but most errors stem from transcription failures rather than rubric misapplication.
It Takes Two: Complementary Self-Distillation for Contextual Integrity in LLMs cs.LG · 2026-05-18 · unverdicted · none · ref 38 · internal anchor
SELFCI uses complementary self-distillation with two reverse KL divergences to align LLMs to contextual integrity while preserving utility, outperforming RL baselines like GRPO in agentic settings.
An LLM-Based System for Argument Mining cs.CL · 2026-05-13 · unverdicted · none · ref 12 · 2 links · internal anchor
An LLM pipeline converts natural-language arguments into abstract graphs of premises, conclusions, and support/attack/undercut relations, with manual and benchmark evaluations showing adequate recovery of structure.
Qwen-Scope: Turning Sparse Features into Development Tools for Large Language Models cs.CL · 2026-05-12 · unverdicted · none · ref 13 · internal anchor
Qwen-Scope provides open-source sparse autoencoders for Qwen models that function as practical interfaces for steering, evaluating, data workflows, and optimizing large language models.
Human-LLM Dialogue Improves Diagnostic Accuracy in Emergency Care cs.AI · 2026-05-08 · unverdicted · none · ref 52 · internal anchor
Interactive LLM dialogue raised residents' hard-case diagnostic correctness from 0.589 to 0.734 and produced medium effect sizes in a blinded study of seven physicians on 52 emergency cases.
Risk Reporting for Developers' Internal AI Model Use cs.CY · 2026-04-27 · unverdicted · none · ref 32 · internal anchor
A harmonized risk reporting standard for internal frontier AI model use, structured around autonomous misbehavior and insider threats using means, motive, and opportunity factors.
Sound Agentic Science Requires Adversarial Experiments cs.AI · 2026-04-23 · unverdicted · none · ref 14 · 2 links · internal anchor
LLM agents in science accelerate plausible analyses but require adversarial experiments to search for falsifying evidence instead of crafting compelling claims.
From Handwriting to Structured Data: Benchmarking AI Digitisation of Handwritten Forms cs.CV · 2026-04-14 · unverdicted · none · ref 18 · internal anchor
Frontier multimodal LLMs achieve ~85% accuracy and ~90% weighted F1 on digitizing complex handwritten medical forms, with Gemini 3.1 strongest overall and prompt optimization lifting macro metrics over 60%.
DAT: Dual-Aware Adaptive Transmission for Efficient Multimodal LLM Inference in Edge-Cloud Systems cs.MM · 2026-04-07 · unverdicted · none · ref 33 · internal anchor
DAT combines a small-large model cascade with fine-tuning and bandwidth-aware multi-stream transmission to deliver high-accuracy event recognition and low-latency alerts for video streams in edge-cloud systems.
Are vision-language models ready to zero-shot replace supervised classification models in agriculture? cs.CV · 2025-12-17 · unverdicted · none · ref 20 · internal anchor
Zero-shot VLMs reach at most 62% accuracy on agricultural classification tasks while supervised models like YOLO11 perform markedly higher, indicating they are not ready to replace task-specific systems.
CLaC at SemEval-2026 Task 6: Response Clarity Detection in Political Discourse cs.CL · 2026-05-04 · unverdicted · none · ref 26 · internal anchor
An LLM ensemble reached 80 macro-F1 on 3-class clarity detection and 59 on 9-class evasion detection, with partial layer unfreezing and multilingual ensembles improving encoder results while enriched context helped only LLMs.
DRIVESPATIAL: A Benchmark for Spatiotemporal Intelligence in VLMs for Autonomous Driving cs.CV · 2026-05-22 · unreviewed · ref 89 · internal anchor
JMed48k: A Multi-Profession Japanese Medical Licensing Benchmark for Vision-Language Model Evaluation cs.CV · 2026-05-21 · unreviewed · ref 44 · internal anchor
Hy-MT2: A Family of Fast, Efficient and Powerful Multilingual Translation Models in the Wild cs.CL · 2026-05-21 · unreviewed · ref 88 · internal anchor
EngiAI: A Multi-Agent Framework and Benchmark Suite for LLM-Driven Engineering Design cs.AI · 2026-05-19 · unreviewed · ref 48 · internal anchor
Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation cs.CV · 2026-05-18 · unreviewed · ref 34 · internal anchor
Self-Improving CAD Generation Agents with Finite Element Analysis as Feedback cs.GR · 2026-05-17 · unreviewed · ref 25 · internal anchor
Are VLMs Seeing or Just Saying? Uncovering the Illusion of Visual Re-examination cs.CV · 2026-05-15 · unreviewed · ref 10 · internal anchor
Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models cs.CV · 2026-05-12 · unreviewed · ref 4 · 3 links · internal anchor
Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents cs.CL · 2026-05-11 · unreviewed · ref 15 · internal anchor
The Cartesian Shortcut: Re-evaluate Vision Reasoning in Polar Coordinate Space cs.CV · 2026-05-11 · unreviewed · ref 28 · internal anchor
WASIL: In-the-Wild Arabic Spoken Interactions with LLMs cs.SD · 2026-05-09 · unreviewed · ref 57 · internal anchor
CUDABeaver: Benchmarking LLM-Based Automated CUDA Debugging cs.LG · 2026-05-08 · unreviewed · ref 26 · internal anchor
Do Joint Audio-Video Generation Models Understand Physics? cs.SD · 2026-05-08 · unreviewed · ref 33 · internal anchor
Self-Driving Datasets: From 20 Million Papers to Nuanced Biomedical Knowledge at Scale cs.LG · 2026-05-07 · unreviewed · ref 62 · 2 links · internal anchor
NeuroClaw Technical Report cs.CV · 2026-04-27 · unreviewed · ref 42 · internal anchor
UKP_Psycontrol at SemEval-2026 Task 2: Modeling Valence and Arousal Dynamics from Text cs.CL · 2026-04-23 · unreviewed · ref 11 · internal anchor
Sink-Token-Aware Pruning for Fine-Grained Video Understanding in Efficient Video LLMs cs.LG · 2026-04-22 · unreviewed · ref 37 · internal anchor
Co-evolving Agent Architectures and Interpretable Reasoning for Automated Optimization cs.AI · 2026-04-20 · unreviewed · ref 69 · internal anchor
