super hub Mixed citations

OpenAI GPT-5 System Card

· 2025 · cs.CL · arXiv 2601.03267

Mixed citation behavior. Most common role is background (51%).

333 Pith papers citing it

Background 51% of classified citations

open full Pith review browse 333 citing papers arXiv PDF

abstract

This is the system card published alongside the OpenAI GPT-5 launch, August 2025. GPT-5 is a unified system with a smart and fast model that answers most questions, a deeper reasoning model for harder problems, and a real-time router that quickly decides which model to use based on conversation type, complexity, tool needs, and explicit intent (for example, if you say 'think hard about this' in the prompt). The router is continuously trained on real signals, including when users switch models, preference rates for responses, and measured correctness, improving over time. Once usage limits are reached, a mini version of each model handles remaining queries. This system card focuses primarily on gpt-5-thinking and gpt-5-main, while evaluations for other models are available in the appendix. The GPT-5 system not only outperforms previous models on benchmarks and answers questions more quickly, but -- more importantly -- is more useful for real-world queries. We've made significant advances in reducing hallucinations, improving instruction following, and minimizing sycophancy, and have leveled up GPT-5's performance in three of ChatGPT's most common uses: writing, coding, and health. All of the GPT-5 models additionally feature safe-completions, our latest approach to safety training to prevent disallowed content. Similarly to ChatGPT agent, we have decided to treat gpt-5-thinking as High capability in the Biological and Chemical domain under our Preparedness Framework, activating the associated safeguards. While we do not have definitive evidence that this model could meaningfully help a novice to create severe biological harm -- our defined threshold for High capability -- we have chosen to take a precautionary approach.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 43 baseline 23 method 7 dataset 3 other 3

citation-polarity summary

background 40 baseline 23 use method 7 unclear 5 use dataset 3 support 1

claims ledger

abstract This is the system card published alongside the OpenAI GPT-5 launch, August 2025. GPT-5 is a unified system with a smart and fast model that answers most questions, a deeper reasoning model for harder problems, and a real-time router that quickly decides which model to use based on conversation type, complexity, tool needs, and explicit intent (for example, if you say 'think hard about this' in the prompt). The router is continuously trained on real signals, including when users switch models, preference rates for responses, and measured correctness, improving over time. Once usage limits ar

co-cited works

representative citing papers

AI-Assisted Peer Review at Scale: The AAAI-26 AI Review Pilot

cs.AI · 2026-04-15 · conditional · novelty 9.0

AI reviews for all 22,977 AAAI-26 papers were preferred by authors and PC members over human reviews on accuracy and suggestions and outperformed baselines at spotting weaknesses.

Where to Look: Can Foundation Models Reach a Target Viewpoint Through Active Exploration?

cs.CV · 2026-05-31 · accept · novelty 8.0

Introduces the TVR active viewpoint-matching task and TVRBench indoor simulation benchmark, where foundation models start at low single-digit success rates but reach 51.4% after visual-action SFT and multi-turn GRPO post-training.

AMNESIA: A Large Scale Medical Unlearning Benchmark Suite with Disease-Informed Analysis

cs.LG · 2026-05-28 · unverdicted · novelty 8.0

AMNESIA is a benchmark suite of 70,560 medical QA pairs that evaluates unlearning methods and shows that patient-level unlearning erodes disease-shared knowledge.

FlowCompile: An Optimizing Compiler for Structured LLM Workflows

cs.CL · 2026-05-13 · unverdicted · novelty 8.0

FlowCompile performs compile-time design space exploration on structured LLM workflows to produce reusable high-quality configuration sets that outperform routing baselines with up to 6.4x speedup.

Knowledge Poisoning Attacks on Medical Multi-Modal Retrieval-Augmented Generation

cs.CR · 2026-05-11 · unverdicted · novelty 8.0

M³Att poisons medical multimodal RAG by pairing covert textual misinformation with query-agnostic visual perturbations that increase retrieval of the bad content, causing LLMs to generate clinically plausible but incorrect responses.

Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs

cs.CL · 2026-05-09 · unverdicted · novelty 8.0 · 2 refs

Soohak is a 439-problem mathematician-curated benchmark where frontier LLMs reach at most 30.4% on research math challenges and no model exceeds 50% on refusal for ill-posed problems.

TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

cs.CV · 2026-05-08 · unverdicted · novelty 8.0

TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations

cs.CV · 2026-04-20 · unverdicted · novelty 8.0

EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.

When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models

cs.CV · 2026-04-19 · unverdicted · novelty 8.0

VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-specific experts and adaptive routing.

OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation

cs.CL · 2026-04-13 · unverdicted · novelty 8.0

OccuBench is a new benchmark for AI agents on real-world occupational tasks via LLM-driven simulators, showing no model dominates all industries, implicit faults are hardest, and larger models with more reasoning perform better.

MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark

cs.CV · 2026-04-12 · unverdicted · novelty 8.0 · 2 refs

MMRareBench provides 1,756 QA pairs and 7,958 images from PMC rare-disease cases to evaluate 23 MLLMs, revealing low treatment-planning scores and medical models underperforming general models on multi-image tasks due to capacity dilution.

DragOn: A Benchmark and Dataset for Drag-Based GUI Interactions

cs.AI · 2026-06-04 · unverdicted · novelty 7.0

DragOn provides a new drag-grounding benchmark and training dataset for GUI agents, with evaluations suggesting potential improvements on computer-use tasks.

OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification

cs.LG · 2026-05-31 · unverdicted · novelty 7.0

OmniOPD replaces token-level logit matching in on-policy distillation with Monte Carlo chunk-level semantic verification and a peak-entropy scheduler.

MM-Snowball: Evaluating and Mitigating Hallucination Snowballing in Multimodal Multi-Turn Dialogue

cs.CV · 2026-05-30 · unverdicted · novelty 7.0

MM-Snowball benchmark diagnoses hallucination snowballing in multi-turn MLLM dialogues; CAVR mitigates it via dual visual rectification at representation and logit levels.

Every Act Has Its Price: Compressed Moral Composition in Frontier LLMs

cs.CL · 2026-05-29 · unverdicted · novelty 7.0

Moral Trolley Arena shows frontier LLMs produce composite moral preferences that are compressed rather than additive functions of calibrated component act strengths across Moral Foundations Theory.

Seeing Isn't Knowing: Do VLMs Know When Not to Answer Spatial Questions (and Why)?

cs.CV · 2026-05-28 · unverdicted · novelty 7.0

Frontier VLMs overconfidently answer spatial questions under occlusion (~30% accuracy) and perspective ambiguity (<10% accuracy) instead of abstaining, and often fail to select helpful additional views.

Why Far Looks Up: Probing Spatial Representation in Vision-Language Models

cs.CV · 2026-05-28 · conditional · novelty 7.0

VLMs exhibit consistent vertical-distance entanglement in embeddings from perspective bias in natural images, producing accuracy gaps that a new synthetic benchmark SpatialTunnel exposes as model-intrinsic.

CardioLens: Revealing the Clinical Reality Gap of MLLMs via Multi-Sequence Cardiac MRI Evaluations

cs.CV · 2026-05-28 · unverdicted · novelty 7.0

CardioLens is a leakage-resistant CMR testbed of 473k slices and 13k QA pairs showing current MLLMs exhibit a large clinical reality gap with category-collapse failures on real workflows.

Can LLMs Use Linguistic Uncertainty Markers to Reliably Reflect Intrinsic Confidence?

cs.CL · 2026-05-27 · unverdicted · novelty 7.0

LLMs struggle to associate epistemic markers with stable internal confidence levels across distributions, even under model-centric interpretations, while maintaining somewhat consistent marker rankings.

Beyond One Path: Evaluating and Enhancing Divergent Thinking in Interactive LLM Agents

cs.CL · 2026-05-27 · unverdicted · novelty 7.0

Introduces MUTATE benchmark for path-level and action-level divergent thinking in LLM agents and ReDNA method that decouples divergent generation from convergent selection to improve performance.

VideoOdyssey: A Benchmark for Ultra-Long-Context and Omni-Modal Video Understanding

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

VideoOdyssey is a new benchmark featuring ultra-long videos (avg. 109 min) across 11 domains with multi-level continuous certificates (avg. 16 min for visual, 12.8 min for audio-visual) to diagnose MLLM limitations in continuous reasoning and omni-modal perception.

Towards Clinically Interpretable Ophthalmic VQA via Spatially-Grounded Lesion Evidence

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

FundusGround is a new benchmark with 10,719 fundus images, 15,595 ETDRS-grid localized lesions, and 72,706 VQA questions to support clinically interpretable ophthalmic visual question answering.

On the Cost and Benefit of Chain of Thought: A Learning-Theoretic Perspective

cs.LG · 2026-05-20 · unverdicted · novelty 7.0

Chain of Thought risk decomposes into oracle-trajectory benefit and trajectory-mismatch cost, with stability determining bounded, linear, or exponential error growth.

Fine-grained Claim-level RAG Benchmark for Law

cs.CL · 2026-05-20 · unverdicted · novelty 7.0 · 6 refs

ClaimRAG-LAW is a French-English legal RAG benchmark with claim-level granularity for experts and non-experts that reveals limitations in current retrieval and generation performance.

citing papers explorer

Showing 50 of 333 citing papers.

Beyond the All-in-One Agent: Benchmarking Role-Specialized Multi-Agent Collaboration in Enterprise Workflows cs.MA · 2026-05-09 · unverdicted · none · ref 20 · internal anchor
EntCollabBench shows that today's LLM agents still struggle with delegation, context transfer, parameter grounding, workflow closure, and decision commitment when tested in a simulated enterprise with 11 role-specialized agents.
UniShield: Unified Face Attack Detection via KG-Informed Multimodal Reasoning cs.CV · 2026-05-09 · unverdicted · none · ref 21 · internal anchor
UniShield introduces a knowledge-graph-informed multimodal framework that improves unified detection of physical and digital face attacks through instruction tuning and consistency-optimized reasoning.
LLM-guided Semi-Supervised Approaches for Social Media Crisis Data Classification cs.AI · 2026-05-08 · conditional · none · ref 28 · internal anchor
LG-CoTrain, an LLM-guided co-training method, outperforms classical semi-supervised baselines for crisis tweet classification in low-resource settings with 5-25 labeled examples per class.
Beyond Bag-of-Patches: Learning Global Layout via Textual Supervision for Late-Interaction Visual Document Retrieval cs.CV · 2026-05-08 · unverdicted · none · ref 47 · internal anchor
A text-supervised global layout embedding augments local patch representations in late-interaction VDR, yielding +2.4 nDCG@5 and +2.3 MAP@5 gains over ColPali/ColQwen baselines on ViDoRe-v2.
SYNCR: A Cross-Video Reasoning Benchmark with Synthetic Grounding cs.CV · 2026-05-08 · unverdicted · none · ref 22 · internal anchor
SYNCR benchmark shows leading MLLMs reach only 52.5% average accuracy on cross-video reasoning tasks against an 89.5% human baseline, with major weaknesses in physical and spatial reasoning.
Beyond GSD-as-Token: Continuous Scale Conditioning for Remote Sensing VLMs cs.CV · 2026-05-08 · unverdicted · none · ref 9 · internal anchor
ScaleEarth conditions remote sensing VLMs on continuous GSD via CS-HLoRA and a visual GSD predictor, creating a closed training loop with GeoScale-VQA to achieve SOTA on Earth observation benchmarks.
Mitigating Many-shot Jailbreak Attacks with One Single Demonstration cs.CR · 2026-05-08 · conditional · none · ref 41 · internal anchor
A single safety demonstration appended at inference time mitigates many-shot jailbreak attacks by counteracting implicit malicious fine-tuning on harmful examples.
Qwen3-VL-Seg: Unlocking Open-World Referring Segmentation with Vision-Language Grounding cs.CV · 2026-05-08 · unverdicted · none · ref 15 · internal anchor
Qwen3-VL-Seg decodes MLLM bounding boxes into pixel-level referring segmentation via a lightweight box-guided mask decoder, new SA1B-ORS training data, and ORS-Bench evaluation, showing strong open-world performance.
Stateful Agent Backdoor cs.CR · 2026-05-07 · unverdicted · none · ref 24 · internal anchor
A stateful backdoor for LLM agents, modeled as a Mealy machine with a decomposition framework, enables incremental malicious actions across sessions and achieves 80-95% attack success rate on four models.
4DThinker: Thinking with 4D Imagery for Dynamic Spatial Understanding cs.CV · 2026-05-07 · unverdicted · none · ref 17 · 2 links · internal anchor
4DThinker enables VLMs to perform dynamic spatial reasoning by thinking with 4D latent mental imagery using new fine-tuning and reinforcement learning methods.
Training-Free Dense Hand Contact Estimation with Multi-Modal Large Language Models cs.CV · 2026-05-07 · unverdicted · none · ref 33 · internal anchor
ContactPrompt uses part-wise vertex grids and multi-stage part-conditioned reasoning in MLLMs to achieve training-free dense hand contact estimation that outperforms prior supervised methods.
Coral: Cost-Efficient Multi-LLM Serving over Heterogeneous Cloud GPUs cs.DC · 2026-05-05 · unverdicted · none · ref 41 · internal anchor
Coral cuts multi-LLM serving costs by up to 2.79x and raises goodput by up to 2.39x on heterogeneous GPUs through adaptive joint optimization and a lossless two-stage decomposition that solves quickly.
Pro$^2$Assist: Continuous Step-Aware Proactive Assistance with Multimodal Egocentric Perception for Long-Horizon Procedural Tasks cs.AI · 2026-05-05 · unverdicted · none · ref 61 · internal anchor
Pro²Assist uses multimodal egocentric perception from AR glasses to track fine-grained progress in long-horizon procedural tasks and deliver timely proactive assistance, outperforming baselines by over 21% in action understanding and up to 2.29x in timing accuracy.
Segmenting Human-LLM Co-authored Text via Change Point Detection cs.CL · 2026-05-05 · unverdicted · none · ref 11 · internal anchor
Adapts change point detection to segment human-LLM co-authored text using weighted and generalized algorithms with minimax optimality and strong empirical results against baselines.
LLM-Foraging: Large Language Models for Decentralized Swarm Robot Foraging cs.RO · 2026-05-02 · unverdicted · none · ref 38 · internal anchor
LLM-Foraging uses off-the-shelf LLMs for decentralized tactical decisions in CPFA-based swarm foraging, collecting more resources than GA-tuned baselines across 36 varied configurations while showing greater consistency.
OralMLLM-Bench: Evaluating Cognitive Capabilities of Multimodal Large Language Models in Dental Practice cs.CL · 2026-05-02 · unverdicted · none · ref 40 · internal anchor
OralMLLM-Bench reveals performance gaps between multimodal large language models and clinicians on cognitive tasks for dental radiographic analysis across periapical, panoramic, and cephalometric images.
EO-Gym: A Multimodal, Interactive Environment for Earth Observation Agents cs.AI · 2026-05-02 · unverdicted · none · ref 56 · internal anchor
EO-Gym supplies an executable multimodal environment and 9k-trajectory benchmark that turns Earth Observation into a tool-using, multi-step reasoning task, revealing that current VLMs struggle on temporal and cross-sensor workflows while fine-tuning lifts Pass@3 from 0.49 to 0.74.
E-MIA: Exam-Style Black-Box Membership Inference Attacks against RAG Systems cs.CR · 2026-05-01 · unverdicted · none · ref 47 · internal anchor
E-MIA converts document details into four types of exam questions and aggregates the RAG's answers into a membership score that separates member and non-member documents better than prior similarity-based or probe-based attacks.
ReLay: Personalized LLM-Generated Plain-Language Summaries for Better Understanding, but at What Cost? cs.CL · 2026-05-01 · unverdicted · none · ref 29 · internal anchor
Personalized LLM-generated plain language summaries improve lay readers' comprehension and quality ratings but increase risks of reinforcing biases and introducing hallucinations compared to static expert summaries.
AppTek Call-Center Dialogues: A Multi-Accent Long-Form Benchmark for English ASR cs.CL · 2026-04-30 · unverdicted · none · ref 26 · internal anchor
A new multi-accent long-form call-center dialogue dataset for English ASR evaluation shows substantial performance variation across accents and segmentation methods.
ClassEval-Pro: A Cross-Domain Benchmark for Class-Level Code Generation cs.SE · 2026-04-29 · unverdicted · none · ref 39 · internal anchor
ClassEval-Pro benchmark shows frontier LLMs achieve at most 45.6% Pass@1 on class-level code tasks, with logic errors (56%) and dependency errors (38%) as dominant failure modes.
KinDER: A Physical Reasoning Benchmark for Robot Learning and Planning cs.RO · 2026-04-28 · unverdicted · none · ref 100 · internal anchor
KinDER is a new open-source benchmark that demonstrates substantial gaps in current robot learning and planning methods for handling physical constraints.
Benchmarking and Improving GUI Agents in High-Dynamic Environments cs.CV · 2026-04-28 · unverdicted · none · ref 31 · 2 links · internal anchor
DynamicUI improves GUI agent performance in high-dynamic environments by processing interaction videos with frame clustering, action-conditioned refinement, and reflection, outperforming prior approaches on the new DynamicGUIBench spanning ten applications.
A systematic evaluation of vision-language models for observational astronomical reasoning tasks cs.AI · 2026-04-27 · accept · none · ref 33 · internal anchor
Vision-language models underperform specialized astronomical methods on real observational data, with accuracy improving when physical explanations are provided in prompts and when raw numerical measurements replace rendered plots.
EmoTrans: A Benchmark for Understanding, Reasoning, and Predicting Emotion Transitions in Multimodal LLMs cs.CV · 2026-04-25 · unverdicted · none · ref 27 · internal anchor
EmoTrans is a new video benchmark with four progressive tasks that measures how well current multimodal LLMs handle dynamic emotion transitions rather than static recognition.
Evaluating Temporal Consistency in Multi-Turn Language Models cs.CL · 2026-04-24 · unverdicted · none · ref 33 · internal anchor
Language models frequently violate temporal scope stability in multi-turn dialogues by drifting toward present-day assumptions even when they possess the correct facts.
SketchVLM: Vision language models can annotate images to explain thoughts and guide users cs.CV · 2026-04-23 · unverdicted · none · ref 30 · internal anchor
SketchVLM lets VLMs generate non-destructive SVG annotations on input images to visually explain answers, raising visual reasoning accuracy by up to 28.5 points and annotation quality by 1.48x over baselines.
Are LLM Uncertainty and Correctness Encoded by the Same Features? A Functional Dissociation via Sparse Autoencoders cs.LG · 2026-04-21 · unverdicted · none · ref 52 · internal anchor
Uncertainty and correctness in LLMs are encoded by distinct feature populations, with suppression of confounded features improving accuracy and reducing entropy.
EVPO: Explained Variance Policy Optimization for Adaptive Critic Utilization in LLM Post-Training cs.LG · 2026-04-21 · unverdicted · none · ref 29 · internal anchor
EVPO adaptively switches between critic-based and batch-mean advantage estimation using batch-level explained variance to provably achieve no greater variance than the better of PPO or GRPO at every step.
Using large language models for embodied planning introduces systematic safety risks cs.AI · 2026-04-20 · unverdicted · none · ref 82 · internal anchor
LLM planners for robots often produce dangerous plans even when planning succeeds, with safety awareness staying flat as model scale improves planning ability.
TacticGen: Grounding Adaptable and Scalable Generation of Football Tactics cs.AI · 2026-04-20 · conditional · none · ref 45 · internal anchor
TacticGen generates realistic, adaptable football tactics via a multi-agent diffusion transformer trained on 3.3M events and 100M frames, supporting rule-, language-, or model-based guidance at inference time.
PersonalHomeBench: Evaluating Agents in Personalized Smart Homes cs.AI · 2026-04-18 · unverdicted · none · ref 4 · internal anchor
PersonalHomeBench is a new benchmark showing that AI agents suffer systematic performance drops in personalized smart homes as task complexity rises, especially in counterfactual reasoning and partial observability.
Watching Movies Like a Human: Egocentric Emotion Understanding for Embodied Companions cs.CV · 2026-04-17 · conditional · none · ref 46 · internal anchor
Creates the first egocentric screen-view movie emotion benchmark and demonstrates that cinematic models drop sharply in Macro-F1 on realistic robot-like viewing conditions while domain-specific training improves robustness.
AnimationBench: Are Video Models Good at Character-Centric Animation? cs.CV · 2026-04-16 · unverdicted · none · ref 13 · internal anchor
AnimationBench is the first benchmark that operationalizes the twelve basic principles of animation and IP preservation into scalable, VLM-assisted metrics for animation-style I2V generation.
Psychological Steering of Large Language Models cs.CL · 2026-04-15 · unverdicted · none · ref 56 · internal anchor
Mean-difference residual stream injections outperform personality prompting for OCEAN trait steering in most LLMs, with hybrids performing best and showing approximate linearity but non-human trait covariances.
ARGOS: Who, Where, and When in Agentic Multi-Camera Person Search cs.CV · 2026-04-14 · unverdicted · none · ref 17 · internal anchor
ARGOS is the first benchmark reformulating multi-camera person search as an agentic interactive reasoning task grounded in a spatio-temporal topology graph, with 2691 tasks across three tracks where current LLMs achieve low accuracy.
RESP: Reference-guided Sequential Prompting for Visual Glitch Detection in Video Games cs.CV · 2026-04-13 · unverdicted · none · ref 22 · internal anchor
RESP uses reference-guided sequential prompting with VLMs to improve frame-level and video-level visual glitch detection in games by establishing per-video baselines.
Bringing Value Models Back: Generative Critics for Value Modeling in LLM Reinforcement Learning cs.LG · 2026-04-12 · unverdicted · none · ref 45 · internal anchor
GenAC introduces generative critics with chain-of-thought reasoning and in-context conditioning to improve value approximation and downstream RL performance in LLMs compared to value-based and value-free baselines.
Mosaic: Cross-Modal Clustering for Efficient Video Understanding cs.PF · 2026-04-11 · unverdicted · none · ref 1 · internal anchor
Mosaic uses cross-modal clusters as the unit for KVCache organization in VLMs to achieve up to 1.38x speedup in streaming long-video understanding.
From UAV Imagery to Agronomic Reasoning: A Multimodal LLM Benchmark for Plant Phenotyping cs.CV · 2026-04-10 · unverdicted · none · ref 14 · internal anchor
PlantXpert benchmark shows fine-tuned VLMs reach up to 78% accuracy on plant phenotyping but scaling gains plateau and quantitative biological reasoning remains weak.
PinpointQA: A Dataset and Benchmark for Small Object-Centric Spatial Understanding in Indoor Videos cs.CV · 2026-04-10 · unverdicted · none · ref 27 · 2 links · internal anchor
PinpointQA is the first benchmark dataset for small object-centric spatial understanding in indoor videos, with four progressive tasks built from ScanNet data.
MedConceal: A Benchmark for Clinical Hidden-Concern Reasoning Under Partial Observability cs.CL · 2026-04-09 · unverdicted · none · ref 6 · internal anchor
MedConceal provides 300 cases and a simulator that withholds hidden concerns to evaluate confirmation and intervention in medical dialogue, finding frontier models vary on surfacing concerns while humans outperform on guiding patients to care plans.
ParseBench: A Document Parsing Benchmark for AI Agents cs.CV · 2026-04-09 · accept · none · ref 24 · internal anchor
ParseBench is a new benchmark for document parsing in AI agents that reveals fragmented performance across five semantic dimensions with LlamaParse Agentic scoring highest at 84.9%.
How Independent are Large Language Models? A Statistical Framework for Auditing Behavioral Entanglement and Reweighting Verifier Ensembles cs.AI · 2026-04-08 · unverdicted · none · ref 17 · internal anchor
A new auditing framework reveals widespread behavioral entanglement among LLMs and shows that reweighting ensembles based on measured independence improves verification accuracy by up to 4.5%.
EpiBench: Benchmarking Multi-turn Research Workflows for Multimodal Agents cs.CL · 2026-04-07 · unverdicted · none · ref 24 · internal anchor
EpiBench is a new episodic multi-turn multimodal benchmark where even leading AI agents score only 29.23% on hard tasks requiring cross-paper evidence integration from figures and tables.
MMORF: A Multi-agent Framework for Designing Multi-objective Retrosynthesis Planning Systems cs.AI · 2026-04-06 · unverdicted · none · ref 27 · internal anchor
MMORF provides a modular multi-agent framework for multi-objective retrosynthesis planning, with MASIL and RFAS systems showing strong safety, cost, and success metrics on a new 218-task benchmark.
Retrieval Augmented Conversational Recommendation with Reinforcement Learning cs.IR · 2026-04-06 · unverdicted · none · ref 53 · internal anchor
RAR retrieves candidate items from a 300k-movie corpus then uses LLM generation with RL feedback to produce context-aware recommendations that outperform baselines on benchmarks.
DeonticBench: A Benchmark for Reasoning over Rules cs.CL · 2026-04-06 · unverdicted · none · ref 29 · internal anchor
DEONTICBENCH is a new benchmark of 6,232 deontic reasoning tasks from U.S. legal domains where frontier LLMs reach only ~45% accuracy and symbolic Prolog assistance plus RL training still fail to solve tasks reliably.
QAPruner: Quantization-Aware Vision Token Pruning for Multimodal Large Language Models cs.CV · 2026-04-03 · unverdicted · none · ref 18 · internal anchor
QAPruner introduces a hybrid sensitivity metric that combines group-wise quantization error simulation and outlier intensity with semantic scores to prune visual tokens, yielding 2.24% higher accuracy than naive baselines at 12.5% token retention on LLaVA models while surpassing dense low-bit models
LongTail Driving Scenarios with Reasoning Traces: The KITScenes LongTail Dataset cs.CV · 2026-03-24 · unverdicted · none · ref 62 · internal anchor
KITScenes LongTail supplies multimodal driving data and multilingual expert reasoning traces to benchmark models on rare scenarios beyond basic safety metrics.

OpenAI GPT-5 System Card

hub tools

citation-role summary

citation-polarity summary

claims ledger

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer