super hub Mixed citations

OpenAI GPT-5 System Card

· 2025 · cs.CL · arXiv 2601.03267

Mixed citation behavior. Most common role is background (51%).

441 Pith papers citing it

Background 51% of classified citations

open full Pith review browse 441 citing papers arXiv PDF

abstract

This is the system card published alongside the OpenAI GPT-5 launch, August 2025. GPT-5 is a unified system with a smart and fast model that answers most questions, a deeper reasoning model for harder problems, and a real-time router that quickly decides which model to use based on conversation type, complexity, tool needs, and explicit intent (for example, if you say 'think hard about this' in the prompt). The router is continuously trained on real signals, including when users switch models, preference rates for responses, and measured correctness, improving over time. Once usage limits are reached, a mini version of each model handles remaining queries. This system card focuses primarily on gpt-5-thinking and gpt-5-main, while evaluations for other models are available in the appendix. The GPT-5 system not only outperforms previous models on benchmarks and answers questions more quickly, but -- more importantly -- is more useful for real-world queries. We've made significant advances in reducing hallucinations, improving instruction following, and minimizing sycophancy, and have leveled up GPT-5's performance in three of ChatGPT's most common uses: writing, coding, and health. All of the GPT-5 models additionally feature safe-completions, our latest approach to safety training to prevent disallowed content. Similarly to ChatGPT agent, we have decided to treat gpt-5-thinking as High capability in the Biological and Chemical domain under our Preparedness Framework, activating the associated safeguards. While we do not have definitive evidence that this model could meaningfully help a novice to create severe biological harm -- our defined threshold for High capability -- we have chosen to take a precautionary approach.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 43 baseline 23 method 7 dataset 3 other 3

citation-polarity summary

background 40 baseline 23 use method 7 unclear 5 use dataset 3 support 1

claims ledger

abstract This is the system card published alongside the OpenAI GPT-5 launch, August 2025. GPT-5 is a unified system with a smart and fast model that answers most questions, a deeper reasoning model for harder problems, and a real-time router that quickly decides which model to use based on conversation type, complexity, tool needs, and explicit intent (for example, if you say 'think hard about this' in the prompt). The router is continuously trained on real signals, including when users switch models, preference rates for responses, and measured correctness, improving over time. Once usage limits ar

co-cited works

representative citing papers

AI-Assisted Peer Review at Scale: The AAAI-26 AI Review Pilot

cs.AI · 2026-04-15 · conditional · novelty 9.0

AI reviews for all 22,977 AAAI-26 papers were preferred by authors and PC members over human reviews on accuracy and suggestions and outperformed baselines at spotting weaknesses.

Can LLMs Write Correct TLA+ Specifications? Evaluating Natural-Language-to-TLA+ Generation

cs.AI · 2026-06-04 · accept · novelty 8.0

Across 30 LLMs and 205 TLA+ tasks, syntactic correctness reaches at most 26.6% and semantic correctness 8.6%, with all successes limited to progressive prompting and no advantage from larger models.

RobotValues: Evaluating Household Robots When Human Values Conflict

cs.RO · 2026-06-02 · unverdicted · novelty 8.0

RobotValues is a benchmark of 10K value-conflict scenarios that reveals VLMs default to safety and accommodation while failing to follow instructions to prioritize other values 80% of the time.

Where to Look: Can Foundation Models Reach a Target Viewpoint Through Active Exploration?

cs.CV · 2026-05-31 · accept · novelty 8.0

Introduces the TVR active viewpoint-matching task and TVRBench indoor simulation benchmark, where foundation models start at low single-digit success rates but reach 51.4% after visual-action SFT and multi-turn GRPO post-training.

AMNESIA: A Large Scale Medical Unlearning Benchmark Suite with Disease-Informed Analysis

cs.LG · 2026-05-28 · unverdicted · novelty 8.0

AMNESIA is a benchmark suite of 70,560 medical QA pairs that evaluates unlearning methods and shows that patient-level unlearning erodes disease-shared knowledge.

FlowCompile: An Optimizing Compiler for Structured LLM Workflows

cs.CL · 2026-05-13 · unverdicted · novelty 8.0

FlowCompile performs compile-time design space exploration on structured LLM workflows to produce reusable high-quality configuration sets that outperform routing baselines with up to 6.4x speedup.

Knowledge Poisoning Attacks on Medical Multi-Modal Retrieval-Augmented Generation

cs.CR · 2026-05-11 · unverdicted · novelty 8.0

M³Att poisons medical multimodal RAG by pairing covert textual misinformation with query-agnostic visual perturbations that increase retrieval of the bad content, causing LLMs to generate clinically plausible but incorrect responses.

Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs

cs.CL · 2026-05-09 · unverdicted · novelty 8.0 · 2 refs

Soohak is a 439-problem mathematician-curated benchmark where frontier LLMs reach at most 30.4% on research math challenges and no model exceeds 50% on refusal for ill-posed problems.

TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

cs.CV · 2026-05-08 · unverdicted · novelty 8.0

TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations

cs.CV · 2026-04-20 · unverdicted · novelty 8.0

EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.

When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models

cs.CV · 2026-04-19 · unverdicted · novelty 8.0

VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-specific experts and adaptive routing.

OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation

cs.CL · 2026-04-13 · unverdicted · novelty 8.0

OccuBench is a new benchmark for AI agents on real-world occupational tasks via LLM-driven simulators, showing no model dominates all industries, implicit faults are hardest, and larger models with more reasoning perform better.

MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark

cs.CV · 2026-04-12 · unverdicted · novelty 8.0 · 2 refs

MMRareBench provides 1,756 QA pairs and 7,958 images from PMC rare-disease cases to evaluate 23 MLLMs, revealing low treatment-planning scores and medical models underperforming general models on multi-image tasks due to capacity dilution.

LongVQUBench: Benchmarking Long-Term Video Quality Understanding of Vision-Language Models

cs.CV · 2026-07-01 · unverdicted · novelty 7.0

LongVQUBench introduces a hierarchical benchmark with local, cross-event, and global quality understanding tasks plus needle distortion QA to measure LVLMs' long-term video quality reasoning.

MSQA: A Natively Sourced Multilingual and Multicultural SimpleQA Benchmark

cs.CL · 2026-07-01 · unverdicted · novelty 7.0

MSQA benchmark shows LLMs exhibit cultural degradation and a locality effect where competence tracks pre-training exposure more than reasoning, and common inference-time fixes do not resolve it.

Think While You Map: Asynchronous Vision-Language Agents for Incremental 3D Scene Graphs

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

An asynchronous architecture decouples incremental voxel-based mapping from VLM-based semantic enrichment to produce queryable open-vocabulary 3D scene graphs that match or exceed prior methods on segmentation and grounding benchmarks.

OmniCoT: A Benchmark for Global and Multi-Step Panoramic Reasoning

cs.CV · 2026-06-29 · unverdicted · novelty 7.0

OmniCoT is a new panoramic reasoning benchmark with 6.7K eval, 1K real, and 14.3K training examples plus a two-stage SFT+GRPO training method to enforce global 360-degree consistency.

CORTEX: High-Quality Cross-Domain Organization of Web-Scale Corpora through Ontological Corpus Graph

cs.CL · 2026-06-29 · unverdicted · novelty 7.0

Cortex uses an Ontological Corpus Graph to structure web-scale corpora, creating a refined 24.14B-token corpus and a new benchmark validated on eight LLMs.

MuseBench: Benchmarking Intent-Level Audiovisual Arts Understanding in MLLMs

cs.CV · 2026-06-29 · unverdicted · novelty 7.0

MuseBench shows state-of-the-art MLLMs achieve only 48.29% accuracy on intent-level audiovisual arts understanding versus 87.18% for human experts.

OP3DSG: Open-Vocabulary Part-Aware 3D Scene Graph Generation for Real-World Environments

cs.CV · 2026-06-29 · unverdicted · novelty 7.0

OP3DSG generates unified part-aware open-vocabulary 3D scene graphs via knowledge-guided detection, 3D fusion, and LLM-refined prior graphs, with a new UniGraph3D benchmark showing SOTA results for robotics tasks.

Metadata, Structure, or Strategy? A Decomposition of RAG Context Enrichment

cs.IR · 2026-06-28 · unverdicted · novelty 7.0

Controlled experiments across six benchmarks and four models show RAG context enrichment with metadata, structure, or strategies mostly lowers accuracy, with model-context alignment as the determining factor.

An AI agent for treatment reasoning over a biomedical tool universe

cs.AI · 2026-06-27 · unverdicted · novelty 7.0

ATHENA-R1 is an RL-trained agent using 212 biomedical tools that achieves 94.7% accuracy on drug reasoning and 82.9% on treatment reasoning tasks, outperforming GPT-5 by 17.8 and 10.7 points respectively.

RoboGaze: Evaluating Robot World Models via Structured Vision-Language Analysis

cs.RO · 2026-06-22 · unverdicted · novelty 7.0

RoboGaze presents a structured multi-agent VLM pipeline and robotics-specific error taxonomy that improves video evaluation metrics by up to 43 F1 points over zero-shot baselines on a 382-clip dataset.

ComAct: Reframing Professional Software Manipulation via COM-as-Action Paradigm

cs.SE · 2026-06-11 · unverdicted · novelty 7.0

Proposes COM-as-Action paradigm for deterministic software manipulation, introduces ComCADBench benchmark and ComActor agent that achieves SOTA performance over GUI baselines.

citing papers explorer

Showing 50 of 83 citing papers after filters.

AI-Assisted Peer Review at Scale: The AAAI-26 AI Review Pilot cs.AI · 2026-04-15 · conditional · none · ref 42 · internal anchor
AI reviews for all 22,977 AAAI-26 papers were preferred by authors and PC members over human reviews on accuracy and suggestions and outperformed baselines at spotting weaknesses.
Can LLMs Write Correct TLA+ Specifications? Evaluating Natural-Language-to-TLA+ Generation cs.AI · 2026-06-04 · accept · none · ref 24 · internal anchor
Across 30 LLMs and 205 TLA+ tasks, syntactic correctness reaches at most 26.6% and semantic correctness 8.6%, with all successes limited to progressive prompting and no advantage from larger models.
An AI agent for treatment reasoning over a biomedical tool universe cs.AI · 2026-06-27 · unverdicted · none · ref 14 · internal anchor
ATHENA-R1 is an RL-trained agent using 212 biomedical tools that achieves 94.7% accuracy on drug reasoning and 82.9% on treatment reasoning tasks, outperforming GPT-5 by 17.8 and 10.7 points respectively.
DragOn: A Benchmark and Dataset for Drag-Based GUI Interactions cs.AI · 2026-06-04 · unverdicted · none · ref 13 · internal anchor
DragOn provides a new drag-grounding benchmark and training dataset for GUI agents, with evaluations suggesting potential improvements on computer-use tasks.
JobBench: Aligning Agent Work With Human Will cs.AI · 2026-05-25 · unverdicted · none · ref 24 · internal anchor
JobBench is a new benchmark with 130 occupational tasks where the best of 36 tested AI models achieves only 45.9% success.
EngiAI: A Multi-Agent Framework and Benchmark Suite for LLM-Driven Engineering Design cs.AI · 2026-05-19 · conditional · none · ref 48 · 2 links · internal anchor
EngiAI introduces a LangGraph-based multi-agent framework and a three-part benchmark suite for LLM-driven engineering design, reporting high task completion rates for proprietary models on Beams2D and Photonics2D problems.
ShopGym: An Integrated Framework for Realistic Simulation and Scalable Benchmarking of E-Commerce Web Agents cs.AI · 2026-05-15 · conditional · none · ref 14 · internal anchor
ShopGym introduces ShopArena to convert live storefronts into self-contained sandbox shops and ShopGuru to synthesize 224 benchmark tasks, with validation showing structural preservation and positive correlation of agent performance between synthetic and live shops.
Uncovering the Representation Geometry of Minimal Cores in Overcomplete Reasoning Traces cs.AI · 2026-05-14 · unverdicted · none · ref 46 · internal anchor
Language models produce overcomplete reasoning traces where on average 46% of steps can be removed while preserving the answer in 86% of cases, with necessity concentrated in the top three steps.
Senses Wide Shut: A Representation-Action Gap in Omnimodal LLMs cs.AI · 2026-05-13 · unverdicted · none · ref 1 · internal anchor
Omnimodal LLMs encode premise-perception mismatches in hidden states yet almost never reject false textual claims, exposing a representation-action gap that is modality-asymmetric and prompt-resistant.
Breaking $\textit{Winner-Takes-All}$: Cooperative Policy Optimization Improves Diverse LLM Reasoning cs.AI · 2026-05-12 · unverdicted · none · ref 30 · 2 links · internal anchor
GCPO uses team-level credit assignment via determinant volume over reward-weighted semantic embeddings to promote non-redundant correct reasoning paths, improving both accuracy and diversity in LLM training.
LLM-guided Semi-Supervised Approaches for Social Media Crisis Data Classification cs.AI · 2026-05-08 · conditional · none · ref 28 · internal anchor
LG-CoTrain, an LLM-guided co-training method, outperforms classical semi-supervised baselines for crisis tweet classification in low-resource settings with 5-25 labeled examples per class.
Pro$^2$Assist: Continuous Step-Aware Proactive Assistance with Multimodal Egocentric Perception for Long-Horizon Procedural Tasks cs.AI · 2026-05-05 · unverdicted · none · ref 61 · internal anchor
Pro²Assist uses multimodal egocentric perception from AR glasses to track fine-grained progress in long-horizon procedural tasks and deliver timely proactive assistance, outperforming baselines by over 21% in action understanding and up to 2.29x in timing accuracy.
EO-Gym: A Multimodal, Interactive Environment for Earth Observation Agents cs.AI · 2026-05-02 · unverdicted · none · ref 56 · internal anchor
EO-Gym supplies an executable multimodal environment and 9k-trajectory benchmark that turns Earth Observation into a tool-using, multi-step reasoning task, revealing that current VLMs struggle on temporal and cross-sensor workflows while fine-tuning lifts Pass@3 from 0.49 to 0.74.
A systematic evaluation of vision-language models for observational astronomical reasoning tasks cs.AI · 2026-04-27 · accept · none · ref 33 · internal anchor
Vision-language models underperform specialized astronomical methods on real observational data, with accuracy improving when physical explanations are provided in prompts and when raw numerical measurements replace rendered plots.
Using large language models for embodied planning introduces systematic safety risks cs.AI · 2026-04-20 · unverdicted · none · ref 82 · internal anchor
LLM planners for robots often produce dangerous plans even when planning succeeds, with safety awareness staying flat as model scale improves planning ability.
TacticGen: Grounding Adaptable and Scalable Generation of Football Tactics cs.AI · 2026-04-20 · conditional · none · ref 45 · internal anchor
TacticGen generates realistic, adaptable football tactics via a multi-agent diffusion transformer trained on 3.3M events and 100M frames, supporting rule-, language-, or model-based guidance at inference time.
PersonalHomeBench: Evaluating Agents in Personalized Smart Homes cs.AI · 2026-04-18 · unverdicted · none · ref 4 · internal anchor
PersonalHomeBench is a new benchmark showing that AI agents suffer systematic performance drops in personalized smart homes as task complexity rises, especially in counterfactual reasoning and partial observability.
How Independent are Large Language Models? A Statistical Framework for Auditing Behavioral Entanglement and Reweighting Verifier Ensembles cs.AI · 2026-04-08 · unverdicted · none · ref 17 · internal anchor
A new auditing framework reveals widespread behavioral entanglement among LLMs and shows that reweighting ensembles based on measured independence improves verification accuracy by up to 4.5%.
MMORF: A Multi-agent Framework for Designing Multi-objective Retrosynthesis Planning Systems cs.AI · 2026-04-06 · unverdicted · none · ref 27 · internal anchor
MMORF provides a modular multi-agent framework for multi-objective retrosynthesis planning, with MASIL and RFAS systems showing strong safety, cost, and success metrics on a new 218-task benchmark.
Conv-FinRe: A Conversational and Longitudinal Benchmark for Utility-Grounded Financial Recommendation cs.AI · 2026-02-19 · unverdicted · none · ref 33 · internal anchor
Conv-FinRe is a new benchmark built from real market data and human trajectories that tests LLMs on generating utility-grounded stock rankings over fixed horizons while distinguishing rational analysis from behavioral mimicry or momentum.
Open Problems in Constitutional Preference Reconstruction cs.AI · 2026-06-29 · unverdicted · none · ref 27 · internal anchor
Empirical analysis across three datasets identifies three open problems in constitutional preference reconstruction and shows that principle refinement raises inter-executor agreement from 73% to 78%.
MER-R1: Multimodal Emotion Reasoning via Slow-Fast Thinking Synergy cs.AI · 2026-06-26 · unverdicted · none · ref 2 · internal anchor
MER-R1 uses dual-objective RL to optimize fast-thinking recall and slow-thinking precision separately in multimodal emotion recognition, with calibration to align them, yielding SOTA results on two benchmarks.
Pythagoras-Prover: Advancing Efficient Formal Proving via Augmented Lean Formalisation cs.AI · 2026-06-10 · unverdicted · none · ref 8 · internal anchor
Pythagoras-Prover family achieves 93.0% on MiniF2F-Test with a 32B model and has its 4B version surpass a 671B prior model at pass@32, using ALF data augmentation and curriculum training.
Benchmark Everything Everywhere All at Once cs.AI · 2026-06-04 · unverdicted · none · ref 36 · internal anchor
Benchmark Agent is an autonomous agentic system that constructs benchmarks for LLMs and MLLMs via query analysis, subtask design, annotation and quality control, yielding 15 benchmarks with minimal human input.
Beyond Similarity: Trustworthy Memory Search for Personal AI Agents cs.AI · 2026-06-04 · unverdicted · none · ref 29 · internal anchor
MemGate is a 9M-parameter neural gate inserted between vector memory and LLM that converts similarity search into task-conditioned admission, reducing memory-induced threats across agent frameworks while preserving utility.
SCI-PRM: A Tool Aware Process Reward Model for Scientific Reasoning Verification cs.AI · 2026-06-03 · unverdicted · none · ref 1 · internal anchor
Sci-PRM is a tool-aware process reward model trained on the SCIPRM70K dataset to provide fine-grained supervision for scientific reasoning and shown to boost foundation models via Best-of-N selection and RL.
Reasoning4Sciences: Bridging Reasoning Language Models to All Scientific Branches cs.AI · 2026-05-31 · unverdicted · none · ref 255 · 2 links · internal anchor
A survey of RLM use in 28 disciplines reveals uneven adoption and introduces a maturity assessment framework showing larger gaps when limited to public resources.
SkillRevise: Improving LLM-Authored Agent Skills via Trace-Conditioned Skill Revision cs.AI · 2026-05-31 · unverdicted · none · ref 17 · internal anchor
SkillRevise iteratively refines initial LLM-generated agent skills using execution traces to diagnose defects and apply repairs, raising success rates from 36.05% to 61.63% on SkillsBench across three benchmarks and five LLMs.
EHRBench: An Automated and Reliable EHR-based Benchmark for Clinical Decision Making with LLMs cs.AI · 2026-05-28 · unverdicted · none · ref 84 · internal anchor
EHRBench uses an EHR-LLM-KB pipeline to automatically create 960,067 reliable QA items spanning diagnosis, treatment, and prognosis for large-scale LLM evaluation in clinical decision making.
When Should Models Change Their Minds? Contextual Belief Management in Large Language Models cs.AI · 2026-05-28 · unverdicted · none · ref 31 · internal anchor
Introduces BeliefTrack benchmark diagnosing three CBM failures in LLMs and shows RL with belief-state rewards cuts failure rates by 70.9% while representation steering cuts them by 46.1%.
DeepSurvey: Enhancing Analytical Depth and Citation Reliability in Automated Survey Generation cs.AI · 2026-05-28 · unverdicted · none · ref 56 · internal anchor
DeepSurvey introduces an agentic system for automated survey generation that improves depth through full-text keynotes, cross-paper clustering, and code analysis, while boosting citation reliability via graph expansion, hybrid filtering, and evidence-constrained assignment, with reported gains over
Harmonizing Real-Time Constraints and Long-Horizon Reasoning: An Asynchronous Agentic Framework for Dynamic Scheduling cs.AI · 2026-05-28 · unverdicted · none · ref 22 · internal anchor
RACE-Sched is an asynchronous dual-stream agent framework combining low-latency symbolic heuristics with parallel LLM-based rule synthesis and sandbox validation for dynamic flexible job shop scheduling.
A Unified Framework for the Evaluation of LLM Agentic Capabilities cs.AI · 2026-05-27 · unverdicted · none · ref 4 · internal anchor
A unified framework standardizes LLM agent benchmarks and demonstrates through 400K rollouts that scaffold and environment choices materially alter outcomes, enabling separation of intrinsic capabilities from artifacts.
Revealing Algorithmic Deductive Circuits for Logical Reasoning cs.AI · 2026-05-27 · unverdicted · none · ref 3 · internal anchor
Causal mediation analysis on symbolic-aided CoT prompting localizes ~3% of attention heads to factual/rule retrieval for sub-tasks and higher layers to integration of graph-traversal strategies.
PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents cs.AI · 2026-05-19 · unverdicted · none · ref 38 · internal anchor
PEEK maintains a constant-sized context map via a programmable cache policy to give LLM agents persistent orientation knowledge about recurring external contexts, yielding 6-34% gains and lower cost than prior prompt-learning methods.
SimGym: A Framework for A/B Test Simulation in E-Commerce with Traffic-Grounded VLM Agents cs.AI · 2026-05-19 · unverdicted · none · ref 15 · internal anchor
SimGym is a browser-based VLM agent framework that simulates A/B test outcomes on e-commerce storefronts with 77% directional agreement on add-to-cart shifts from real buyer traffic.
Whispers in the Noise: Surrogate-Guided Concept Awakening via a Multi-Agent Framework cs.AI · 2026-05-18 · unverdicted · none · ref 16 · internal anchor
ConceptAgent is a black-box multi-agent system that awakens erased concepts in diffusion models by initializing denoising trajectories from surrogate-guided noisy states.
ALSO: Adversarial Online Strategy Optimization for Social Agents cs.AI · 2026-05-15 · unverdicted · none · ref 6 · internal anchor
ALSO frames social agent interactions as an adversarial bandit problem with a neural reward predictor to enable online strategy optimization in non-stationary multi-agent simulations.
Classifier Context Rot: Monitor Performance Degrades with Context Length cs.AI · 2026-05-12 · unverdicted · none · ref 10 · internal anchor
Frontier LLMs miss dangerous actions in long coding agent transcripts 2-30 times more often after hundreds of thousands of benign tokens.
Do Enterprise Systems Need Learned World Models? The Importance of Context to Infer Dynamics cs.AI · 2026-05-12 · unverdicted · none · ref 55 · internal anchor
In configurable enterprise systems, runtime discovery of transition dynamics from system configuration is more robust to deployment shifts than offline-trained world models.
Rethinking Supervision Granularity: Segment-Level Learning for LLM-Based Theorem Proving cs.AI · 2026-05-12 · unverdicted · none · ref 2 · internal anchor
Segment-level supervision extracts coherent proof segments to train policy models that achieve 61-66% success on miniF2F, outperforming step-level and whole-proof methods while also improving existing provers.
A Cascaded Generative Approach for e-Commerce Recommendations cs.AI · 2026-05-11 · unverdicted · none · ref 11 · 2 links · internal anchor
A cascaded generative merchandising framework with placement theme generation, constrained keyword generation, and teacher-student fine-tuning achieves a 2.7% lift in cart adds per page view over a strong baseline in online e-commerce experiments.
When to Re-Commit: Temporal Abstraction Discovery for Long-Horizon Vision-Language Reasoning cs.AI · 2026-05-11 · unverdicted · none · ref 42 · 3 links · internal anchor
Learns state-conditioned commitment depth in a 7B vision-language policy that jointly predicts actions and replan intervals, outperforming fixed-depth baselines and larger models on Sliding Puzzle and Sokoban while providing a theoretical dominance result.
TIDE-Bench: Task-Aware and Diagnostic Evaluation of Tool-Integrated Reasoning cs.AI · 2026-05-10 · unverdicted · none · ref 48 · internal anchor
TIDE-Bench is a new benchmark for tool-integrated reasoning that combines diverse tasks, multi-aspect metrics covering answer quality, process reliability, efficiency and cost, plus filtered challenging test sets.
Do Self-Evolving Agents Forget? Capability Degradation and Preservation in Lifelong LLM Agent Adaptation cs.AI · 2026-05-10 · unverdicted · none · ref 28 · internal anchor
Self-evolving LLM agents exhibit capability erosion under continual adaptation, which Capability-Preserving Evolution mitigates by raising retained simple-task performance from 41.8% to 52.8% in workflow evolution under GPT-5.1.
Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria cs.AI · 2026-05-08 · unverdicted · none · ref 33 · internal anchor
Auto-Rubric as Reward externalizes VLM preferences into structured rubrics and applies Rubric Policy Optimization to create more reliable binary rewards for multimodal generation, outperforming pairwise models on text-to-image and editing benchmarks.
Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key cs.AI · 2026-05-07 · unverdicted · none · ref 92 · 3 links · internal anchor
RL training compute for logical reasoning follows a power law with horizon depth whose exponent rises with logical expressiveness, yielding better downstream transfer when models train on richer logics.
Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges cs.AI · 2026-05-07 · unverdicted · none · ref 40 · internal anchor
LLM safety judges flip verdicts on equivalent policy rewrites up to 9.1% of the time and cannot distinguish meaningful from meaningless changes, requiring new invariance-based reliability metrics.
LongSeeker: Elastic Context Orchestration for Long-Horizon Search Agents cs.AI · 2026-05-06 · unverdicted · none · ref 7 · internal anchor
Context-ReAct enables agents to dynamically manage context via five atomic operations, and LongSeeker fine-tuned on 10k trajectories achieves 61.5% and 62.5% on BrowseComp benchmarks, outperforming prior agents.
GR-Ben: A General Reasoning Benchmark for Evaluating Process Reward Models cs.AI · 2026-05-02 · unverdicted · none · ref 10 · internal anchor
GR-Ben is a new process-level benchmark that evaluates error detection by PRMs and LLMs in science and logic reasoning, showing weaker performance outside mathematics.

OpenAI GPT-5 System Card

hub tools

citation-role summary

citation-polarity summary

claims ledger

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer