super hub Mixed citations

OpenAI GPT-5 System Card

· 2025 · cs.CL · arXiv 2601.03267

Mixed citation behavior. Most common role is background (51%).

358 Pith papers citing it

Background 51% of classified citations

open full Pith review browse 358 citing papers arXiv PDF

abstract

This is the system card published alongside the OpenAI GPT-5 launch, August 2025. GPT-5 is a unified system with a smart and fast model that answers most questions, a deeper reasoning model for harder problems, and a real-time router that quickly decides which model to use based on conversation type, complexity, tool needs, and explicit intent (for example, if you say 'think hard about this' in the prompt). The router is continuously trained on real signals, including when users switch models, preference rates for responses, and measured correctness, improving over time. Once usage limits are reached, a mini version of each model handles remaining queries. This system card focuses primarily on gpt-5-thinking and gpt-5-main, while evaluations for other models are available in the appendix. The GPT-5 system not only outperforms previous models on benchmarks and answers questions more quickly, but -- more importantly -- is more useful for real-world queries. We've made significant advances in reducing hallucinations, improving instruction following, and minimizing sycophancy, and have leveled up GPT-5's performance in three of ChatGPT's most common uses: writing, coding, and health. All of the GPT-5 models additionally feature safe-completions, our latest approach to safety training to prevent disallowed content. Similarly to ChatGPT agent, we have decided to treat gpt-5-thinking as High capability in the Biological and Chemical domain under our Preparedness Framework, activating the associated safeguards. While we do not have definitive evidence that this model could meaningfully help a novice to create severe biological harm -- our defined threshold for High capability -- we have chosen to take a precautionary approach.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 43 baseline 23 method 7 dataset 3 other 3

citation-polarity summary

background 40 baseline 23 use method 7 unclear 5 use dataset 3 support 1

claims ledger

abstract This is the system card published alongside the OpenAI GPT-5 launch, August 2025. GPT-5 is a unified system with a smart and fast model that answers most questions, a deeper reasoning model for harder problems, and a real-time router that quickly decides which model to use based on conversation type, complexity, tool needs, and explicit intent (for example, if you say 'think hard about this' in the prompt). The router is continuously trained on real signals, including when users switch models, preference rates for responses, and measured correctness, improving over time. Once usage limits ar

co-cited works

representative citing papers

AI-Assisted Peer Review at Scale: The AAAI-26 AI Review Pilot

cs.AI · 2026-04-15 · conditional · novelty 9.0

AI reviews for all 22,977 AAAI-26 papers were preferred by authors and PC members over human reviews on accuracy and suggestions and outperformed baselines at spotting weaknesses.

Where to Look: Can Foundation Models Reach a Target Viewpoint Through Active Exploration?

cs.CV · 2026-05-31 · accept · novelty 8.0

Introduces the TVR active viewpoint-matching task and TVRBench indoor simulation benchmark, where foundation models start at low single-digit success rates but reach 51.4% after visual-action SFT and multi-turn GRPO post-training.

AMNESIA: A Large Scale Medical Unlearning Benchmark Suite with Disease-Informed Analysis

cs.LG · 2026-05-28 · unverdicted · novelty 8.0

AMNESIA is a benchmark suite of 70,560 medical QA pairs that evaluates unlearning methods and shows that patient-level unlearning erodes disease-shared knowledge.

FlowCompile: An Optimizing Compiler for Structured LLM Workflows

cs.CL · 2026-05-13 · unverdicted · novelty 8.0

FlowCompile performs compile-time design space exploration on structured LLM workflows to produce reusable high-quality configuration sets that outperform routing baselines with up to 6.4x speedup.

Knowledge Poisoning Attacks on Medical Multi-Modal Retrieval-Augmented Generation

cs.CR · 2026-05-11 · unverdicted · novelty 8.0

M³Att poisons medical multimodal RAG by pairing covert textual misinformation with query-agnostic visual perturbations that increase retrieval of the bad content, causing LLMs to generate clinically plausible but incorrect responses.

Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs

cs.CL · 2026-05-09 · unverdicted · novelty 8.0 · 2 refs

Soohak is a 439-problem mathematician-curated benchmark where frontier LLMs reach at most 30.4% on research math challenges and no model exceeds 50% on refusal for ill-posed problems.

TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

cs.CV · 2026-05-08 · unverdicted · novelty 8.0

TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations

cs.CV · 2026-04-20 · unverdicted · novelty 8.0

EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.

When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models

cs.CV · 2026-04-19 · unverdicted · novelty 8.0

VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-specific experts and adaptive routing.

OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation

cs.CL · 2026-04-13 · unverdicted · novelty 8.0

OccuBench is a new benchmark for AI agents on real-world occupational tasks via LLM-driven simulators, showing no model dominates all industries, implicit faults are hardest, and larger models with more reasoning perform better.

MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark

cs.CV · 2026-04-12 · unverdicted · novelty 8.0 · 2 refs

MMRareBench provides 1,756 QA pairs and 7,958 images from PMC rare-disease cases to evaluate 23 MLLMs, revealing low treatment-planning scores and medical models underperforming general models on multi-image tasks due to capacity dilution.

DragOn: A Benchmark and Dataset for Drag-Based GUI Interactions

cs.AI · 2026-06-04 · unverdicted · novelty 7.0

DragOn provides a new drag-grounding benchmark and training dataset for GUI agents, with evaluations suggesting potential improvements on computer-use tasks.

OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification

cs.LG · 2026-05-31 · unverdicted · novelty 7.0

OmniOPD replaces token-level logit matching in on-policy distillation with Monte Carlo chunk-level semantic verification and a peak-entropy scheduler.

MM-Snowball: Evaluating and Mitigating Hallucination Snowballing in Multimodal Multi-Turn Dialogue

cs.CV · 2026-05-30 · unverdicted · novelty 7.0

MM-Snowball benchmark diagnoses hallucination snowballing in multi-turn MLLM dialogues; CAVR mitigates it via dual visual rectification at representation and logit levels.

Every Act Has Its Price: Compressed Moral Composition in Frontier LLMs

cs.CL · 2026-05-29 · unverdicted · novelty 7.0

Moral Trolley Arena shows frontier LLMs produce composite moral preferences that are compressed rather than additive functions of calibrated component act strengths across Moral Foundations Theory.

Seeing Isn't Knowing: Do VLMs Know When Not to Answer Spatial Questions (and Why)?

cs.CV · 2026-05-28 · unverdicted · novelty 7.0

Frontier VLMs overconfidently answer spatial questions under occlusion (~30% accuracy) and perspective ambiguity (<10% accuracy) instead of abstaining, and often fail to select helpful additional views.

Why Far Looks Up: Probing Spatial Representation in Vision-Language Models

cs.CV · 2026-05-28 · conditional · novelty 7.0

VLMs exhibit consistent vertical-distance entanglement in embeddings from perspective bias in natural images, producing accuracy gaps that a new synthetic benchmark SpatialTunnel exposes as model-intrinsic.

EvoRepair: Enhancing Vulnerability Repair Agents Through Experience-Based Self-Evolution

cs.SE · 2026-05-28 · unverdicted · novelty 7.0

EvoRepair is the first experience-based self-evolving agent framework for automated vulnerability repair, reporting 90.46% overall success on PATCHEVAL and SEC-bench benchmarks.

CardioLens: Revealing the Clinical Reality Gap of MLLMs via Multi-Sequence Cardiac MRI Evaluations

cs.CV · 2026-05-28 · unverdicted · novelty 7.0

CardioLens is a leakage-resistant CMR testbed of 473k slices and 13k QA pairs showing current MLLMs exhibit a large clinical reality gap with category-collapse failures on real workflows.

Can LLMs Use Linguistic Uncertainty Markers to Reliably Reflect Intrinsic Confidence?

cs.CL · 2026-05-27 · unverdicted · novelty 7.0

LLMs struggle to associate epistemic markers with stable internal confidence levels across distributions, even under model-centric interpretations, while maintaining somewhat consistent marker rankings.

Beyond One Path: Evaluating and Enhancing Divergent Thinking in Interactive LLM Agents

cs.CL · 2026-05-27 · unverdicted · novelty 7.0

Introduces MUTATE benchmark for path-level and action-level divergent thinking in LLM agents and ReDNA method that decouples divergent generation from convergent selection to improve performance.

Black-box Membership Inference Attacks on the Pre-training Data of Image-generation Models

cs.CV · 2026-05-26 · unverdicted · novelty 7.0

SD-MIA is a black-box membership inference attack that detects pre-training data in diffusion models via cross-modal perturbations on images and textual instructions.

Towards Error-Free EHRs: Reasoning-Intensive Consistency Verification Between Clinical Notes and Structured Tables in Electronic Health Records

cs.CL · 2026-05-26 · unverdicted · novelty 7.0

Introduces EHR-ReasonCon benchmark with expert annotations and EHR-Inspector LLM framework for reasoning-intensive verification of consistency between clinical notes and structured tables in EHRs.

JobBench: Aligning Agent Work With Human Will

cs.AI · 2026-05-25 · unverdicted · novelty 7.0

JobBench is a new benchmark with 130 occupational tasks where the best of 36 tested AI models achieves only 45.9% success.

citing papers explorer

Showing 50 of 358 citing papers.

Log-Likelihood, Simpson's Paradox, and the Detection of Machine-Generated Text cs.CL · 2026-05-07 · unverdicted · none · ref 9 · internal anchor
Local calibration of likelihood ratios conditioned on hidden-space position corrects Simpson's paradox in token averaging and substantially improves machine-text detection across baselines.
Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges cs.AI · 2026-05-07 · unverdicted · none · ref 40 · internal anchor
LLM safety judges flip verdicts on equivalent policy rewrites up to 9.1% of the time and cannot distinguish meaningful from meaningless changes, requiring new invariance-based reliability metrics.
LongSeeker: Elastic Context Orchestration for Long-Horizon Search Agents cs.AI · 2026-05-06 · unverdicted · none · ref 7 · internal anchor
Context-ReAct enables agents to dynamically manage context via five atomic operations, and LongSeeker fine-tuned on 10k trajectories achieves 61.5% and 62.5% on BrowseComp benchmarks, outperforming prior agents.
DiffCap-Bench: A Comprehensive, Challenging, Robust Benchmark for Image Difference Captioning cs.CV · 2026-05-06 · unverdicted · none · ref 29 · internal anchor
DiffCap-Bench supplies a diverse IDC benchmark with ten categories and LLM judging grounded in human difference lists to evaluate MLLMs more robustly than prior lexical metrics.
Perturbation is All You Need for Extrapolating Language Models stat.ML · 2026-05-05 · unverdicted · none · ref 22 · internal anchor
Perturbing prefixes to semantic neighbors during training creates a hierarchical noise model that improves language model predictions on token sequences outside the training corpus support.
Chart-FR1: Visual Focus-Driven Fine-Grained Reasoning on Dense Charts cs.CV · 2026-05-03 · unverdicted · none · ref 33 · internal anchor
Chart-FR1 uses Focus-CoT for linking reasoning to visual cues and Focus-GRPO reinforcement learning with efficiency rewards to outperform prior MLLMs on dense chart reasoning tasks.
AI Alignment Amplifies the Role of Race, Gender, and Disability in Hiring Decisions cs.CY · 2026-05-02 · unverdicted · none · ref 38 · internal anchor
Post-training alignment amplifies hiring advantages for female and Black candidates by over 300 percent and disadvantages for disabled candidates by 171 percent, reversing some human discrimination patterns.
GR-Ben: A General Reasoning Benchmark for Evaluating Process Reward Models cs.AI · 2026-05-02 · unverdicted · none · ref 10 · internal anchor
GR-Ben is a new process-level benchmark that evaluates error detection by PRMs and LLMs in science and logic reasoning, showing weaker performance outside mathematics.
MUDY: Multi-Granular Dynamic Candidate Contextualization for Unsupervised Keyphrase Extraction cs.IR · 2026-05-01 · unverdicted · none · ref 27 · internal anchor
MUDY improves unsupervised keyphrase extraction by combining prompt-based scoring with candidate-aware weighting and self-attention-based multi-granular scoring to capture both local and global contextual salience, outperforming baselines on four datasets.
MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction cs.CL · 2026-04-30 · unverdicted · none · ref 87 · internal anchor
MiniCPM-o 4.5 uses the Omni-Flow streaming framework to deliver real-time full-duplex omni-modal interaction with proactive behavior in a 9B model that approaches Gemini 2.5 Flash performance.
DenseStep2M: A Scalable, Training-Free Pipeline for Dense Instructional Video Annotation cs.CV · 2026-04-29 · unverdicted · none · ref 62 · internal anchor
A scalable training-free pipeline using video segmentation, filtering, and off-the-shelf multimodal models creates DenseStep2M, a dataset of 100K videos and 2M detailed instructional steps that improves dense captioning, step grounding, and cross-modal retrieval.
Multiple Consistent 2D-3D Mappings for Robust Zero-Shot 3D Visual Grounding cs.CV · 2026-04-29 · unverdicted · none · ref 23 · internal anchor
MCM-VG achieves state-of-the-art zero-shot 3D visual grounding on ScanRefer and Nr3D by creating consistent 2D-3D mappings across semantic, geometric, and viewpoint dimensions using LLMs and VLMs.
MEMCoder: Multi-dimensional Evolving Memory for Private-Library-Oriented Code Generation cs.SE · 2026-04-27 · unverdicted · none · ref 19 · internal anchor
MEMCoder boosts LLM code generation for private libraries by 16.31% pass@1 via a multi-dimensional evolving memory that distills usage guidelines from execution feedback and combines them with static docs.
Spore: Efficient and Training-Free Privacy Extraction Attack on LLMs via Inference-Time Hybrid Probing cs.CR · 2026-04-26 · unverdicted · none · ref 6 · internal anchor
Spore extracts private data from LLM memory with one query in black-box mode or ranked tokens in gray-box, outperforming prior attacks while bypassing defenses.
AutoPyVerifier: Learning Compact Executable Verifiers for Large Language Model Outputs cs.CL · 2026-04-24 · unverdicted · none · ref 23 · internal anchor
AutoPyVerifier learns compact sets of executable Python verifiers from labeled LLM outputs via LLM synthesis and DAG search, improving objective prediction by up to 55 F1 points and downstream LLM accuracy by up to 17 points.
Contrastive Semantic Projection: Faithful Neuron Labeling with Contrastive Examples cs.CV · 2026-04-24 · unverdicted · none · ref 26 · internal anchor
Using contrastive examples with vision-language models and a new CLIP-based scoring method called CSP produces more faithful and granular neuron labels than prior activation-only approaches.
Mol-Debate: Multi-Agent Debate Improves Structural Reasoning in Molecular Design cs.AI · 2026-04-22 · unverdicted · none · ref 37 · internal anchor
Mol-Debate applies multi-agent debate in an iterative loop with perspective orchestration to achieve state-of-the-art text-guided molecular design, scoring 59.82% exact match on ChEBI-20 and 50.52% weighted success on S2-Bench.
Infection-Reasoner: A Compact Vision-Language Model for Wound Infection Classification with Evidence-Grounded Clinical Reasoning cs.CV · 2026-04-21 · unverdicted · none · ref 56 · internal anchor
Infection-Reasoner, a 4B VLM, reaches 86.8% accuracy on wound infection classification while producing rationales rated mostly correct by experts, via GPT-5.1 distillation followed by reinforcement learning.
Beyond Rating: A Comprehensive Evaluation and Benchmark for AI Reviews cs.CL · 2026-04-21 · unverdicted · none · ref 1 · internal anchor
Beyond Rating proposes five text-centric metrics for AI reviewers and demonstrates that aligning AI focus on paper weaknesses with human experts is required for reliable automated review scoring.
How Adversarial Environments Mislead Agentic AI? cs.AI · 2026-04-20 · unverdicted · none · ref 52 · internal anchor
Adversarial compromise of tool outputs misleads agentic AI via breadth and depth attacks, revealing that epistemic and navigational robustness are distinct and often trade off against each other.
LLM-AUG: Robust Wireless Data Augmentation with In-Context Learning in Large Language Models cs.LG · 2026-04-20 · unverdicted · none · ref 27 · internal anchor
LLM-AUG applies LLM in-context learning for embedding-space data augmentation in wireless ML, outperforming baselines and reaching near-oracle accuracy with only 15% labeled data on RadioML and IC datasets.
Characterizing Model-Native Skills cs.AI · 2026-04-19 · conditional · none · ref 61 · internal anchor
Recovering an orthogonal basis from model activations yields a model-native skill characterization that improves reasoning Pass@1 by up to 41% via targeted data selection and supports inference steering, outperforming human-characterized alternatives.
Beyond Text-Dominance: Understanding Modality Preference of Omni-modal Large Language Models cs.AI · 2026-04-18 · unverdicted · none · ref 35 · internal anchor
Omni-modal LLMs exhibit visual preference that emerges in mid-to-late layers, enabling hallucination detection without task-specific training.
Do Vision-Language Models Truly Perform Vision Reasoning? A Rigorous Study of the Modality Gap cs.CV · 2026-04-17 · unverdicted · none · ref 1 · internal anchor
VLMs primarily reason in textual space with limited reliance on visual evidence, shown by consistent performance drops when images are added to text in a controlled aligned benchmark.
Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation cs.LG · 2026-04-14 · unverdicted · none · ref 2 · 2 links · internal anchor
Lightning OPD is an offline on-policy distillation method that matches standard OPD performance at 4x efficiency by enforcing teacher consistency between SFT and distillation phases.
Narrative-Driven Paper-to-Slide Generation via ArcDeck cs.AI · 2026-04-13 · unverdicted · none · ref 40 · internal anchor
ArcDeck models paper-to-slide generation as narrative reconstruction using discourse parsing and multi-agent refinement, plus a new ArcBench benchmark, to improve flow and coherence over direct summarization.
GazeVaLM: A Multi-Observer Eye-Tracking Benchmark for Evaluating Clinical Realism in AI-Generated X-Rays cs.CV · 2026-04-13 · unverdicted · none · ref 25 · internal anchor
GazeVaLM provides 960 gaze recordings from 16 radiologists on 60 chest X-rays (half synthetic) plus LLM predictions for diagnostic accuracy and real-fake detection under matched conditions.
Record-Remix-Replay: Hierarchical GPU Kernel Optimization using Evolutionary Search cs.DC · 2026-04-13 · unverdicted · none · ref 44 · internal anchor
R^3 optimizes full scientific applications on GPUs better than tuning kernel parameters or compiler flags alone while running nearly an order of magnitude faster than modern evolutionary search methods.
When LLMs Lag Behind: Knowledge Conflicts from Evolving APIs in Code Generation cs.SE · 2026-04-10 · unverdicted · none · ref 41 · internal anchor
LLMs produce executable code only 42.55% of the time under API evolution without full documentation, improving to 66.36% with structured docs and by 11% more with reasoning strategies, yet outdated patterns persist.
UniversalVTG: A Universal and Lightweight Foundation Model for Video Temporal Grounding cs.CV · 2026-04-09 · unverdicted · none · ref 45 · internal anchor
UniversalVTG is a lightweight foundation model for video temporal grounding that achieves state-of-the-art results across five benchmarks while being over 100 times smaller than recent MLLM-based methods.
SeLaR: Selective Latent Reasoning in Large Language Models cs.CL · 2026-04-09 · unverdicted · none · ref 36 · internal anchor
SeLaR selectively applies latent soft reasoning in LLMs via entropy gating and contrastive regularization, outperforming standard CoT on five benchmarks without training.
Behavior-Aware Item Modeling via Dynamic Procedural Solution Representations for Knowledge Tracing cs.CL · 2026-04-09 · unverdicted · none · ref 8 · internal anchor
BAIM enriches knowledge tracing item representations by deriving stage-level embeddings from Polya's four problem-solving stages and routing them adaptively per learner context, yielding consistent gains over pretraining baselines on two datasets.
TSUBASA: Improving Long-Horizon Personalization via Evolving Memory and Self-Learning with Context Distillation cs.CL · 2026-04-09 · unverdicted · none · ref 57 · internal anchor
TSUBASA improves long-horizon personalization in LLMs via dynamic memory evolution for writing and context-distillation self-learning for reading, outperforming Mem0 and Memory-R1 on Qwen-3 benchmarks while reducing token use.
Linear Representations of Hierarchical Concepts in Language Models cs.CL · 2026-04-09 · unverdicted · none · ref 24 · internal anchor
Language models encode concept hierarchies as linear transformations that are domain-specific yet structurally similar across domains.
Towards Knowledgeable Deep Research: Framework and Benchmark cs.AI · 2026-04-09 · unverdicted · none · ref 29 · internal anchor
The paper introduces the KDR task, HKA multi-agent framework, and KDR-Bench to enable LLM agents to integrate structured knowledge into deep research reports, with experiments showing outperformance over prior agents.
AV-SQL: Decomposing Complex Text-to-SQL Queries with Agentic Views cs.DB · 2026-04-08 · unverdicted · none · ref 39 · internal anchor
AV-SQL uses a pipeline of LLM agents to generate intermediate CTE views that decompose complex Text-to-SQL queries, reaching 70.38% execution accuracy on Spider 2.0.
The Depth Ceiling: On the Limits of Large Language Models in Discovering Latent Planning cs.LG · 2026-04-07 · unverdicted · none · ref 23 · internal anchor
LLMs discover latent planning strategies up to five steps during training and execute them up to eight steps at test time, with larger models reaching seven under few-shot prompting, revealing a dissociation between discovery and execution.
Exclusive Unlearning cs.CL · 2026-04-07 · unverdicted · none · ref 15 · internal anchor
Exclusive Unlearning makes LLMs safe by forgetting all but retained domain knowledge, protecting against jailbreaks while preserving useful responses in areas like medicine and math.
Let Geometry GUIDE: Layer-wise Unrolling of Geometric Priors in Multimodal LLMs cs.CV · 2026-04-07 · unverdicted · none · ref 37 · internal anchor
GUIDE unrolls multi-granularity geometric priors layer-wise into early MLLM layers with gating to improve spatial reasoning and perception.
EchoAgent: Towards Reliable Echocardiography Interpretation with "Eyes","Hands" and "Minds" cs.CV · 2026-04-07 · unverdicted · none · ref 34 · internal anchor
EchoAgent is a new agentic AI system that integrates visual observation, quantitative measurement, and expert knowledge reasoning to achieve reliable echocardiography interpretation with up to 80% accuracy on CAMUS and MIMIC-EchoQA datasets.
Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding cs.CV · 2026-04-06 · unverdicted · none · ref 21 · internal anchor
Video-MME-v2 is a new benchmark that applies progressive visual-to-reasoning levels and non-linear group scoring to expose gaps in video MLLM capabilities.
An Independent Safety Evaluation of Kimi K2.5 cs.CR · 2026-04-03 · conditional · none · ref 14 · internal anchor
Kimi K2.5 matches closed models on dual-use tasks but refuses fewer CBRNE requests and shows some sabotage and self-replication tendencies.
Mitigating LLM biases toward spurious social contexts using direct preference optimization cs.AI · 2026-04-02 · unverdicted · none · ref 25 · internal anchor
Debiasing-DPO reduces bias to spurious social contexts by 84% and improves predictive accuracy by 52% on average for LLMs evaluating U.S. classroom transcripts.
Memory in the LLM Era: Modular Architectures and Strategies in a Unified Framework cs.CL · 2026-04-02 · unverdicted · none · ref 88 · internal anchor
A unified framework for LLM agent memory is benchmarked, with a new hybrid method outperforming state-of-the-art on standard tasks.
From Data to Theory: Autonomous Large Language Model Agents for Materials Science cs.AI · 2026-04-01 · unverdicted · none · ref 21 · internal anchor
An LLM agent autonomously selects, codes, and validates materials equations from data, recovering known laws reliably but requiring checks for new or specialized cases.
Evaluating Privilege Usage of Agents with Real-World Tools cs.CR · 2026-03-30 · unverdicted · none · ref 18 · internal anchor
GrantBox evaluates LLM agents using real-world tools and finds they remain vulnerable to sophisticated prompt injection attacks with an 84.80% average success rate.
AIRA_2: Overcoming Bottlenecks in AI Research Agents cs.AI · 2026-03-27 · conditional · none · ref 19 · internal anchor
AIRA₂ improves AI research agents via asynchronous multi-GPU workers, hidden consistent evaluation, and interactive ReAct agents, reaching 81.5-83.1% percentile rank on MLE-bench-30 and exceeding human SOTA on 6 of 20 AIRS-Bench tasks.
Beyond Content Safety: Real-Time Monitoring for Reasoning Vulnerabilities in Large Language Models cs.AI · 2026-03-26 · unverdicted · none · ref 27 · internal anchor
An external zero-shot monitor detects nine unsafe reasoning behaviors in LLMs at 87% step-level accuracy with low false positives and low latency.
Towards Safer Large Reasoning Models by Promoting Safety Decision-Making before Chain-of-Thought Generation cs.AI · 2026-03-18 · unverdicted · none · ref 1 · internal anchor
Safety degradation in large reasoning models occurs only after chain-of-thought is enabled; adding pre-CoT safety signals from a BERT classifier on safe models improves safety while preserving reasoning ability.
Decoding the Pulse of Reasoning VLMs in Multi-Image Understanding Tasks cs.CV · 2026-03-04 · unverdicted · none · ref 13 · internal anchor
PulseFocus improves multi-image reasoning in VLMs by interleaving planning and attention-gated focus blocks during chain-of-thought, achieving gains on BLINK and MuirBench.

OpenAI GPT-5 System Card

hub tools

citation-role summary

citation-polarity summary

claims ledger

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer