super hub Mixed citations

OpenAI GPT-5 System Card

· 2025 · cs.CL · arXiv 2601.03267

Mixed citation behavior. Most common role is background (51%).

489 Pith papers citing it

Background 51% of classified citations

open full Pith review browse 489 citing papers arXiv PDF

abstract

This is the system card published alongside the OpenAI GPT-5 launch, August 2025. GPT-5 is a unified system with a smart and fast model that answers most questions, a deeper reasoning model for harder problems, and a real-time router that quickly decides which model to use based on conversation type, complexity, tool needs, and explicit intent (for example, if you say 'think hard about this' in the prompt). The router is continuously trained on real signals, including when users switch models, preference rates for responses, and measured correctness, improving over time. Once usage limits are reached, a mini version of each model handles remaining queries. This system card focuses primarily on gpt-5-thinking and gpt-5-main, while evaluations for other models are available in the appendix. The GPT-5 system not only outperforms previous models on benchmarks and answers questions more quickly, but -- more importantly -- is more useful for real-world queries. We've made significant advances in reducing hallucinations, improving instruction following, and minimizing sycophancy, and have leveled up GPT-5's performance in three of ChatGPT's most common uses: writing, coding, and health. All of the GPT-5 models additionally feature safe-completions, our latest approach to safety training to prevent disallowed content. Similarly to ChatGPT agent, we have decided to treat gpt-5-thinking as High capability in the Biological and Chemical domain under our Preparedness Framework, activating the associated safeguards. While we do not have definitive evidence that this model could meaningfully help a novice to create severe biological harm -- our defined threshold for High capability -- we have chosen to take a precautionary approach.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 44 baseline 23 method 7 dataset 3 other 3

citation-polarity summary

background 41 baseline 23 use method 7 unclear 5 use dataset 3 support 1

claims ledger

abstract This is the system card published alongside the OpenAI GPT-5 launch, August 2025. GPT-5 is a unified system with a smart and fast model that answers most questions, a deeper reasoning model for harder problems, and a real-time router that quickly decides which model to use based on conversation type, complexity, tool needs, and explicit intent (for example, if you say 'think hard about this' in the prompt). The router is continuously trained on real signals, including when users switch models, preference rates for responses, and measured correctness, improving over time. Once usage limits ar

co-cited works

representative citing papers

AI-Assisted Peer Review at Scale: The AAAI-26 AI Review Pilot

cs.AI · 2026-04-15 · conditional · novelty 9.0

AI reviews for all 22,977 AAAI-26 papers were preferred by authors and PC members over human reviews on accuracy and suggestions and outperformed baselines at spotting weaknesses.

Benchmarking LLM Agents on Meta-Analysis Articles from Nature Portfolio

cs.CL · 2026-06-15 · unverdicted · novelty 8.0 · 3 refs

MetaSyn benchmark shows LLM pipelines recover at most 52.7% of ground-truth included studies due to screening failures on PI/ECO eligibility, despite 90.9% retrieval recall at K=200.

EHRNote-ChatQA: A Benchmark for Evidence-Grounded Multi-Turn Clinical Question Answering over Longitudinal Discharge Summaries

cs.CL · 2026-06-14 · unverdicted · novelty 8.0

EHRNote-ChatQA is the first benchmark for evidence-grounded multi-turn clinical QA over longitudinal discharge summaries, containing 16,072 medical-expert-verified pairs across eight categories and revealing LLM weaknesses in evidence grounding and multi-turn consistency.

Can LLMs Write Correct TLA+ Specifications? Evaluating Natural-Language-to-TLA+ Generation

cs.AI · 2026-06-04 · accept · novelty 8.0

Across 30 LLMs and 205 TLA+ tasks, syntactic correctness reaches at most 26.6% and semantic correctness 8.6%, with all successes limited to progressive prompting and no advantage from larger models.

RobotValues: Evaluating Household Robots When Human Values Conflict

cs.RO · 2026-06-02 · unverdicted · novelty 8.0

RobotValues is a benchmark of 10K value-conflict scenarios that reveals VLMs default to safety and accommodation while failing to follow instructions to prioritize other values 80% of the time.

Where to Look: Can Foundation Models Reach a Target Viewpoint Through Active Exploration?

cs.CV · 2026-05-31 · accept · novelty 8.0

Introduces the TVR active viewpoint-matching task and TVRBench indoor simulation benchmark, where foundation models start at low single-digit success rates but reach 51.4% after visual-action SFT and multi-turn GRPO post-training.

AMNESIA: A Large Scale Medical Unlearning Benchmark Suite with Disease-Informed Analysis

cs.LG · 2026-05-28 · unverdicted · novelty 8.0

AMNESIA is a benchmark suite of 70,560 medical QA pairs that evaluates unlearning methods and shows that patient-level unlearning erodes disease-shared knowledge.

FlowCompile: An Optimizing Compiler for Structured LLM Workflows

cs.CL · 2026-05-13 · unverdicted · novelty 8.0

FlowCompile performs compile-time design space exploration on structured LLM workflows to produce reusable high-quality configuration sets that outperform routing baselines with up to 6.4x speedup.

Knowledge Poisoning Attacks on Medical Multi-Modal Retrieval-Augmented Generation

cs.CR · 2026-05-11 · unverdicted · novelty 8.0

M³Att poisons medical multimodal RAG by pairing covert textual misinformation with query-agnostic visual perturbations that increase retrieval of the bad content, causing LLMs to generate clinically plausible but incorrect responses.

Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs

cs.CL · 2026-05-09 · unverdicted · novelty 8.0 · 2 refs

Soohak is a 439-problem mathematician-curated benchmark where frontier LLMs reach at most 30.4% on research math challenges and no model exceeds 50% on refusal for ill-posed problems.

TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

cs.CV · 2026-05-08 · unverdicted · novelty 8.0

TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations

cs.CV · 2026-04-20 · unverdicted · novelty 8.0

EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.

When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models

cs.CV · 2026-04-19 · unverdicted · novelty 8.0

VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-specific experts and adaptive routing.

OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation

cs.CL · 2026-04-13 · unverdicted · novelty 8.0

OccuBench is a new benchmark for AI agents on real-world occupational tasks via LLM-driven simulators, showing no model dominates all industries, implicit faults are hardest, and larger models with more reasoning perform better.

MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark

cs.CV · 2026-04-12 · unverdicted · novelty 8.0 · 2 refs

MMRareBench provides 1,756 QA pairs and 7,958 images from PMC rare-disease cases to evaluate 23 MLLMs, revealing low treatment-planning scores and medical models underperforming general models on multi-image tasks due to capacity dilution.

Seek to Segment: Active Perception for Panoramic Referring Segmentation

cs.CV · 2026-07-02 · unverdicted · novelty 7.0

Introduces APRS task and PanoSeeker agent using VLM plus EgoSphere memory for active 360° search and segmentation, outperforming baselines on a new benchmark.

DisciplineGen-1M: A Large-Scale Dataset for Multidisciplinary Visual Generation and Editing

cs.CV · 2026-07-02 · unverdicted · novelty 7.0

DisciplineGen-1M is a million-scale multidisciplinary dataset for text-to-image generation and editing, paired with a discipline-informed model that improves results on discipline-specific benchmarks.

AnyGroundBench: A Specialized-Domain Benchmark for Video Grounding in Vision-Language Models

cs.CV · 2026-07-02 · unverdicted · novelty 7.0

AnyGroundBench is a domain-adaptation benchmark for spatio-temporal video grounding across animal, industry, sports, surgery, and public security domains that finds 15 state-of-the-art VLMs fail in zero-shot and ICL settings.

LongEgoRefer: A Benchmark for Long-Form Egocentric Video Referring Expression Comprehension

cs.CV · 2026-07-02 · unverdicted · novelty 7.0

LongEgoRefer is a new benchmark of 1,498 referring expressions in 45-minute average egocentric videos that exposes the failure of existing Video REC models on sparse long-form spatio-temporal grounding.

OpenSafeIntent: Evaluating Intent-Calibrated Safe Completion Across Dual-Use Prompt Sets

cs.CL · 2026-07-02 · unverdicted · novelty 7.0

OpenSafeIntent benchmark shows models fail to calibrate safety across intent shifts in matched dual-use prompts, indicating current evaluations are insufficient.

LongVQUBench: Benchmarking Long-Term Video Quality Understanding of Vision-Language Models

cs.CV · 2026-07-01 · unverdicted · novelty 7.0

LongVQUBench introduces a hierarchical benchmark with local, cross-event, and global quality understanding tasks plus needle distortion QA to measure LVLMs' long-term video quality reasoning.

Think While You Map: Asynchronous Vision-Language Agents for Incremental 3D Scene Graphs

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

An asynchronous architecture decouples incremental voxel-based mapping from VLM-based semantic enrichment to produce queryable open-vocabulary 3D scene graphs that match or exceed prior methods on segmentation and grounding benchmarks.

OmniCoT: A Benchmark for Global and Multi-Step Panoramic Reasoning

cs.CV · 2026-06-29 · unverdicted · novelty 7.0

OmniCoT is a new panoramic reasoning benchmark with 6.7K eval, 1K real, and 14.3K training examples plus a two-stage SFT+GRPO training method to enforce global 360-degree consistency.

CORTEX: High-Quality Cross-Domain Organization of Web-Scale Corpora through Ontological Corpus Graph

cs.CL · 2026-06-29 · unverdicted · novelty 7.0

Cortex uses an Ontological Corpus Graph to structure web-scale corpora, creating a refined 24.14B-token corpus and a new benchmark validated on eight LLMs.

citing papers explorer

Showing 50 of 114 citing papers after filters.

Benchmarking LLM Agents on Meta-Analysis Articles from Nature Portfolio cs.CL · 2026-06-15 · unverdicted · none · ref 34 · 3 links · internal anchor
MetaSyn benchmark shows LLM pipelines recover at most 52.7% of ground-truth included studies due to screening failures on PI/ECO eligibility, despite 90.9% retrieval recall at K=200.
EHRNote-ChatQA: A Benchmark for Evidence-Grounded Multi-Turn Clinical Question Answering over Longitudinal Discharge Summaries cs.CL · 2026-06-14 · unverdicted · none · ref 30 · internal anchor
EHRNote-ChatQA is the first benchmark for evidence-grounded multi-turn clinical QA over longitudinal discharge summaries, containing 16,072 medical-expert-verified pairs across eight categories and revealing LLM weaknesses in evidence grounding and multi-turn consistency.
FlowCompile: An Optimizing Compiler for Structured LLM Workflows cs.CL · 2026-05-13 · unverdicted · none · ref 33 · internal anchor
FlowCompile performs compile-time design space exploration on structured LLM workflows to produce reusable high-quality configuration sets that outperform routing baselines with up to 6.4x speedup.
Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs cs.CL · 2026-05-09 · unverdicted · none · ref 26 · 2 links · internal anchor
Soohak is a 439-problem mathematician-curated benchmark where frontier LLMs reach at most 30.4% on research math challenges and no model exceeds 50% on refusal for ill-posed problems.
OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation cs.CL · 2026-04-13 · unverdicted · none · ref 14 · internal anchor
OccuBench is a new benchmark for AI agents on real-world occupational tasks via LLM-driven simulators, showing no model dominates all industries, implicit faults are hardest, and larger models with more reasoning perform better.
OpenSafeIntent: Evaluating Intent-Calibrated Safe Completion Across Dual-Use Prompt Sets cs.CL · 2026-07-02 · unverdicted · none · ref 19 · internal anchor
OpenSafeIntent benchmark shows models fail to calibrate safety across intent shifts in matched dual-use prompts, indicating current evaluations are insufficient.
CORTEX: High-Quality Cross-Domain Organization of Web-Scale Corpora through Ontological Corpus Graph cs.CL · 2026-06-29 · unverdicted · none · ref 92 · internal anchor
Cortex uses an Ontological Corpus Graph to structure web-scale corpora, creating a refined 24.14B-token corpus and a new benchmark validated on eight LLMs.
One Polluted Page Is Enough: Evaluating Web Content Pollution in Generative Recommenders cs.CL · 2026-06-11 · unverdicted · none · ref 47 · internal anchor
FORGE benchmark shows search-augmented LLMs recommend fake products at rates up to 27% from one polluted page and 73.8% from top-3 replacement across 12 models and 225 products.
FORT-Searcher: Synthesizing Shortcut-Resistant Search Tasks for Training Deep Search Agents cs.CL · 2026-06-10 · unverdicted · none · ref 32 · internal anchor
FORT synthesizes shortcut-resistant search tasks by controlling four identified shortcut risks across entity selection, graph construction, question formulation, and refinement, producing training data that yields agents with longer search trajectories and top performance among open-source models on
Where You Inject Diversity Matters: A Unified Framework for Diverse Generation cs.CL · 2026-06-09 · unverdicted · none · ref 1 · internal anchor
A new framework for diverse LLM generation via diversity source characterization and transmission scoring, with specification-level injection outperforming test-time baselines across five tasks and four models.
UrduMMLU: A Massive Multitask Benchmark for Urdu Language Understanding cs.CL · 2026-06-05 · unverdicted · none · ref 44 · internal anchor
UrduMMLU is a new native-source MCQ benchmark for Urdu that reveals top LLMs reach only ~90% accuracy with large gaps on region-specific humanities content.
CultureForest: Understanding and Evaluating Cultural Norm Grounded Reasoning in LLMs cs.CL · 2026-06-01 · unverdicted · none · ref 33 · internal anchor
CultureForest benchmark shows top LLMs degrade sharply on open-ended cultural reasoning tasks, exhibit regional disparities, and are limited more by effective use of knowledge than by lack of knowledge itself.
Benchmarking LLM-as-a-Judge for Long-Form Output Evaluation cs.CL · 2026-06-01 · unverdicted · none · ref 34 · internal anchor
Introduces LongJudgeBench benchmark showing LLM judges remain unstable for long-form output evaluation even with rubrics or references.
Every Act Has Its Price: Compressed Moral Composition in Frontier LLMs cs.CL · 2026-05-29 · unverdicted · none · ref 10 · internal anchor
Moral Trolley Arena shows frontier LLMs produce composite moral preferences that are compressed rather than additive functions of calibrated component act strengths across Moral Foundations Theory.
Can LLMs Use Linguistic Uncertainty Markers to Reliably Reflect Intrinsic Confidence? cs.CL · 2026-05-27 · unverdicted · none · ref 69 · internal anchor
LLMs struggle to associate epistemic markers with stable internal confidence levels across distributions, even under model-centric interpretations, while maintaining somewhat consistent marker rankings.
Beyond One Path: Evaluating and Enhancing Divergent Thinking in Interactive LLM Agents cs.CL · 2026-05-27 · unverdicted · none · ref 1 · internal anchor
Introduces MUTATE benchmark for path-level and action-level divergent thinking in LLM agents and ReDNA method that decouples divergent generation from convergent selection to improve performance.
Towards Error-Free EHRs: Reasoning-Intensive Consistency Verification Between Clinical Notes and Structured Tables in Electronic Health Records cs.CL · 2026-05-26 · unverdicted · none · ref 30 · internal anchor
Introduces EHR-ReasonCon benchmark with expert annotations and EHR-Inspector LLM framework for reasoning-intensive verification of consistency between clinical notes and structured tables in EHRs.
Fine-grained Claim-level RAG Benchmark for Law cs.CL · 2026-05-20 · unverdicted · none · ref 46 · 6 links · internal anchor
ClaimRAG-LAW is a French-English legal RAG benchmark with claim-level granularity for experts and non-experts that reveals limitations in current retrieval and generation performance.
HalluScore: Large Language Model Hallucination Question Answering Benchmark cs.CL · 2026-05-16 · unverdicted · none · ref 61 · internal anchor
HalluScore is a curated Arabic QA dataset with 827 questions, ground-truth evidence, and human annotations used to measure hallucination rates across 17 LLMs.
LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues cs.CL · 2026-05-12 · unverdicted · none · ref 85 · internal anchor
LongMemEval-V2 is a new benchmark where AgentRunbook-C reaches 72.5% accuracy on long-term agent memory tasks, beating RAG baselines at 48.5% and basic coding agents at 69.3%.
Learning Agentic Policy from Action Guidance cs.CL · 2026-05-12 · unverdicted · none · ref 51 · internal anchor
ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.
PlantMarkerBench: A Multi-Species Benchmark for Evidence-Grounded Plant Marker Reasoning cs.CL · 2026-05-11 · unverdicted · none · ref 52 · 2 links · internal anchor
PlantMarkerBench supplies 5,550 literature sentences annotated for plant marker gene evidence validity and type across Arabidopsis, maize, rice and tomato, showing frontier LLMs handle direct expression evidence but struggle with functional, indirect and weak-support cases.
Segmenting Human-LLM Co-authored Text via Change Point Detection cs.CL · 2026-05-05 · unverdicted · none · ref 11 · internal anchor
Adapts change point detection to segment human-LLM co-authored text using weighted and generalized algorithms with minimax optimality and strong empirical results against baselines.
OralMLLM-Bench: Evaluating Cognitive Capabilities of Multimodal Large Language Models in Dental Practice cs.CL · 2026-05-02 · unverdicted · none · ref 40 · internal anchor
OralMLLM-Bench reveals performance gaps between multimodal large language models and clinicians on cognitive tasks for dental radiographic analysis across periapical, panoramic, and cephalometric images.
ReLay: Personalized LLM-Generated Plain-Language Summaries for Better Understanding, but at What Cost? cs.CL · 2026-05-01 · unverdicted · none · ref 29 · internal anchor
Personalized LLM-generated plain language summaries improve lay readers' comprehension and quality ratings but increase risks of reinforcing biases and introducing hallucinations compared to static expert summaries.
AppTek Call-Center Dialogues: A Multi-Accent Long-Form Benchmark for English ASR cs.CL · 2026-04-30 · unverdicted · none · ref 26 · internal anchor
A new multi-accent long-form call-center dialogue dataset for English ASR evaluation shows substantial performance variation across accents and segmentation methods.
Evaluating Temporal Consistency in Multi-Turn Language Models cs.CL · 2026-04-24 · unverdicted · none · ref 33 · internal anchor
Language models frequently violate temporal scope stability in multi-turn dialogues by drifting toward present-day assumptions even when they possess the correct facts.
Psychological Steering of Large Language Models cs.CL · 2026-04-15 · unverdicted · none · ref 56 · internal anchor
Mean-difference residual stream injections outperform personality prompting for OCEAN trait steering in most LLMs, with hybrids performing best and showing approximate linearity but non-human trait covariances.
MedConceal: A Benchmark for Clinical Hidden-Concern Reasoning Under Partial Observability cs.CL · 2026-04-09 · unverdicted · none · ref 6 · internal anchor
MedConceal provides 300 cases and a simulator that withholds hidden concerns to evaluate confirmation and intervention in medical dialogue, finding frontier models vary on surfacing concerns while humans outperform on guiding patients to care plans.
EpiBench: Benchmarking Multi-turn Research Workflows for Multimodal Agents cs.CL · 2026-04-07 · unverdicted · none · ref 24 · internal anchor
EpiBench is a new episodic multi-turn multimodal benchmark where even leading AI agents score only 29.23% on hard tasks requiring cross-paper evidence integration from figures and tables.
DeonticBench: A Benchmark for Reasoning over Rules cs.CL · 2026-04-06 · unverdicted · none · ref 29 · internal anchor
DEONTICBENCH is a new benchmark of 6,232 deontic reasoning tasks from U.S. legal domains where frontier LLMs reach only ~45% accuracy and symbolic Prolog assistance plus RL training still fail to solve tasks reliably.
MemGround: Long-Term Memory Evaluation Kit for Large Language Models in Gamified Scenarios cs.CL · 2026-03-23 · unverdicted · none · ref 2 · internal anchor
MemGround is a new benchmark that evaluates LLMs' long-term memory through gamified tasks assessing surface state, temporal association, and reasoning memory.
OmniTrace: A Unified Framework for Generation-Time Attribution in Omni-Modal LLMs cs.CL · 2026-03-20 · unverdicted · none · ref 1 · internal anchor
OmniTrace converts token-level signals into span-level cross-modal attributions for open-ended generation in omni-modal LLMs via generation-time tracing.
Geometry-Aware Decoding with Wasserstein-Regularized Truncation and Mass Penalties for Large Language Models cs.CL · 2026-02-10 · unverdicted · none · ref 18 · internal anchor
Top-W applies Wasserstein-regularized truncation on token-embedding geometry to create a closed-form optimal crop for LLM sampling that outperforms prior methods by up to 33.7% on GSM8K, GPQA, AlpacaEval, and MT-Bench.
Know Your Source: A Public Knowledge Store for Media Background Checks cs.CL · 2026-07-02 · unverdicted · none · ref 104 · internal anchor
MEDIAREF is a publicly available knowledge store of documents from 200 media sources that enables low-cost, reproducible evaluation of media background check generation for fact-checking systems.
Reinforcement Learning with Metacognitive Feedback Elicits Faithful Uncertainty Expression in LLMs cs.CL · 2026-06-30 · unverdicted · none · ref 87 · internal anchor
RLMF uses quality of model self-judgments to refine RL rankings and select training data, achieving SOTA faithful calibration while preserving accuracy and outperforming standard RL by up to 63%.
TRACE: Temporal Relationship-Aware Conversational Entrainment Detection in Dyadic Speech cs.CL · 2026-06-29 · unverdicted · none · ref 16 · internal anchor
Introduces DyadEE dataset and TRACE window-level framework using sequences of acoustic embeddings for emotional entrainment detection, reporting 97.01% accuracy when context and relationship information are included.
Cognitive Episodes in LLM Reasoning Traces Enable Interpretable Human Item Difficulty Prediction cs.CL · 2026-06-26 · unverdicted · none · ref 25 · internal anchor
Epi2Diff extracts cognitive episode sequences from LRM reasoning traces and combines them with semantic features to predict human item difficulty, outperforming baselines on four educational datasets.
AIPatient Arena: EHR-grounded evaluation of large language models in end-to-end clinical consultation workflows cs.CL · 2026-06-16 · unverdicted · none · ref 17 · internal anchor
AIPatient Arena is an EHR-grounded multi-turn evaluation framework for LLMs in clinical consultations that scores models on eight competence dimensions across 437+ patients, finding strengths in questioning and ethics but weaknesses in diagnostic reasoning and ambiguity handling.
SkillCAT: Contrastive Assessment and Topology-Aware Skill Self-Evolution for LLM Agents cs.CL · 2026-06-11 · unverdicted · none · ref 13 · internal anchor
SkillCAT proposes a three-stage training-free pipeline for LLM agent skill self-evolution using contrastive causal extraction, assessment-augmented merging, and topology-aware execution, reporting up to 40.40% average score gains on agent benchmarks.
When is Your LLM Steerable? cs.CL · 2026-06-10 · unverdicted · none · ref 15 · internal anchor
Early hidden state features from the first few tokens allow a GBDT classifier to predict activation steering success, under-steering, or over-steering with 0.7 macro-F1 on unseen concepts.
Do Vision-Language Models See or Guess? Measuring and Reducing Textual-Prior Reliance with a Phrasing-Controlled Benchmark cs.CL · 2026-06-09 · conditional · none · ref 5 · internal anchor
A 540-image benchmark with four phrasing variants per image reveals VLMs degrade when text leakage is minimized, with no-image ablations confirming reliance and GRPO post-training yielding gains that transfer to held-out data.
PACT: Learning Diverse Diagnostic Strategies via Privileged Synthesis and Branch Consensus cs.CL · 2026-06-08 · unverdicted · none · ref 29 · internal anchor
PACT combines privileged multi-paradigm dialogue synthesis from EMRs with consensus aggregation of paradigm-specific LoRA branches to reach SOTA on a new Chinese interactive medical diagnosis benchmark.
Arabic Sentence Segmentation Across Genres and Punctuation Conditions cs.CL · 2026-06-06 · unverdicted · none · ref 46 · internal anchor
AraSEG is a genre-diverse Arabic sentence segmentation corpus showing lightweight encoders and dependency parsers outperform LLMs under challenging punctuation while improving downstream parsing.
The Masked Advantage: Uncovering Local-Language Access to Cultural Knowledge in LLMs cs.CL · 2026-06-05 · unverdicted · none · ref 48 · internal anchor
Using a 1PL IRT model on real cultural questions across 13 locales, the study identifies a local-language knowledge-access advantage masked by lower proficiency in raw accuracy.
ComplexityMT: Benchmarking the Interaction Between Text Complexity and Machine Translation cs.CL · 2026-06-03 · unverdicted · none · ref 36 · internal anchor
ComplexityMT benchmark finds higher CEFR levels increase translation difficulty and MT systems often shift target CEFR levels versus source texts in most of six languages tested.
CRAFT: Cost-aware Refinement And Front-aware Tuning of Prompts cs.CL · 2026-06-03 · unverdicted · none · ref 214 · internal anchor
CRAFT is a Pareto-front prompt optimizer that allocates scarce LLM validation calls to candidates near the current front using accuracy- and cost-oriented generators plus NSGA-II retention.
Investigating and Alleviating Harm Amplification in LLM Interactions cs.CL · 2026-06-01 · unverdicted · none · ref 34 · internal anchor
Presents HarmAmp benchmark for multi-turn harm amplification in LLMs and TrajSafe proactive monitor that reduces harm while keeping low over-refusal and preserving capabilities.
What to Format and How: A Benchmark and Workflow Approach for Document Formatting cs.CL · 2026-06-01 · unverdicted · none · ref 12 · internal anchor
Presents DocFormBench benchmark and DocFormFlow workflow for content-aware LLM document formatting, claiming higher accuracy and lower token use via decoupled localization and modification.
Deep Research as Rubric for Reinforcement Learning cs.CL · 2026-05-31 · unverdicted · none · ref 23 · internal anchor
DR-rubric is a two-stage framework using iterative agentic search to generate atomic verifiable constraints for GRPO-based RL, achieving competitive performance on 6 benchmarks with 1K-3K examples via bootstrap or frontier-model rubrics.

OpenAI GPT-5 System Card

hub tools

citation-role summary

citation-polarity summary

claims ledger

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer