super hub Mixed citations

GPT-4o System Card

author=, Gpt-4o system card · 2024 · cs.CL · arXiv 2410.21276

Mixed citation behavior. Most common role is background (53%).

773 Pith papers citing it

Background 53% of classified citations

open full Pith review browse 773 citing papers more from author= arXiv PDF

abstract

GPT-4o is an autoregressive omni model that accepts as input any combination of text, audio, image, and video, and generates any combination of text, audio, and image outputs. It's trained end-to-end across text, vision, and audio, meaning all inputs and outputs are processed by the same neural network. GPT-4o can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time in conversation. It matches GPT-4 Turbo performance on text in English and code, with significant improvement on text in non-English languages, while also being much faster and 50\% cheaper in the API. GPT-4o is especially better at vision and audio understanding compared to existing models. In line with our commitment to building AI safely and consistent with our voluntary commitments to the White House, we are sharing the GPT-4o System Card, which includes our Preparedness Framework evaluations. In this System Card, we provide a detailed look at GPT-4o's capabilities, limitations, and safety evaluations across multiple categories, focusing on speech-to-speech while also evaluating text and image capabilities, and measures we've implemented to ensure the model is safe and aligned. We also include third-party assessments on dangerous capabilities, as well as discussion of potential societal impacts of GPT-4o's text and vision capabilities.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 97 baseline 51 method 23 dataset 3

citation-polarity summary

background 93 baseline 51 use method 22 unclear 4 use dataset 3 support 1

claims ledger

abstract GPT-4o is an autoregressive omni model that accepts as input any combination of text, audio, image, and video, and generates any combination of text, audio, and image outputs. It's trained end-to-end across text, vision, and audio, meaning all inputs and outputs are processed by the same neural network. GPT-4o can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time in conversation. It matches GPT-4 Turbo performance on text in English and code, with significant improvement on text in non-English languages, while

authors

author= Gpt-4o system card

co-cited works

representative citing papers

Where to Look: Can Foundation Models Reach a Target Viewpoint Through Active Exploration?

cs.CV · 2026-05-31 · accept · novelty 8.0

Introduces the TVR active viewpoint-matching task and TVRBench indoor simulation benchmark, where foundation models start at low single-digit success rates but reach 51.4% after visual-action SFT and multi-turn GRPO post-training.

VideoFDB: Evaluating Full-Duplex Vision-Speech Capabilities in Conversational Agents

cs.CV · 2026-05-28 · unverdicted · novelty 8.0

VideoFDB is a new benchmark and LM-as-judge framework for evaluating full-duplex audio-visual-to-audio-visual conversational agents on nonverbal dynamics from real video calls.

Knowledge Poisoning Attacks on Medical Multi-Modal Retrieval-Augmented Generation

cs.CR · 2026-05-11 · unverdicted · novelty 8.0

M³Att poisons medical multimodal RAG by pairing covert textual misinformation with query-agnostic visual perturbations that increase retrieval of the bad content, causing LLMs to generate clinically plausible but incorrect responses.

From Mirage to Grounding: Towards Reliable Multimodal Circuit-to-Verilog Code Generation

cs.SE · 2026-04-30 · unverdicted · novelty 8.0

MLLMs exhibit a Mirage effect by bypassing circuit diagrams in favor of header semantics for Verilog generation; VeriGround with identifier anonymization and D-ORPO training reaches 46% Functional Pass@1 while refusing blank images at >92%.

CHASM: Unveiling Covert Advertisements on Chinese Social Media

cs.LG · 2026-04-22 · unverdicted · novelty 8.0

CHASM is a new benchmark dataset showing that existing multimodal large language models fail to reliably detect covert advertisements on Chinese social media even after fine-tuning.

HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models

cs.SD · 2026-04-21 · unverdicted · novelty 8.0

HalluAudio is the first large-scale benchmark spanning speech, environmental sound, and music that uses human-verified QA pairs, adversarial prompts, and mixed-audio tests to measure hallucinations in large audio-language models.

EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations

cs.CV · 2026-04-20 · unverdicted · novelty 8.0

EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.

HarmfulSkillBench: How Do Harmful Skills Weaponize Your Agents?

cs.CR · 2026-04-16 · unverdicted · novelty 8.0

Harmful skills in open agent ecosystems raise average harm scores from 0.27 to 0.76 across six LLMs by lowering refusal rates when tasks are presented via pre-installed skills.

ReConText3D: Replay-based Continual Text-to-3D Generation

cs.CV · 2026-04-15 · conditional · novelty 8.0

ReConText3D is the first replay-memory framework for continual text-to-3D generation that prevents catastrophic forgetting on new textual categories while preserving quality on previously seen classes.

MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark

cs.CV · 2026-04-12 · unverdicted · novelty 8.0 · 2 refs

MMRareBench provides 1,756 QA pairs and 7,958 images from PMC rare-disease cases to evaluate 23 MLLMs, revealing low treatment-planning scores and medical models underperforming general models on multi-image tasks due to capacity dilution.

DialBGM: A Benchmark for Background Music Recommendation from Everyday Multi-Turn Dialogues

cs.AI · 2026-04-09 · unverdicted · novelty 8.0

DialBGM is a new benchmark dataset revealing that existing AI models fall far short of human performance when recommending fitting background music for open-domain conversations.

EgoSound: Benchmarking Sound Understanding in Egocentric Videos

cs.CV · 2026-02-15 · unverdicted · novelty 8.0

EgoSound is a new benchmark with 7315 QA pairs across seven tasks to evaluate egocentric sound understanding in multimodal large language models.

VLRS-Bench: A Vision-Language Reasoning Benchmark for Remote Sensing

cs.CV · 2026-02-04 · unverdicted · novelty 8.0

VLRS-Bench is the first benchmark dedicated to complex vision-language reasoning in remote sensing, with 2000 QA pairs across 14 tasks in cognition, decision, and prediction dimensions.

SwissGov-RSD: A Human-annotated, Cross-lingual Benchmark for Token-level Recognition of Semantic Differences Between Related Documents

cs.CL · 2025-12-08 · accept · novelty 8.0

SwissGov-RSD is the first naturalistic cross-lingual document-level benchmark with human token-level semantic difference annotations, on which both LLMs and encoders show a large performance gap relative to simpler settings.

Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark

cs.AI · 2025-09-30 · unverdicted · novelty 8.0

CritPt benchmark shows state-of-the-art LLMs reach only 5.7% average accuracy on full-scale unpublished physics research tasks, rising to about 10% with coding tools.

Flow-GRPO: Training Flow Matching Models via Online RL

cs.CV · 2025-05-08 · unverdicted · novelty 8.0

Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

cs.CV · 2024-09-25 · accept · novelty 8.0

Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

LiveBench: A Challenging, Contamination-Limited LLM Benchmark

cs.CL · 2024-06-27 · unverdicted · novelty 8.0

LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.

CrypFormBench: Benchmarking Formal Analysis Capability of Large Language Models for Cryptographic Schemes

cs.CR · 2026-06-24 · unverdicted · novelty 7.0

CrypFormBench is a new benchmark jointly covering symbolic and computational security to evaluate LLMs on five formal analysis capabilities, with results showing top model Claude-3.5 scores 48.7/100 and most models struggling on generation, transformation, and correction.

SafeGen-Bench: Benchmarking Safety in Image-Conditioned Text-to-Video Generation

cs.CV · 2026-05-31 · unverdicted · novelty 7.0

SafeGen-Bench is a benchmark with 10 malicious categories that evaluates conditional T2V models on paired start frames and text prompts, finding unsafety scores up to 44.5 and 80% guardrail failure rate.

PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects

cs.CL · 2026-05-31 · unverdicted · novelty 7.0

PolySpeech-100 is a new benchmark for native-level speech comprehension across 110 linguistic variants that evaluates 22 models and reports E2E advantages on dialects, robustness gaps on low-resource languages, and degradation from Chain-of-Thought prompting.

DeepLatent: Think with Images via Parallel Latent Visual Reasoning

cs.CV · 2026-05-30 · unverdicted · novelty 7.0

DeepLatent introduces a parallel latent visual reasoning framework with learnable 2D tokens and continuous RL, trained via distillation then RL, plus a new 180K dataset, claiming SOTA benchmark results.

Skill or Skip? Learning Selective Skill Invocation in Agentic Tasks via Dual-Granularity Preference Learning

cs.CL · 2026-05-30 · unverdicted · novelty 7.0

SelSkill applies dual-granularity preference learning to selective skill-or-skip decisions, improving task success by 10.9 points and execution precision by 29.1 points on ALFWorld with Qwen3-8B.

Algorithmic Recourse of In-Context Learning for Tabular Data

cs.LG · 2026-05-29 · unverdicted · novelty 7.0

The paper delivers the first theoretical analysis and practical zeroth-order framework for algorithmic recourse under in-context learning for tabular prediction.

citing papers explorer

Showing 50 of 118 citing papers after filters.

DialBGM: A Benchmark for Background Music Recommendation from Everyday Multi-Turn Dialogues cs.AI · 2026-04-09 · unverdicted · none · ref 11 · internal anchor
DialBGM is a new benchmark dataset revealing that existing AI models fall far short of human performance when recommending fitting background music for open-domain conversations.
Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark cs.AI · 2025-09-30 · unverdicted · none · ref 19 · internal anchor
CritPt benchmark shows state-of-the-art LLMs reach only 5.7% average accuracy on full-scale unpublished physics research tasks, rising to about 10% with coding tools.
OmniMatBench: A Human-Calibrated Multimodal Reasoning Benchmark Across 19 Materials Science Subfields cs.AI · 2026-05-28 · unverdicted · none · ref 14 · internal anchor
OmniMatBench is a new human-calibrated benchmark for multimodal materials-science reasoning that reveals the best evaluated MLLM scores only 0.372 overall.
VeriTrip: A Verifiable Benchmark for Travel Planning Agents over Unstructured Web Corpora cs.AI · 2026-05-27 · unverdicted · none · ref 25 · internal anchor
VeriTrip is a new benchmark using a Multimodal Retrieval Base and Verifiable Knowledge Base to evaluate evidence-grounded reasoning and factual reliability in travel planning agents over unstructured multimodal web data.
Forecasting Scientific Progress with Artificial Intelligence cs.AI · 2026-05-21 · unverdicted · none · ref 31 · internal anchor
Introduces the CUSP benchmark across 4760 events and finds frontier AI models can pick plausible directions but fail to predict whether or when scientific advances will occur, with performance varying by domain and insensitive to training cutoffs.
Perception or Prejudice: Can MLLMs Go Beyond First Impressions of Personality? cs.AI · 2026-05-21 · unverdicted · none · ref 30 · internal anchor
Introduces the Grounded Personality Reasoning task and MM-OCEAN dataset to show that MLLMs frequently produce correct Big Five personality ratings without grounding them in observable video evidence.
Robotics-Inspired Guardrails for Foundation Models in Socially Sensitive Domains cs.AI · 2026-05-19 · unverdicted · none · ref 20 · internal anchor
Introduces the Grounded Observer framework that applies robotics-inspired formal constructs for runtime constraint enforcement on foundation model interaction trajectories in socially sensitive domains.
Hierarchical Attacks for Multi-Modal Multi-Agent Reasoning cs.AI · 2026-05-13 · unverdicted · none · ref 14 · internal anchor
HAM³ achieves up to 78.3% attack success rate on the GQA benchmark by hierarchically attacking perception, communication, and reasoning layers in multi-modal multi-agent systems.
PaperFit: Vision-in-the-Loop Typesetting Optimization for Scientific Documents cs.AI · 2026-05-11 · unverdicted · none · ref 88 · internal anchor
PaperFit uses rendered page images in a closed loop to diagnose and repair typesetting defects in LaTeX documents, outperforming baselines on a new benchmark of 200 papers.
EpiGraph: Building Generalists for Evidence-Intensive Epilepsy Reasoning in the Wild cs.AI · 2026-05-10 · conditional · none · ref 44 · 2 links · internal anchor
EpiGraph creates a heterogeneous epilepsy knowledge graph that boosts LLM performance on clinical reasoning tasks by 30-41% in pharmacogenomics when used with Graph-RAG.
AHD Agent: Agentic Reinforcement Learning for Automatic Heuristic Design cs.AI · 2026-05-09 · unverdicted · none · ref 57 · internal anchor
AHD Agent trains a 4B-parameter LLM via agentic RL to actively use tools for automatic heuristic design, matching or exceeding larger baselines across eight domains with fewer evaluations.
LLM-guided Semi-Supervised Approaches for Social Media Crisis Data Classification cs.AI · 2026-05-08 · conditional · none · ref 27 · internal anchor
LG-CoTrain, an LLM-guided co-training method, outperforms classical semi-supervised baselines for crisis tweet classification in low-resource settings with 5-25 labeled examples per class.
Weblica: Scalable and Reproducible Training Environments for Visual Web Agents cs.AI · 2026-05-07 · unverdicted · none · ref 16 · internal anchor
Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar open baselines on navigation benchmarks.
MolRecBench-Wild: A Real-World Benchmark for Optical Chemical Structure Recognition cs.AI · 2026-05-07 · unverdicted · none · ref 10 · internal anchor
MolRecBench-Wild reveals that 18 existing OCSR models suffer severe performance drops on complex real-world academic molecular images compared with prior patent benchmarks.
DataDignity: Training Data Attribution for Large Language Models cs.AI · 2026-05-07 · unverdicted · none · ref 1 · internal anchor
ScoringModel raises mean Recall@10 to 52.2 on the FakeWiki provenance benchmark from 35.0 for the best baseline, winning 41 of 45 model-by-condition comparisons and gaining 15.7 points on jailbreak-style queries.
Agentic-imodels: Evolving agentic interpretability tools via autoresearch cs.AI · 2026-05-05 · unverdicted · none · ref 76 · internal anchor
Agentic-imodels evolves scikit-learn regressors via an autoresearch loop to jointly boost predictive performance and LLM-simulatability, improving downstream agentic data science tasks by up to 73% on the BLADE benchmark.
Where Paths Split: Localized, Calibrated Control of Moral Reasoning in Large Language Models cs.AI · 2026-05-05 · unverdicted · none · ref 7 · internal anchor
A technique identifies minimal convergence-divergence points in LLM transformer blocks and calibrates residual-stream directions to achieve targeted ethical-framework control at inference time.
Enhancing Agent Safety Judgment: Controlled Benchmark Rewriting and Analogical Reasoning for Deceptive Out-of-Distribution Scenarios cs.AI · 2026-05-05 · unverdicted · none · ref 38 · internal anchor
ROME generates deceptive safety benchmarks that degrade LLM agent judgment performance, while ARISE uses analogical retrieval to improve safety decisions at inference time without retraining.
Investigating More Explainable and Partition-Free Compositionality Estimation for LLMs: A Rule-Generation Perspective cs.AI · 2026-04-30 · unverdicted · none · ref 21 · internal anchor
A rule-generation perspective lets LLMs write programs as rules for data mapping and applies complexity theory to estimate their compositionality, tested on string-to-grid tasks.
SciEval: A Benchmark for Automatic Evaluation of K-12 Science Instructional Materials cs.AI · 2026-04-28 · unverdicted · none · ref 16 · internal anchor
SciEval is a new benchmark of expert-annotated K-12 science lessons for LLM-based automatic evaluation, where zero-shot models perform poorly but fine-tuning yields up to 11% gains.
LLMs can persuade only psychologically susceptible humans on societal issues, via trust in AI and emotional appeals, amid logical fallacies cs.AI · 2026-04-18 · unverdicted · none · ref 30 · internal anchor
LLMs persuade only psychologically susceptible humans on societal issues through trust in AI and emotional appeals, while both sides rely on logical fallacies in roughly one out of every six conversational turns.
ADAPT: Benchmarking Commonsense Planning under Unspecified Affordance Constraints cs.AI · 2026-04-16 · unverdicted · none · ref 9 · internal anchor
ADAPT augments planners with affordance reasoning to raise task success in environments with unspecified and time-varying object affordances, and a LoRA-finetuned VLM backend beats GPT-4o on the new DynAfford benchmark.
GeoAgentBench: A Dynamic Execution Benchmark for Tool-Augmented Agents in Spatial Analysis cs.AI · 2026-04-15 · unverdicted · none · ref 19 · internal anchor
GeoAgentBench supplies a live execution environment and Plan-and-React architecture that lets tool-using AI agents handle multi-step GIS tasks more robustly than prior static evaluation methods.
TimeSeriesExamAgent: Creating Time Series Reasoning Benchmarks at Scale cs.AI · 2026-04-11 · conditional · none · ref 17 · internal anchor
TimeSeriesExamAgent combines templates and LLM agents to generate scalable time series reasoning benchmarks, demonstrating that current LLMs have limited performance on both abstract and domain-specific tasks.
PilotBench: A Benchmark for General Aviation Agents with Safety Constraints cs.AI · 2026-04-10 · unverdicted · none · ref 17 · internal anchor
PilotBench reveals that LLMs follow safety instructions well in flight trajectory prediction but deliver lower numerical precision than traditional forecasters, exposing a precision-controllability tradeoff.
How Independent are Large Language Models? A Statistical Framework for Auditing Behavioral Entanglement and Reweighting Verifier Ensembles cs.AI · 2026-04-08 · unverdicted · none · ref 8 · internal anchor
A new auditing framework reveals widespread behavioral entanglement among LLMs and shows that reweighting ensembles based on measured independence improves verification accuracy by up to 4.5%.
Pressure, What Pressure? Sycophancy Disentanglement in Language Models via Reward Decomposition cs.AI · 2026-04-07 · unverdicted · none · ref 5 · internal anchor
A five-term decomposed reward in GRPO training reduces sycophancy across models and generalizes to unseen pressure types by targeting pressure resistance and evidence responsiveness separately.
PERMA: Benchmarking Personalized Memory Agents via Event-Driven Preference and Realistic Task Environments cs.AI · 2026-03-24 · unverdicted · none · ref 40 · internal anchor
PERMA is a new benchmark using temporally ordered events, text variability, and linguistic alignment to evaluate LLM memory agents on persona consistency beyond simple retrieval.
ProactiveMobile: A Comprehensive Benchmark for Boosting Proactive Intelligence on Mobile Devices cs.AI · 2026-02-25 · conditional · none · ref 16 · internal anchor
ProactiveMobile is a new benchmark for proactive mobile agents that tests latent intent inference from context and executable API generation, where a fine-tuned 7B model reaches 19.15% success versus 15.71% for o1 and 7.39% for GPT-5.
Conv-FinRe: A Conversational and Longitudinal Benchmark for Utility-Grounded Financial Recommendation cs.AI · 2026-02-19 · unverdicted · none · ref 13 · internal anchor
Conv-FinRe is a new benchmark built from real market data and human trajectories that tests LLMs on generating utility-grounded stock rankings over fixed horizons while distinguishing rational analysis from behavioral mimicry or momentum.
SpatialBench: Benchmarking Multimodal Large Language Models for Spatial Cognition cs.AI · 2025-11-26 · unverdicted · none · ref 30 · internal anchor
SpatialBench creates a five-level framework and 15-task benchmark to measure hierarchical spatial reasoning in MLLMs, finding strong basic perception but weak symbolic reasoning, causal inference, and planning.
ToolPRM: Fine-Grained Inference Scaling of Structured Outputs for Function Calling cs.AI · 2025-10-16 · unverdicted · none · ref 13 · internal anchor
ToolPRM provides fine-grained intra-call process supervision via a new dataset and reward model, outperforming outcome and coarse-grained alternatives on function-calling benchmarks.
GeoLaux: A Benchmark for Evaluating MLLMs' Geometry Performance on Long-Step Problems Requiring Auxiliary Lines cs.AI · 2025-08-08 · accept · none · ref 14 · internal anchor
GeoLaux is a new benchmark of 2186 long-step geometry problems requiring auxiliary lines, used to evaluate 23 MLLMs and reveal major drops in performance on complex tasks.
OSWorld-Human: Benchmarking the Efficiency of Computer-Use Agents cs.AI · 2025-06-19 · unverdicted · none · ref 8 · internal anchor
AI agents on OSWorld take 2.7-4.3 times more steps than human trajectories, with latency rising sharply due to repeated large model calls for planning and reflection.
R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization cs.AI · 2025-03-17 · conditional · none · ref 15 · internal anchor
R1-VL uses StepGRPO with rule-based StepRAR and StepRVR rewards to let MLLMs learn step-by-step reasoning beyond imitation of positive paths.
Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation cs.AI · 2025-03-14 · conditional · none · ref 29 · internal anchor
Chain-of-thought monitoring detects reward hacking in frontier reasoning models, but strong optimization against the monitor produces obfuscated misbehavior that remains hard to detect.
Harmonizing Real-Time Constraints and Long-Horizon Reasoning: An Asynchronous Agentic Framework for Dynamic Scheduling cs.AI · 2026-05-28 · unverdicted · none · ref 12 · internal anchor
RACE-Sched is an asynchronous dual-stream agent framework combining low-latency symbolic heuristics with parallel LLM-based rule synthesis and sandbox validation for dynamic flexible job shop scheduling.
MACReD: A Multi-Agent Collaborative Reasoning Framework for Reaction Diagram Parsing cs.AI · 2026-05-27 · unverdicted · none · ref 27 · internal anchor
MACReD is a multi-agent collaborative reasoning framework for reaction diagram parsing that reports state-of-the-art F1 scores of 75.2% and 84.6% on the RxnScribe benchmark.
MemAudit: Post-hoc Auditing of Poisoned Agent Memory via Causal Attribution and Structural Anomaly Detection cs.AI · 2026-05-22 · unverdicted · none · ref 5 · internal anchor
MemAudit combines counterfactual causal influence scores with memory consistency graphs to identify poisoned records in LLM agent memory, reducing MINJA attack success from 70% to 0% in QA and 83.3% to 0% in reasoning tasks.
Spreadsheet-RL: Advancing Large Language Model Agents on Realistic Spreadsheet Tasks via Reinforcement Learning cs.AI · 2026-05-21 · unverdicted · none · ref 13 · internal anchor
Spreadsheet-RL applies RL fine-tuning and a custom Gym environment to raise LLM agent Pass@1 scores on spreadsheet benchmarks from roughly 8-12% to 17-23%.
AgentCo-op: Retrieval-Based Synthesis of Interoperable Multi-Agent Workflows cs.AI · 2026-05-19 · unverdicted · none · ref 13 · internal anchor
AgentCo-op retrieves and assembles existing agents and tools into interoperable workflows for open-world scientific tasks, showing effectiveness in genomics case studies and competitive benchmark results with lower costs.
BLINKG: A Benchmark for LLM-Integrated Knowledge Graph Generation cs.AI · 2026-05-19 · unverdicted · none · ref 58 · internal anchor
BLINKG is a benchmark for evaluating LLMs on mapping input data schemas to ontology concepts for knowledge graph construction, with experiments showing promising but limited performance in complex real-world scenarios.
Harnessing LLM Agents with Skill Programs cs.AI · 2026-05-18 · conditional · none · ref 32 · internal anchor
HASP upgrades textual skills into executable Program Functions that intervene in LLM agent loops at inference, post-training, or self-evolution, delivering 25% gains over ReAct and 30.4% over Search-R1 on reasoning benchmarks.
OmniDrop: Layer-wise Token Pruning for Omni-modal LLMs via Query-Guidance cs.AI · 2026-05-14 · unverdicted · none · ref 15 · internal anchor
OmniDrop is a training-free layer-wise token pruning framework for omni-modal LLMs that uses query guidance and temporal diversity to reduce prefill latency by up to 40% and memory by 14.7% while improving benchmark scores by up to 3.58 points.
DVMap: Fine-Grained Pluralistic Value Alignment via High-Consensus Demographic-Value Mapping cs.AI · 2026-05-14 · unverdicted · none · ref 4 · internal anchor
DVMap extracts high-consensus demographic groups from survey data and applies structured CoT plus GRPO to align LLMs with pluralistic values, reporting 48.6% accuracy on cross-demographic generalization tests.
MolDeTox: Evaluating Language Model's Stepwise Fragment Editing for Molecular Detoxification cs.AI · 2026-05-12 · unverdicted · none · ref 42 · internal anchor
MolDeTox is a new benchmark that shows fragment-level stepwise editing by LLMs and VLMs improves structural validity and detoxification quality over prior toxicity-focused evaluations.
ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox cs.AI · 2026-05-11 · unverdicted · none · ref 23 · 2 links · internal anchor
ComplexMCP benchmark shows top LLM agents achieve under 60% success on dynamic interdependent tool tasks versus 90% for humans, due to tool retrieval saturation, over-confidence, and strategic defeatism.
HAGE: Harnessing Agentic Memory via RL-Driven Weighted Graph Evolution cs.AI · 2026-05-11 · unverdicted · none · ref 71 · internal anchor
HAGE proposes a trainable weighted graph memory framework with LLM intent classification, dynamic edge modulation, and RL optimization that improves long-horizon reasoning accuracy in agentic LLMs over static baselines.
TIDE-Bench: Task-Aware and Diagnostic Evaluation of Tool-Integrated Reasoning cs.AI · 2026-05-10 · unverdicted · none · ref 47 · internal anchor
TIDE-Bench is a new benchmark for tool-integrated reasoning that combines diverse tasks, multi-aspect metrics covering answer quality, process reliability, efficiency and cost, plus filtered challenging test sets.
Internalizing Safety Understanding in Large Reasoning Models via Verification cs.AI · 2026-05-09 · unverdicted · none · ref 7 · internal anchor
Training large reasoning models only on safety verification tasks internalizes safety understanding and boosts robustness to out-of-domain jailbreaks, providing a stronger base for reinforcement learning alignment than standard supervised fine-tuning.

GPT-4o System Card

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer