mega hub Mixed citations

GPT-4o System Card

author=, Gpt-4o system card · 2024 · cs.CL · arXiv 2410.21276

Mixed citation behavior. Most common role is background (54%).

1001 Pith papers citing it

Background 54% of classified citations

open full Pith review browse 1001 citing papers more from author= arXiv PDF

abstract

GPT-4o is an autoregressive omni model that accepts as input any combination of text, audio, image, and video, and generates any combination of text, audio, and image outputs. It's trained end-to-end across text, vision, and audio, meaning all inputs and outputs are processed by the same neural network. GPT-4o can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time in conversation. It matches GPT-4 Turbo performance on text in English and code, with significant improvement on text in non-English languages, while also being much faster and 50\% cheaper in the API. GPT-4o is especially better at vision and audio understanding compared to existing models. In line with our commitment to building AI safely and consistent with our voluntary commitments to the White House, we are sharing the GPT-4o System Card, which includes our Preparedness Framework evaluations. In this System Card, we provide a detailed look at GPT-4o's capabilities, limitations, and safety evaluations across multiple categories, focusing on speech-to-speech while also evaluating text and image capabilities, and measures we've implemented to ensure the model is safe and aligned. We also include third-party assessments on dangerous capabilities, as well as discussion of potential societal impacts of GPT-4o's text and vision capabilities.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 98 baseline 51 method 23 dataset 3

citation-polarity summary

background 94 baseline 51 use method 22 unclear 4 use dataset 3 support 1

claims ledger

abstract GPT-4o is an autoregressive omni model that accepts as input any combination of text, audio, image, and video, and generates any combination of text, audio, and image outputs. It's trained end-to-end across text, vision, and audio, meaning all inputs and outputs are processed by the same neural network. GPT-4o can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time in conversation. It matches GPT-4 Turbo performance on text in English and code, with significant improvement on text in non-English languages, while

authors

author= Gpt-4o system card

mega hub controls

export citing contexts JSON export graph JSON export full bundle JSON open full Pith review annotated reader queued

Recognition alignment

counterfactual ablation

If this work disappeared, these are the nearest dependency candidates in Pith, weighted toward method, dataset, baseline, and extension contexts where available. This is a structural signal, not a retraction verdict.

co-cited works

representative citing papers

UniQL: Towards Dialect-Universal Benchmarking for Text-to-SQL

cs.AI · 2026-06-06 · unverdicted · novelty 8.0

UniQL is a human-verified benchmark providing aligned natural language questions and dialect-specific SQL queries for 16 SQL systems to evaluate cross-dialect generalization.

Where to Look: Can Foundation Models Reach a Target Viewpoint Through Active Exploration?

cs.CV · 2026-05-31 · accept · novelty 8.0

Introduces the TVR active viewpoint-matching task and TVRBench indoor simulation benchmark, where foundation models start at low single-digit success rates but reach 51.4% after visual-action SFT and multi-turn GRPO post-training.

VideoFDB: Evaluating Full-Duplex Vision-Speech Capabilities in Conversational Agents

cs.CV · 2026-05-28 · unverdicted · novelty 8.0

VideoFDB is a new benchmark and LM-as-judge framework for evaluating full-duplex audio-visual-to-audio-visual conversational agents on nonverbal dynamics from real video calls.

Knowledge Poisoning Attacks on Medical Multi-Modal Retrieval-Augmented Generation

cs.CR · 2026-05-11 · unverdicted · novelty 8.0

M³Att poisons medical multimodal RAG by pairing covert textual misinformation with query-agnostic visual perturbations that increase retrieval of the bad content, causing LLMs to generate clinically plausible but incorrect responses.

From Mirage to Grounding: Towards Reliable Multimodal Circuit-to-Verilog Code Generation

cs.SE · 2026-04-30 · unverdicted · novelty 8.0

MLLMs exhibit a Mirage effect by bypassing circuit diagrams in favor of header semantics for Verilog generation; VeriGround with identifier anonymization and D-ORPO training reaches 46% Functional Pass@1 while refusing blank images at >92%.

CHASM: Unveiling Covert Advertisements on Chinese Social Media

cs.LG · 2026-04-22 · unverdicted · novelty 8.0

CHASM is a new benchmark dataset showing that existing multimodal large language models fail to reliably detect covert advertisements on Chinese social media even after fine-tuning.

HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models

cs.SD · 2026-04-21 · unverdicted · novelty 8.0

HalluAudio is the first large-scale benchmark spanning speech, environmental sound, and music that uses human-verified QA pairs, adversarial prompts, and mixed-audio tests to measure hallucinations in large audio-language models.

EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations

cs.CV · 2026-04-20 · unverdicted · novelty 8.0

EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.

HarmfulSkillBench: How Do Harmful Skills Weaponize Your Agents?

cs.CR · 2026-04-16 · unverdicted · novelty 8.0

Harmful skills in open agent ecosystems raise average harm scores from 0.27 to 0.76 across six LLMs by lowering refusal rates when tasks are presented via pre-installed skills.

ReConText3D: Replay-based Continual Text-to-3D Generation

cs.CV · 2026-04-15 · conditional · novelty 8.0

ReConText3D is the first replay-memory framework for continual text-to-3D generation that prevents catastrophic forgetting on new textual categories while preserving quality on previously seen classes.

MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark

cs.CV · 2026-04-12 · unverdicted · novelty 8.0 · 2 refs

MMRareBench provides 1,756 QA pairs and 7,958 images from PMC rare-disease cases to evaluate 23 MLLMs, revealing low treatment-planning scores and medical models underperforming general models on multi-image tasks due to capacity dilution.

DialBGM: A Benchmark for Background Music Recommendation from Everyday Multi-Turn Dialogues

cs.AI · 2026-04-09 · unverdicted · novelty 8.0

DialBGM is a new benchmark dataset revealing that existing AI models fall far short of human performance when recommending fitting background music for open-domain conversations.

EgoSound: Benchmarking Sound Understanding in Egocentric Videos

cs.CV · 2026-02-15 · unverdicted · novelty 8.0

EgoSound is a new benchmark with 7315 QA pairs across seven tasks to evaluate egocentric sound understanding in multimodal large language models.

VLRS-Bench: A Vision-Language Reasoning Benchmark for Remote Sensing

cs.CV · 2026-02-04 · unverdicted · novelty 8.0

VLRS-Bench is the first benchmark dedicated to complex vision-language reasoning in remote sensing, with 2000 QA pairs across 14 tasks in cognition, decision, and prediction dimensions.

SwissGov-RSD: A Human-annotated, Cross-lingual Benchmark for Token-level Recognition of Semantic Differences Between Related Documents

cs.CL · 2025-12-08 · accept · novelty 8.0

SwissGov-RSD is the first naturalistic cross-lingual document-level benchmark with human token-level semantic difference annotations, on which both LLMs and encoders show a large performance gap relative to simpler settings.

Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark

cs.AI · 2025-09-30 · unverdicted · novelty 8.0

CritPt benchmark shows state-of-the-art LLMs reach only 5.7% average accuracy on full-scale unpublished physics research tasks, rising to about 10% with coding tools.

Flow-GRPO: Training Flow Matching Models via Online RL

cs.CV · 2025-05-08 · unverdicted · novelty 8.0

Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

cs.CV · 2024-09-25 · accept · novelty 8.0

Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

LiveBench: A Challenging, Contamination-Limited LLM Benchmark

cs.CL · 2024-06-27 · unverdicted · novelty 8.0

LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.

LongEgoRefer: A Benchmark for Long-Form Egocentric Video Referring Expression Comprehension

cs.CV · 2026-07-02 · unverdicted · novelty 7.0

LongEgoRefer is a new benchmark of 1,498 referring expressions in 45-minute average egocentric videos that exposes the failure of existing Video REC models on sparse long-form spatio-temporal grounding.

A Cost-Aware, Paired Protocol for Auditing Dynamic Tool Synthesis in Agentic Video Question Answering

cs.CV · 2026-07-01 · unverdicted · novelty 7.0

Introduces a cost-aware paired protocol with six outcome groups and applies it to Dynamic-SAGE versus SAGE, reporting 7.5-point accuracy gain, 28% fewer tool calls, but 34% higher token use.

Perceive-to-Reason: Decoupling Perception and Reasoning for Fine-Grained Visual Reasoning

cs.CV · 2026-07-01 · unverdicted · novelty 7.0

P2R decouples perception from reasoning in VLMs via a two-stage process and PRA-GRPO alternating RL training, reporting gains such as 93.2% on V-Star for the 4B model over its Qwen3-VL backbone.

EgoGapBench: Benchmarking Egocentric Action Selection in Multi-Agent Scenes

cs.CV · 2026-07-01 · unverdicted · novelty 7.0

EgoGapBench shows humans reliably select egocentric actions in multi-agent scenes while MLLMs systematically choose other agents' actions, and standard egocentric training data fails to close the gap.

(A)I Sees What You Don't: Exploiting New Attack Surfaces in Third-Party Mobile Agents

cs.CR · 2026-07-01 · unverdicted · novelty 7.0

Identifies Screen Perception and Misused Channel attack surfaces in VLM-powered mobile agents and demonstrates seven attacks enabling arbitrary command execution on five frameworks without privileges.

citing papers explorer

Showing 50 of 181 citing papers after filters.

FinTagging: Benchmarking LLMs for Extracting and Structuring Financial Information cs.CL · 2025-05-27 · unverdicted · none · ref 12 · internal anchor
FinTagging decomposes XBRL tagging into FinNI extraction and FinCL full-taxonomy linking, showing LLMs handle extraction but struggle with fine-grained concept alignment in zero-shot settings.
MTR-Bench: A Comprehensive Benchmark for Multi-Turn Reasoning Evaluation cs.CL · 2025-05-21 · unverdicted · none · ref 15 · internal anchor
MTR-Bench is a new automated benchmark for multi-turn reasoning in LLMs covering diverse tasks and difficulty levels with 3600 instances.
Caught in the Web of Words: Do LLMs Fall for Spin in Medical Literature? cs.CL · 2025-02-11 · unverdicted · none · ref 25 · internal anchor
Evaluation of 22 LLMs shows they are more susceptible to spin in medical abstracts than humans but can recognize and mitigate it when prompted.
CLExEval: A Human-in-the-Loop Framework for Qualitative Evaluation of LLM Clinical Reasoning cs.CL · 2026-06-30 · unverdicted · none · ref 170 · internal anchor
CLExEval introduces a human-annotated evaluation framework on 40 rare cases that identifies verbosity bias, hidden knowledge paradox, and 68.6% reasoning-to-output mismatch in LLMs while showing LLM-as-a-Judge overestimates reliability.
When Calibration Rankings Reverse: Accuracy-Controlled Evaluation for Fair Comparison of LLMs cs.CL · 2026-06-29 · unverdicted · none · ref 105 · internal anchor
Global calibration metrics like ECE are confounded by accuracy; the proposed ACE framework with three accuracy-controlled views shows many prior calibration advantages weaken or reverse.
Before Thinking, Learn to Decide: Proactive Routing for Efficient Visual Reasoning cs.CL · 2026-06-29 · unverdicted · none · ref 20 · internal anchor
PRP introduces proactive routing via Draft Rating Learning and Joint Rating Learning to route queries early between draft and target models for efficient multimodal reasoning.
Thinking Like a Scientist? A Structural Study of LLM-Generated Research Methods cs.CL · 2026-06-15 · unverdicted · none · ref 61 · internal anchor
LLMs given only research questions from 1000 arXiv CS papers recommend a narrower set of methods than the original papers, with effective model-entity diversity dropping from 1232 to 59-96 and stronger agreement among LLMs than with papers.
GraspLLM: Towards Zero-Shot Generalization on Text-Attributed Graphs with LLMs cs.CL · 2026-06-10 · unverdicted · none · ref 58 · internal anchor
GraspLLM extracts dataset-agnostic structural patterns via motif contrastive learning and aligns contextual subgraphs to LLM tokens, outperforming prior LLM-based methods on TAGs especially in zero-shot settings.
IS-CoT: Breaking the Long-form Generation Collapse via Interleaved Structural Thinking cs.CL · 2026-06-08 · unverdicted · none · ref 21 · internal anchor
IS-CoT framework interleaves planning, writing, and reflection in LLMs to prevent length collapse, yielding IS-Writer-8B that outperforms larger models on long-form benchmarks with better length compliance.
Emergent Misalignment Can Be Induced by Sycophancy and Reversed via Alignment Gating cs.CL · 2026-06-08 · unverdicted · none · ref 56 · internal anchor
Sycophancy fine-tuning induces emergent misalignment in LLMs that Alignment Gating can reverse by learning to suppress unsafe representations with generalization from narrow to broad domains.
PACT: Learning Diverse Diagnostic Strategies via Privileged Synthesis and Branch Consensus cs.CL · 2026-06-08 · unverdicted · none · ref 14 · internal anchor
PACT combines privileged multi-paradigm dialogue synthesis from EMRs with consensus aggregation of paradigm-specific LoRA branches to reach SOTA on a new Chinese interactive medical diagnosis benchmark.
When New Generators Arrive: Lifelong Machine-Generated Text Attribution via Ridge Feature Transfer cs.CL · 2026-06-04 · unverdicted · none · ref 12 · internal anchor
RidgeFT enables replay-free lifelong MGT attribution via frozen encoder, class-wise sufficient statistics, covariance calibration, and closed-form ridge regression updates, outperforming baselines on macro-F1 and retention-adaptation balance.
Noisy memory encoding explains negative polarity illusions cs.CL · 2026-06-03 · unverdicted · none · ref 72 · internal anchor
Noisy memory encoding of determiners explains negative polarity illusions, with new acceptability experiments showing stronger illusions for similar determiner pairs.
Can I Take Another Dose? Evaluating LLM Decision-Making Under Temporal Uncertainty in OTC Dosing QA cs.CL · 2026-06-02 · unverdicted · none · ref 52 · internal anchor
Introduces DOSEBENCH benchmark and shows four LLMs often fail at rolling 24-hour dose calculations and constraint adherence in OTC dosing decisions despite appearing confident.
SimSD: Simple Speculative Decoding in Diffusion Language Models cs.CL · 2026-06-01 · unverdicted · none · ref 26 · internal anchor
SimSD adds a masking strategy to enable speculative decoding in diffusion LLMs, delivering up to 7.46x throughput gains on SDAR models while preserving generation quality.
Do Gender Cues Affect LLM Value Trade-offs? Evidence from a Controlled Decision Benchmark cs.CL · 2026-06-01 · unverdicted · none · ref 38 · internal anchor
Explicit gender cues induce bounded but systematic decision flips in LLMs on value trade-offs, with self-attributions frequently denying gender influence.
Understanding LLM Behavior in Multi-Target Cross-Lingual Summarization cs.CL · 2026-05-31 · unverdicted · none · ref 27 · internal anchor
Introduces the MEA benchmark for multi-target cross-lingual summarization across 24 languages and demonstrates that activation steering from English summarization representations improves performance.
Revisiting Parameter-Based Knowledge Editing in Large Language Models: Theoretical Limits and Empirical Evidence cs.CL · 2026-05-30 · conditional · none · ref 38 · internal anchor
Parameter-based knowledge editing in LLMs induces reasoning collapse via dimensional collapse and is consistently outperformed by a retrieval baseline across varied edit counts, knowledge complexity, and evaluation metrics.
EvoRubric: Self-Evolving Rubric-Driven RL for Open-Ended Generation cs.CL · 2026-05-28 · unverdicted · none · ref 34 · internal anchor
EvoRubric is a single-policy RL method that co-evolves a reasoner and a rubric generator with multi-level verification to produce dynamic rewards for open-ended LLM alignment.
From Blind Guess to Informed Judgment: Teaching LLMs to Evaluate Materials by Building Knowledge-Augmented Preference Signals cs.CL · 2026-05-28 · unverdicted · none · ref 26 · internal anchor
MaterEval generates paired informed and blind evaluations as preference signals to improve small open-source LLMs on high-entropy alloy assessment, approaching closed-source performance without external retrieval.
ConRAG: Consensus-Driven Multi-View Retrieval for Multi-Hop Question Answering cs.CL · 2026-05-27 · unverdicted · none · ref 11 · internal anchor
ConRAG is a new RAG framework that optimizes query and corpus sides using consensus across relation, entity, and text views to deliver up to 26.9% gains over vanilla RAG on multi-hop QA benchmarks.
Learning When to Think While Listening in Large Audio-Language Models cs.CL · 2026-05-26 · unverdicted · none · ref 19 · internal anchor
A wait-think-answer controller for LALMs is trained via SFT followed by six-reward DAPO, raising row-weighted accuracy from 67.6% to 70.3% and cutting post-endpoint thinking length by 14% on synthetic spoken QA while remaining functional on real recorded audio.
PersLitEval: Fine-grained Benchmark and Evaluation of LLMs on Persian Literature Questions cs.CL · 2026-05-26 · unverdicted · none · ref 14 · internal anchor
PersLitEval benchmark shows LLMs perform better on conceptual Persian literature tasks than spelling or word formation, with explained few-shot prompting yielding the strongest results across six models.
SEAL: Synergistic Co-Evolution of Agents and Learning Environments cs.CL · 2026-05-23 · unverdicted · none · ref 50 · internal anchor
SEAL co-evolves LLM agents and environments via shared turn-level failure diagnoses, yielding +8.25 to +26.25 point gains on tool-use tasks with only 400 samples.
Seeing the Poem: Image-Semantic Detection of AI-Generated Modern Chinese Poetry with MLLMs cs.CL · 2026-05-21 · unverdicted · none · ref 97 · internal anchor
An image-semantic guided method enhances MLLMs for detecting AI-generated modern Chinese poetry by combining poem text with visual representations of content, achieving 85.65% Macro-F1 with Gemini and outperforming text baselines and RoBERTa.
Token-weighted Direct Preference Optimization with Attention cs.CL · 2026-05-21 · unverdicted · none · ref 32 · internal anchor
AttentionPO weights tokens in DPO using LLM attention as a pairwise judge, yielding better results on AlpacaEval, MT-Bench, and ArenaHard than prior preference optimization methods.
Mem-$\pi$: Adaptive Memory through Learning When and What to Generate cs.CL · 2026-05-20 · unverdicted · none · ref 16 · internal anchor
Mem-π is a framework using a dedicated model and decision-content decoupled RL to generate context-specific guidance on demand for LLM agents, outperforming retrieval baselines by over 30% on web navigation.
Are Tools Always Beneficial? Learning to Invoke Tools Adaptively for Dual-Mode Multimodal LLM Reasoning cs.CL · 2026-05-19 · unverdicted · none · ref 67 · internal anchor
AutoTool uses dual-mode RL to let MLLMs adaptively choose tool use or text-only reasoning, reporting 21.8% accuracy gain on V* and 44.9% efficiency gain on POPE versus baselines.
VerifyMAS: Hypothesis Verification for Failure Attribution in LLM Multi-Agent Systems cs.CL · 2026-05-17 · unverdicted · none · ref 10 · internal anchor
VerifyMAS improves failure attribution in LLM multi-agent systems via hypothesis verification on full trajectories, error taxonomy-based data construction, and fine-tuned verifier models, outperforming prior direct-prediction methods on Aegis-Bench and Who&When.
Transitivity Meets Cyclicity: Explicit Preference Decomposition for Dynamic Large Language Model Alignment cs.CL · 2026-05-17 · unverdicted · none · ref 46 · internal anchor
Introduces HRC model for game-theoretic decomposition of preferences into orthogonal transitive and cyclic components, paired with DSPPO for dynamic Nash-seeking alignment, reporting gains over BT and GPM baselines on RewardBench and downstream LLM evaluations.
Artificial Intolerance: Stigmatizing Language in Clinical Documentation Skews Large Language Model Decision-Making cs.CL · 2026-05-17 · unverdicted · none · ref 7 · internal anchor
Frontier LLMs exhibit bias from stigmatizing language in clinical vignettes across four conditions, skewing decisions toward less aggressive management, with limited mitigation from Chain-of-Thought or self-debiasing prompts.
Contexting as Recommendation: Evolutionary Collaborative Filtering for Context Engineering cs.CL · 2026-05-15 · conditional · none · ref 18 · internal anchor
NCCE reframes context engineering as instance-level recommendation via bootstrapped anchor contexts and a co-evolving neural collaborative filtering router that assigns specialized contexts per input.
DocAtlas: Multilingual Document Understanding Across 80+ Languages cs.CL · 2026-05-12 · unverdicted · none · ref 49 · 2 links · internal anchor
DocAtlas introduces model-free rendering pipelines to create DocTag-annotated datasets across 82 languages and shows DPO adaptation improves multilingual performance without base-language degradation.
Scalable Token-Level Hallucination Detection in Large Language Models cs.CL · 2026-05-12 · unverdicted · none · ref 22 · internal anchor
TokenHD uses a scalable data synthesis engine and importance-weighted training to create token-level hallucination detectors that work on free-form text and scale from 0.6B to 8B parameters, outperforming larger reasoning models.
SkillGraph: Skill-Augmented Reinforcement Learning for Agents via Evolving Skill Graphs cs.CL · 2026-05-12 · unverdicted · none · ref 5 · internal anchor
SkillGraph represents skills as nodes in an evolving directed graph with typed dependency edges and updates the graph from RL trajectories to boost compositional task performance.
ASTRA-QA: A Benchmark for Abstract Question Answering over Documents cs.CL · 2026-05-11 · unverdicted · none · ref 33 · internal anchor
ASTRA-QA is a benchmark for abstract document question answering that uses explicit topic sets, unsupported content annotations, and evidence alignments to enable direct scoring of coverage and hallucination.
Pseudo-Deliberation in Language Models: When Reasoning Fails to Align Values and Actions cs.CL · 2026-05-11 · unverdicted · none · ref 17 · 2 links · internal anchor
LLMs exhibit Pseudo-Deliberation where explicit reasoning fails to align stated values with generated actions, measured via the new VALDI framework across 4,941 scenarios in five domains.
SEIF: Self-Evolving Reinforcement Learning for Instruction Following cs.CL · 2026-05-08 · conditional · none · ref 14 · internal anchor
SEIF creates a self-reinforcing loop in which an LLM alternately generates increasingly difficult instructions and learns to follow them better using reinforcement learning signals from its own judgments.
Learning Agent Routing From Early Experience cs.CL · 2026-05-08 · unverdicted · none · ref 50 · internal anchor
BoundaryRouter routes queries to LLM or agent using early experience memory from a seed set, cutting inference time 60.6% versus always using agents and raising performance 28.6% versus always using direct LLM inference.
GSM-SEM: Benchmark and Framework for Generating Semantically Variant Augmentations cs.CL · 2026-05-08 · unverdicted · none · ref 17 · 2 links · internal anchor
GSM-SEM is a reusable framework for creating semantically variant augmentations of math benchmarks like GSM8K that alter facts but preserve answers and difficulty, with evaluations showing LLM performance drops of up to 28% on the new variants.
Continuous Latent Diffusion Language Model cs.CL · 2026-05-07 · unverdicted · none · ref 38 · internal anchor
Cola DLM proposes a hierarchical latent diffusion model that learns a text-to-latent mapping, fits a global semantic prior in continuous space with a block-causal DiT, and performs conditional decoding, establishing latent prior modeling as an alternative to token-level autoregressive language model
IRC-Bench: Recognizing Entities from Contextual Cues in First-Person Reminiscences cs.CL · 2026-05-07 · unverdicted · none · ref 37 · internal anchor
IRC-Bench is a new dataset and evaluation framework for implicit entity recognition in reminiscence narratives, where entities must be inferred from non-local contextual cues across 1,994 transcripts linked to 12,337 WikiData entities.
Milestone-Guided Policy Learning for Long-Horizon Language Agents cs.CL · 2026-05-07 · unverdicted · none · ref 42 · internal anchor
BEACON uses milestone partitioning, temporal reward shaping, and dual-scale advantage estimation to nearly double success rates on long-horizon ALFWorld tasks while raising effective sample use from 23.7% to 82%.
MultiBreak: A Scalable and Diverse Multi-turn Jailbreak Benchmark for Evaluating LLM Safety cs.CL · 2026-05-03 · unverdicted · none · ref 37 · internal anchor
MultiBreak is a large diverse multi-turn jailbreak benchmark that achieves substantially higher attack success rates on LLMs than prior datasets and reveals topic-specific vulnerabilities in multi-turn settings.
LLMs Reading the Rhythms of Daily Life: Aligned Understanding for Behavior Prediction and Generation cs.CL · 2026-04-26 · unverdicted · none · ref 4 · internal anchor
BUA aligns LLMs to behavior data via sequence embeddings and a three-stage curriculum, outperforming prior methods on prediction and generation in two real-world datasets.
Optimal Question Selection from a Large Question Bank for Clinical Field Recovery in Conversational Psychiatric Intake cs.CL · 2026-04-23 · unverdicted · none · ref 7 · internal anchor
An LLM-guided adaptive policy outperforms fixed clinical intake forms and random questioning at recovering target information from synthetic psychiatric patients in 300 simulated sessions.
Using Machine Mental Imagery for Representing Common Ground in Situated Dialogue cs.CL · 2026-04-22 · unverdicted · none · ref 35 · internal anchor
Incremental visual scaffolding using multimodal models improves persistent common ground representation in situated dialogue by reducing representational blur compared to text-only approaches, with hybrid text-visual yielding best results on the IndiRef benchmark.
WebGen-R1: Incentivizing Large Language Models to Generate Functional and Aesthetic Websites with Reinforcement Learning cs.CL · 2026-04-22 · unverdicted · none · ref 16 · internal anchor
WebGen-R1 uses end-to-end RL with scaffold-driven generation and cascaded rewards for structure, function, and aesthetics to transform a 7B model into a generator of deployable multi-page websites that rivals much larger models.
Hard to Be Heard: Phoneme-Level ASR Analysis of Phonologically Complex, Low-Resource Endangered Languages cs.CL · 2026-04-20 · conditional · none · ref 8 · internal anchor
Phoneme-level analysis of ASR on Archi and Rutul shows data scarcity explains recognition errors better than phonological complexity, with language-specific adaptations improving wav2vec2 performance.
Semantic Density Effect (SDE): Maximizing Information Per Token Improves LLM Accuracy cs.CL · 2026-04-19 · unverdicted · none · ref 3 · internal anchor
Prompts with higher semantic density per token improve LLM accuracy by an average of 8.4 percentage points across five models and seven benchmarks, with no added tokens or latency.

GPT-4o System Card

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

mega hub controls

Recognition alignment

counterfactual ablation

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer