GPT-4o System Card

GPT-4o System Card · 2024 · cs.CL · arXiv 2410.21276

Mixed citation behavior. Most common role is background (53%).

821 Pith papers citing it

Background 53% of classified citations

abstract

GPT-4o is an autoregressive omni model that accepts as input any combination of text, audio, image, and video, and generates any combination of text, audio, and image outputs. It's trained end-to-end across text, vision, and audio, meaning all inputs and outputs are processed by the same neural network. GPT-4o can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time in conversation. It matches GPT-4 Turbo performance on text in English and code, with significant improvement on text in non-English languages, while also being much faster and 50% cheaper in the API. GPT-4o is especially better at vision and audio understanding compared to existing models. In line with our commitment to building AI safely and consistent with our voluntary commitments to the White House, we are sharing the GPT-4o System Card, which includes our Preparedness Framework evaluations. In this System Card, we provide a detailed look at GPT-4o's capabilities, limitations, and safety evaluations across multiple categories, focusing on speech-to-speech while also evaluating text and image capabilities, and measures we've implemented to ensure the model is safe and aligned. We also include third-party assessments on dangerous capabilities, as well as discussion of potential societal impacts of GPT-4o's text and vision capabilities.

citation-role summary

background 97 · baseline 51 · method 23 · dataset 3

citation-polarity summary

background 93 · baseline 51 · use method 22 · unclear 4 · use dataset 3 · support 1
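
The 53% background share quoted near the top of the page can be checked against the counts in the two summaries above. The sketch below is only an illustration: the `share` helper and the dictionary names are hypothetical, and it assumes the percentage is a label's count divided by the total number of classified citations. The page does not say how the figure is computed.

```python
# Counts copied from the citation-role and citation-polarity summaries.
role_counts = {"background": 97, "baseline": 51, "method": 23, "dataset": 3}
polarity_counts = {
    "background": 93, "baseline": 51, "use method": 22,
    "unclear": 4, "use dataset": 3, "support": 1,
}


def share(counts, label):
    """Return the label's percentage of all classified citations in counts.

    Assumption: share = count / total. The page does not state its formula.
    """
    return 100 * counts[label] / sum(counts.values())


role_pct = share(role_counts, "background")
polarity_pct = share(polarity_counts, "background")

# Both summaries classify the same 174 citations.
print(sum(role_counts.values()), sum(polarity_counts.values()))
print(f"role: {role_pct:.1f}%  polarity: {polarity_pct:.1f}%")
```

Under that assumption both summaries total 174 classified citations, and the background share comes out to about 55.7% by role and 53.4% by polarity. The quoted 53% therefore appears to match the polarity counts rather than the role counts, although the page does not confirm this.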

representative citing papers

Where to Look: Can Foundation Models Reach a Target Viewpoint Through Active Exploration?

cs.CV · 2026-05-31 · accept · novelty 8.0

Introduces the TVR active viewpoint-matching task and TVRBench indoor simulation benchmark, where foundation models start at low single-digit success rates but reach 51.4% after visual-action SFT and multi-turn GRPO post-training.

VideoFDB: Evaluating Full-Duplex Vision-Speech Capabilities in Conversational Agents

cs.CV · 2026-05-28 · unverdicted · novelty 8.0

VideoFDB is a new benchmark and LM-as-judge framework for evaluating full-duplex audio-visual-to-audio-visual conversational agents on nonverbal dynamics from real video calls.

Knowledge Poisoning Attacks on Medical Multi-Modal Retrieval-Augmented Generation

cs.CR · 2026-05-11 · unverdicted · novelty 8.0

M³Att poisons medical multimodal RAG by pairing covert textual misinformation with query-agnostic visual perturbations that increase retrieval of the bad content, causing LLMs to generate clinically plausible but incorrect responses.

From Mirage to Grounding: Towards Reliable Multimodal Circuit-to-Verilog Code Generation

cs.SE · 2026-04-30 · unverdicted · novelty 8.0

MLLMs exhibit a Mirage effect by bypassing circuit diagrams in favor of header semantics for Verilog generation; VeriGround with identifier anonymization and D-ORPO training reaches 46% Functional Pass@1 while refusing blank images at >92%.

CHASM: Unveiling Covert Advertisements on Chinese Social Media

cs.LG · 2026-04-22 · unverdicted · novelty 8.0

CHASM is a new benchmark dataset showing that existing multimodal large language models fail to reliably detect covert advertisements on Chinese social media even after fine-tuning.

HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models

cs.SD · 2026-04-21 · unverdicted · novelty 8.0

HalluAudio is the first large-scale benchmark spanning speech, environmental sound, and music that uses human-verified QA pairs, adversarial prompts, and mixed-audio tests to measure hallucinations in large audio-language models.

EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations

cs.CV · 2026-04-20 · unverdicted · novelty 8.0

EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.

HarmfulSkillBench: How Do Harmful Skills Weaponize Your Agents?

cs.CR · 2026-04-16 · unverdicted · novelty 8.0

Harmful skills in open agent ecosystems raise average harm scores from 0.27 to 0.76 across six LLMs by lowering refusal rates when tasks are presented via pre-installed skills.

ReConText3D: Replay-based Continual Text-to-3D Generation

cs.CV · 2026-04-15 · conditional · novelty 8.0

ReConText3D is the first replay-memory framework for continual text-to-3D generation that prevents catastrophic forgetting on new textual categories while preserving quality on previously seen classes.

MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark

cs.CV · 2026-04-12 · unverdicted · novelty 8.0 · 2 refs

MMRareBench provides 1,756 QA pairs and 7,958 images from PMC rare-disease cases to evaluate 23 MLLMs, revealing low treatment-planning scores and medical models underperforming general models on multi-image tasks due to capacity dilution.

DialBGM: A Benchmark for Background Music Recommendation from Everyday Multi-Turn Dialogues

cs.AI · 2026-04-09 · unverdicted · novelty 8.0

DialBGM is a new benchmark dataset revealing that existing AI models fall far short of human performance when recommending fitting background music for open-domain conversations.

EgoSound: Benchmarking Sound Understanding in Egocentric Videos

cs.CV · 2026-02-15 · unverdicted · novelty 8.0

EgoSound is a new benchmark with 7315 QA pairs across seven tasks to evaluate egocentric sound understanding in multimodal large language models.

VLRS-Bench: A Vision-Language Reasoning Benchmark for Remote Sensing

cs.CV · 2026-02-04 · unverdicted · novelty 8.0

VLRS-Bench is the first benchmark dedicated to complex vision-language reasoning in remote sensing, with 2000 QA pairs across 14 tasks in cognition, decision, and prediction dimensions.

SwissGov-RSD: A Human-annotated, Cross-lingual Benchmark for Token-level Recognition of Semantic Differences Between Related Documents

cs.CL · 2025-12-08 · accept · novelty 8.0

SwissGov-RSD is the first naturalistic cross-lingual document-level benchmark with human token-level semantic difference annotations, on which both LLMs and encoders show a large performance gap relative to simpler settings.

Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark

cs.AI · 2025-09-30 · unverdicted · novelty 8.0

CritPt benchmark shows state-of-the-art LLMs reach only 5.7% average accuracy on full-scale unpublished physics research tasks, rising to about 10% with coding tools.

Flow-GRPO: Training Flow Matching Models via Online RL

cs.CV · 2025-05-08 · unverdicted · novelty 8.0

Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

cs.CV · 2024-09-25 · accept · novelty 8.0

Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

LiveBench: A Challenging, Contamination-Limited LLM Benchmark

cs.CL · 2024-06-27 · unverdicted · novelty 8.0

LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.

OmniCoT: A Benchmark for Global and Multi-Step Panoramic Reasoning

cs.CV · 2026-06-29 · unverdicted · novelty 7.0

OmniCoT is a new panoramic reasoning benchmark with 6.7K eval, 1K real, and 14.3K training examples plus a two-stage SFT+GRPO training method to enforce global 360-degree consistency.

MuseBench: Benchmarking Intent-Level Audiovisual Arts Understanding in MLLMs

cs.CV · 2026-06-29 · unverdicted · novelty 7.0

MuseBench shows state-of-the-art MLLMs achieve only 48.29% accuracy on intent-level audiovisual arts understanding versus 87.18% for human experts.

A Diagnostic Framework and Multi-Evaluator Audit of Evaluator-Driven Preference Dynamics in Self-Adapting LLM Agents

cs.LG · 2026-06-29 · unverdicted · novelty 7.0

A diagnostic framework called EPC reveals that proprietary LLM evaluators can exhibit large preference shifts between versions, as evidenced by a GPT-4o May-to-June drift that inverted study conclusions, rendering single-snapshot evaluations unreliable.

GigaSpeechBench: A Real-World Multilingual Speech-to-Text Benchmark

eess.AS · 2026-06-27 · unverdicted · novelty 7.0

GigaSpeechBench is a new 680-hour in-the-wild multilingual ASR/AST benchmark with five modules for low-resource languages, Chinese dialects, English accents, domain terminology, and age-varied speech, showing model performance drops.

HumanMoveVQA: Can Video MLLMs reason about human movement in videos?

cs.CV · 2026-06-26 · unverdicted · novelty 7.0 · 2 refs

HumanMoveVQA is a new benchmark that generates 10K+ QA pairs from 3D-lifted video tracks to evaluate video MLLMs on global human trajectory and orientation reasoning.

PhyEditBench: A Real-World Multi-Stage Benchmark for Physics-Aware Image Editing

cs.CV · 2026-06-25 · unverdicted · novelty 7.0

PhyEditBench is a new benchmark for physics-aware image editing with real and synthetic instances plus a training-free PhyWorld baseline that uses test-time scaling to outperform SOTA models.

citing papers explorer

Showing 50 of 821 citing papers.

UniD³: A Knowledge Graph-Enhanced RAG Framework for Drug-Disease Discovery and Reasoning cs.CL · 2026-05-31 · unverdicted · none · ref 40 · internal anchor
UniD³ applies KG-RAG with Llama 3.3-70B to build six knowledge graphs and generate large validated datasets for drug-disease matching, effectiveness assessment, and target analysis from biomedical literature.
EvoGens: A Population-Based Heuristic Search Framework for Scientific Idea Generation cs.CL · 2026-05-29 · unverdicted · none · ref 49 · internal anchor
EvoGens uses rank-based mutation, semantic-aware crossover, and lightweight evaluation to evolve populations of LLM-generated scientific ideas, boosting novelty and diversity metrics.
GUI-C²: Coarse-to-Fine GUI Grounding via Difficulty-Aware Reinforcement Learning cs.CV · 2026-05-29 · unverdicted · none · ref 25 · internal anchor
GUI-C² pairs a difficulty-scoring data pipeline with an area-gated coarse-to-fine RL mechanism to improve GUI grounding accuracy and training stability.
Neuron-Level Interventions for Gendered and Gender-Neutral Generation in Language Models cs.CL · 2026-05-29 · unverdicted · none · ref 4 · internal anchor
Proposes a neuron-level intervention method to locate and control gender-specific neurons across feminine, masculine, and neutral categories in LMs, achieving precise steering with less leakage than prior approaches.
Learning Design Skills as Memory Policies for Agentic Photonic Inverse Design cs.CL · 2026-05-28 · unverdicted · none · ref 4 · internal anchor
SkillPCF is a closed-loop agent framework with a physics-guided memory skill bank, reinforcement-learned skill selection, and simulator-grounded evolution that improves design quality and efficiency for photonic crystal fiber inverse design under limited simulation budgets.
Bosses, Kings, and the Commons: Cooperation Under Power Asymmetry in LLM Societies cs.CL · 2026-05-27 · unverdicted · none · ref 5 · internal anchor
Asymmetric power in LLM multi-agent commons simulations causes up to 87.3% lower survival rates than symmetric settings across eleven models.
Mobile-Aptus: Confidence-Driven Proactive and Robust Interaction in MLLM-based Mobile-Using Agents cs.CL · 2026-05-27 · unverdicted · none · ref 48 · internal anchor
Mobile-Aptus uses supervised fine-tuning followed by semantic similarity retrieval and direct preference optimization to calibrate confidence scores in mobile agents, yielding over 17% average task success improvement on four benchmarks.
GEM: Generative Supervision Helps Embodied Intelligence cs.CV · 2026-05-27 · unverdicted · none · ref 25 · internal anchor
GEM adds generative depth supervision to VLM pre-training and reports improved results on embodied benchmarks plus real-world robot execution.
Escape the Language Prior: Mitigating Late-Stage Modality Collapse in Audio Reasoning via Modality-Aware Policy Optimization cs.CL · 2026-05-26 · unverdicted · none · ref 13 · internal anchor
MAPO is a dual-branch RL framework using modality relevance masks from cross-modal differential entropy and auxiliary attention losses to reduce late-stage modality collapse in audio reasoning models and improve benchmark results.
Grounding Text Embeddings in Stakeholder Associations cs.CL · 2026-05-26 · unverdicted · none · ref 43 · internal anchor
The Stakeholder Grounding Exercise shows neural text embeddings are 19-26pp less reliable than human experts at capturing semantic distinctions, with misalignment strongly correlated to poorer clustering performance (ρ=0.9), replicated across Danish policy and US AI domains.
Jailbreak susceptibility prediction and mitigation via the behavioral geometry of models cs.CR · 2026-05-26 · unverdicted · none · ref 19 · internal anchor
Behavioral geometry of model populations enables high-accuracy jailbreak susceptibility prediction and defense transfer with 98% fewer evaluations.
CP-Agent: A Calibrated Risk-Controlled Agent for Feedback-Driven Competitive Programming cs.CL · 2026-05-23 · unverdicted · none · ref 22 · internal anchor
CP-Agent improves LLM competitive programming performance via calibrated feedback mechanisms that target false-admission risk, evidence against bad programs, and success hazard.
CVSearch: Empowering Multimodal LLMs with Cognitive Visual Search for High-Resolution Image Perception cs.CV · 2026-05-22 · unverdicted · none · ref 5 · internal anchor
CVSearch proposes an Assess-then-Search workflow combining expert-assisted search with Semantic Guided Adaptive Patching and Dynamic Bottom-Up Search to improve efficiency and accuracy on high-resolution image tasks for MLLMs.
CHASD: Language Increment-Calibrated Contrastive Decoding against Hallucination in LVLMs cs.CV · 2026-05-22 · unverdicted · none · ref 13 · internal anchor
CHASD is an inference-time framework that gates contrastive decoding via an uncertainty threshold and constructs negative branches through attention-guided perturbations of salient visual tokens to mitigate hallucinations in LVLMs.
Swift Sampling: Selecting Temporal Surprises via Taylor Series cs.CV · 2026-05-21 · unverdicted · none · ref 75 · internal anchor
Swift Sampling is a training-free frame selection method that uses Taylor expansions on video latent trajectories to pick temporally surprising frames, outperforming uniform sampling on long-video QA tasks.
SciCore-Mol: Augmenting Large Language Models with Pluggable Molecular Cognition Modules cs.AI · 2026-05-21 · unverdicted · none · ref 16 · internal anchor
SciCore-Mol augments LLMs with three integrated modules for molecular perception, latent diffusion generation, and reaction reasoning, claiming an 8B open model competes with or exceeds proprietary systems on chemical tasks.
MuKV: Multi-Grained KV Cache Compression for Long Streaming Video Question-Answering cs.CV · 2026-05-21 · conditional · none · ref 16 · internal anchor
MuKV adds multi-grained KV cache compression at patch-frame-segment levels plus semi-hierarchical retrieval to raise accuracy and cut memory in long video question-answering.
Active Evidence-Seeking and Diagnostic Reasoning in Large Language Models for Clinical Decision Support cs.AI · 2026-05-21 · unverdicted · none · ref 34 · internal anchor
Multi-turn evidence seeking reduces LLM diagnostic accuracy by 12.75% and supporting-evidence quality by 24.36% versus full-context evaluation in a new OSCE-inspired benchmark across 468 cases and 15 models.
SpecHop: Continuous Speculation for Accelerating Multi-Hop Retrieval Agents cs.CL · 2026-05-21 · unverdicted · none · ref 7 · internal anchor
SpecHop accelerates multi-hop LLM tool use via continuous multi-threaded speculation with asynchronous verification, approaching oracle latency gains and reducing latency up to 40% on retrieval tasks.
MONET: A Massive, Open, Non-redundant and Enriched Text-to-image dataset cs.CV · 2026-05-20 · unverdicted · none · ref 39 · internal anchor
MONET is an open 104.9M image-text pair dataset created via safety filtering, deduplication, and multi-VLM recaptioning from 2.9B raw pairs, validated by training a competitive 4B-parameter latent diffusion model.
Frequency-Domain Regularized Adversarial Alignment for Transferable Attacks against Closed-Source MLLMs cs.CR · 2026-05-20 · unverdicted · none · ref 20 · internal anchor
FRA-Attack uses high-pass DCT feature alignment and frequency-domain gradient regularization to boost adversarial transferability across 15 MLLMs from 7 vendors.
CaptchaMind: Training CAPTCHA Solvers via Reinforcement Learning with Explicit Reasoning Supervision cs.CV · 2026-05-19 · unverdicted · none · ref 23 · internal anchor
Presents CaptchaBench benchmark and CaptchaMind RL solver achieving 82.9% success on benchmark tasks and 71% on real-world CAPTCHAs via explicit reasoning process supervision.
RE-VLM: Event-Augmented Vision-Language Model for Scene Understanding cs.CV · 2026-05-19 · unverdicted · none · ref 13 · 2 links · internal anchor
RE-VLM fuses RGB and event data in a dual-stream VLM with a graph-based pipeline for generating training captions and QA pairs, plus two new datasets, showing gains over RGB-only and event-only baselines especially in challenging conditions.
Heterogeneity-Aware Dataset Scheduling for Efficient Audio Large Language Model Training cs.SD · 2026-05-18 · unverdicted · none · ref 21 · internal anchor
GST uses gradient-based affinity metrics to form dataset groups and applies progressive scheduling, achieving 30-40% faster convergence than uniform mixture training on 14 AudioQA datasets while matching or exceeding performance.
Traditional statistical representations outperform generative AI in identifying expert peer reviewers cs.IR · 2026-05-18 · unverdicted · none · ref 101 · internal anchor
TF-IDF identifies labeled experts in the top 25 recommendations 79.5% of the time versus 51.5% for GPT-4o mini on an astronomy observatory dataset.
WinTok: A Win-Win Hybrid Tokenizer via Decomposing Visual Understanding and Generation with Transferable Tokens cs.CV · 2026-05-18 · unverdicted · none · ref 39 · internal anchor
WinTok is a hybrid visual tokenizer that supplements pixel tokens with learnable semantic tokens distilled asymmetrically from foundation models to improve reconstruction, understanding, and generation.
TaskGround: Structured Executable Task Inference for Full-Scene Household Reasoning cs.AI · 2026-05-18 · unverdicted · none · ref 16 · internal anchor
TaskGround introduces a Ground-Infer-Execute framework for full-scene household reasoning that improves success rates on the FullHome benchmark and enables compact models to match larger ones at up to 18x lower token cost.
AtlasVA: Self-Evolving Visual Skill Memory for Teacher-Free VLM Agents cs.CV · 2026-05-18 · unverdicted · none · ref 15 · internal anchor
AtlasVA organizes VLM agent memory into spatial heatmaps, visual exemplars, and symbolic skills, evolving atlases from trajectories to act as potential-based shaping rewards in teacher-free reinforcement learning.
ECG-WM: A Physiology-Informed ECG World Model for Clinical Intervention Simulation cs.AI · 2026-05-17 · unverdicted · none · ref 7 · internal anchor
ECG-WM combines ODE physiological priors with latent diffusion models to generate intervention-conditioned ECG trajectories and uses diffusion stochasticity for uncertainty-aware clinical risk assessment.
Reasoning Before Diagnosis: Physician-Inspired Structured Thinking for ECG Classification cs.AI · 2026-05-17 · unverdicted · none · ref 9 · internal anchor
CardioThink applies structured clinical reasoning stages and Structured Set Policy Optimization (SSPO) to ECG classification, yielding higher diagnostic accuracy and more interpretable rationales than direct prediction baselines on multiple benchmarks.
TOBench: A Task-Oriented Omni-Modal Benchmark for Real-World Tool-Using Agents cs.AI · 2026-05-16 · unverdicted · none · ref 2 · internal anchor
MM-ToolBench introduces 100 closed-loop multimodal tasks across two domains with 27 MCP servers and 324 tools, where agents must execute, inspect artifacts, and revise before final output.
SE-GA: Memory-Augmented Self-Evolution for GUI Agents cs.LG · 2026-05-16 · unverdicted · none · ref 17 · internal anchor
SE-GA combines Test-Time Memory Extension for dynamic context retrieval with Memory-Augmented Self-Evolution training to reach 89.0% on ScreenSpot and 75.8% on AndroidControl-High.
Toward Natural and Companionable Virtual Agents via Cross-Temporal Emotional Modeling cs.HC · 2026-05-15 · unverdicted · none · ref 31 · internal anchor
CTEM framework links behavioral history to evolving emotional states with user feedback updates, instantiated as Auri agent and tested in a 21-day study showing gains in naturalness, coherence, and emotional harmony.
Can We Trust AI-Inferred User States. A Psychometric Framework for Validating the Reliability of Users States Classification by LLMs in Operational Environments cs.AI · 2026-05-15 · unverdicted · none · ref 10 · internal anchor
Empirical replication across three LLMs shows only 31 of 213 user-state metrics meet reliability criteria for individual scores, supporting a validation framework for responsible AI in adaptive environments.
Not All RAGs Are Created Equal: A Component-Wise Empirical Study for Software Engineering Tasks cs.SE · 2026-05-14 · unverdicted · none · ref 34 · internal anchor
Retriever-side choices, particularly the retrieval algorithm, exert more influence on RAG performance than generator selection across code generation, summarization, and repair tasks.
SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture cs.CV · 2026-05-12 · unverdicted · none · ref 56 · internal anchor
SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.
Curriculum Learning-Guided Progressive Distillation in Large Language Models cs.LG · 2026-05-11 · unverdicted · none · ref 12 · internal anchor
CLPD improves LLM distillation for reasoning by combining explicit data curriculum with progressive teacher scheduling of increasing capacity.
Control Charts for Multi-agent Systems cs.MA · 2026-05-11 · unverdicted · none · ref 18 · internal anchor
Adaptive control charts can monitor learning multi-agent systems but are vulnerable to gradual adversarial defection, revealing a fundamental tradeoff between allowing agents to learn and maintaining security against adversaries.
SciVQR: A Multidisciplinary Multimodal Benchmark for Advanced Scientific Reasoning Evaluation cs.CV · 2026-05-11 · unverdicted · none · ref 68 · 2 links · internal anchor
SciVQR is a new multimodal benchmark covering 54 scientific subfields that evaluates MLLMs on visual comprehension and multi-step reasoning, revealing significant limitations in leading models.
Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse cs.CV · 2026-05-11 · unverdicted · none · ref 124 · 2 links · internal anchor
The paper organizes research on generalist game AI into Dataset, Model, Harness, and Benchmark pillars and charts a five-level progression from single-game mastery to agents that create and live inside game multiverses.
FLAME: Adaptive Mixture-of-Experts for Continual Multimodal Multi-Task Learning cs.LG · 2026-05-10 · unverdicted · none · ref 23 · internal anchor
FLAME is an MoE architecture using modality-specific routers and low-rank compression of expert knowledge to support efficient continual multimodal multi-task learning while reducing catastrophic forgetting.
Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models cs.AI · 2026-05-08 · unverdicted · none · ref 20 · internal anchor
Mid-training LLMs on self-generated diverse reasoning paths improves subsequent RL performance on mathematical benchmarks and OOD tasks.
MiA-Signature: Approximating Global Activation for Long-Context Understanding cs.CL · 2026-05-07 · unverdicted · none · ref 12 · internal anchor
MiA-Signature approximates the global activation state induced by a query via submodular concept selection to enable tractable long-context understanding in LLMs.
VISD: Enhancing Video Reasoning via Structured Self-Distillation cs.CV · 2026-05-07 · unverdicted · none · ref 19 · 4 links · internal anchor
VISD proposes structured self-distillation with a multi-dimensional judge model and direction-magnitude decoupling to improve token-level credit assignment and convergence speed in VideoLLM reasoning training.
Uncertainty-Aware Exploratory Direct Preference Optimization for Multimodal Large Language Models cs.LG · 2026-05-06 · unverdicted · none · ref 13 · internal anchor
UE-DPO quantifies epistemic uncertainty from grounding failures to direct more learning pressure on hard visual tokens in preferred samples while easing penalties on dispreferred ones.
From Where Things Are to What They Are For: Benchmarking Spatial-Functional Intelligence in Multimodal LLMs cs.CV · 2026-05-04 · unverdicted · none · ref 29 · internal anchor
SFI-Bench shows current multimodal LLMs struggle to integrate spatial memory with functional reasoning and external knowledge in video tasks.
Omni-Fake: Benchmarking Unified Multimodal Social Media Deepfake Detection cs.CV · 2026-05-02 · unverdicted · none · ref 41 · internal anchor
Omni-Fake delivers a unified multimodal deepfake benchmark dataset and RL-driven detector that reports gains in accuracy, cross-modal generalization, and explainability over prior baselines.
Reasoning emerges from constrained inference manifolds in large language models cs.LG · 2026-05-02 · unverdicted · none · ref 4 · internal anchor
Reasoning in LLMs emerges from inference dynamics forming constrained low-dimensional manifolds that preserve non-degenerate information volume, rather than from compression alone.
RunAgent: Interpreting Natural-Language Plans with Constraint-Guided Execution cs.LG · 2026-05-01 · unverdicted · none · ref 14 · internal anchor
RunAgent improves LLM reliability on structured plans by deriving constraints on the fly, using an agentic language with control flow, and dynamically selecting reasoning modes, outperforming baselines on Natural-plan and SciBench.
Scaling Video Understanding via Compact Latent Multi-Agent Collaboration cs.CV · 2026-05-01 · unverdicted · none · ref 13 · internal anchor
MACF decouples agent perception budgets from overall video length using latent token collaboration to scale video understanding in MLLMs beyond current limits.
