super hub Mixed citations

GPT-4o System Card

author=, Gpt-4o system card · 2024 · cs.CL · arXiv 2410.21276

Mixed citation behavior. Most common role is background (53%).

824 Pith papers citing it

Background 53% of classified citations

open full Pith review browse 824 citing papers more from author= arXiv PDF

abstract

GPT-4o is an autoregressive omni model that accepts as input any combination of text, audio, image, and video, and generates any combination of text, audio, and image outputs. It's trained end-to-end across text, vision, and audio, meaning all inputs and outputs are processed by the same neural network. GPT-4o can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time in conversation. It matches GPT-4 Turbo performance on text in English and code, with significant improvement on text in non-English languages, while also being much faster and 50\% cheaper in the API. GPT-4o is especially better at vision and audio understanding compared to existing models. In line with our commitment to building AI safely and consistent with our voluntary commitments to the White House, we are sharing the GPT-4o System Card, which includes our Preparedness Framework evaluations. In this System Card, we provide a detailed look at GPT-4o's capabilities, limitations, and safety evaluations across multiple categories, focusing on speech-to-speech while also evaluating text and image capabilities, and measures we've implemented to ensure the model is safe and aligned. We also include third-party assessments on dangerous capabilities, as well as discussion of potential societal impacts of GPT-4o's text and vision capabilities.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 97 baseline 51 method 23 dataset 3

citation-polarity summary

background 93 baseline 51 use method 22 unclear 4 use dataset 3 support 1

claims ledger

abstract GPT-4o is an autoregressive omni model that accepts as input any combination of text, audio, image, and video, and generates any combination of text, audio, and image outputs. It's trained end-to-end across text, vision, and audio, meaning all inputs and outputs are processed by the same neural network. GPT-4o can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time in conversation. It matches GPT-4 Turbo performance on text in English and code, with significant improvement on text in non-English languages, while

authors

author= Gpt-4o system card

co-cited works

representative citing papers

Where to Look: Can Foundation Models Reach a Target Viewpoint Through Active Exploration?

cs.CV · 2026-05-31 · accept · novelty 8.0

Introduces the TVR active viewpoint-matching task and TVRBench indoor simulation benchmark, where foundation models start at low single-digit success rates but reach 51.4% after visual-action SFT and multi-turn GRPO post-training.

VideoFDB: Evaluating Full-Duplex Vision-Speech Capabilities in Conversational Agents

cs.CV · 2026-05-28 · unverdicted · novelty 8.0

VideoFDB is a new benchmark and LM-as-judge framework for evaluating full-duplex audio-visual-to-audio-visual conversational agents on nonverbal dynamics from real video calls.

Knowledge Poisoning Attacks on Medical Multi-Modal Retrieval-Augmented Generation

cs.CR · 2026-05-11 · unverdicted · novelty 8.0

M³Att poisons medical multimodal RAG by pairing covert textual misinformation with query-agnostic visual perturbations that increase retrieval of the bad content, causing LLMs to generate clinically plausible but incorrect responses.

From Mirage to Grounding: Towards Reliable Multimodal Circuit-to-Verilog Code Generation

cs.SE · 2026-04-30 · unverdicted · novelty 8.0

MLLMs exhibit a Mirage effect by bypassing circuit diagrams in favor of header semantics for Verilog generation; VeriGround with identifier anonymization and D-ORPO training reaches 46% Functional Pass@1 while refusing blank images at >92%.

CHASM: Unveiling Covert Advertisements on Chinese Social Media

cs.LG · 2026-04-22 · unverdicted · novelty 8.0

CHASM is a new benchmark dataset showing that existing multimodal large language models fail to reliably detect covert advertisements on Chinese social media even after fine-tuning.

HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models

cs.SD · 2026-04-21 · unverdicted · novelty 8.0

HalluAudio is the first large-scale benchmark spanning speech, environmental sound, and music that uses human-verified QA pairs, adversarial prompts, and mixed-audio tests to measure hallucinations in large audio-language models.

EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations

cs.CV · 2026-04-20 · unverdicted · novelty 8.0

EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.

HarmfulSkillBench: How Do Harmful Skills Weaponize Your Agents?

cs.CR · 2026-04-16 · unverdicted · novelty 8.0

Harmful skills in open agent ecosystems raise average harm scores from 0.27 to 0.76 across six LLMs by lowering refusal rates when tasks are presented via pre-installed skills.

ReConText3D: Replay-based Continual Text-to-3D Generation

cs.CV · 2026-04-15 · conditional · novelty 8.0

ReConText3D is the first replay-memory framework for continual text-to-3D generation that prevents catastrophic forgetting on new textual categories while preserving quality on previously seen classes.

MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark

cs.CV · 2026-04-12 · unverdicted · novelty 8.0 · 2 refs

MMRareBench provides 1,756 QA pairs and 7,958 images from PMC rare-disease cases to evaluate 23 MLLMs, revealing low treatment-planning scores and medical models underperforming general models on multi-image tasks due to capacity dilution.

DialBGM: A Benchmark for Background Music Recommendation from Everyday Multi-Turn Dialogues

cs.AI · 2026-04-09 · unverdicted · novelty 8.0

DialBGM is a new benchmark dataset revealing that existing AI models fall far short of human performance when recommending fitting background music for open-domain conversations.

EgoSound: Benchmarking Sound Understanding in Egocentric Videos

cs.CV · 2026-02-15 · unverdicted · novelty 8.0

EgoSound is a new benchmark with 7315 QA pairs across seven tasks to evaluate egocentric sound understanding in multimodal large language models.

VLRS-Bench: A Vision-Language Reasoning Benchmark for Remote Sensing

cs.CV · 2026-02-04 · unverdicted · novelty 8.0

VLRS-Bench is the first benchmark dedicated to complex vision-language reasoning in remote sensing, with 2000 QA pairs across 14 tasks in cognition, decision, and prediction dimensions.

SwissGov-RSD: A Human-annotated, Cross-lingual Benchmark for Token-level Recognition of Semantic Differences Between Related Documents

cs.CL · 2025-12-08 · accept · novelty 8.0

SwissGov-RSD is the first naturalistic cross-lingual document-level benchmark with human token-level semantic difference annotations, on which both LLMs and encoders show a large performance gap relative to simpler settings.

Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark

cs.AI · 2025-09-30 · unverdicted · novelty 8.0

CritPt benchmark shows state-of-the-art LLMs reach only 5.7% average accuracy on full-scale unpublished physics research tasks, rising to about 10% with coding tools.

Flow-GRPO: Training Flow Matching Models via Online RL

cs.CV · 2025-05-08 · unverdicted · novelty 8.0

Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

cs.CV · 2024-09-25 · accept · novelty 8.0

Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

LiveBench: A Challenging, Contamination-Limited LLM Benchmark

cs.CL · 2024-06-27 · unverdicted · novelty 8.0

LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.

OmniCoT: A Benchmark for Global and Multi-Step Panoramic Reasoning

cs.CV · 2026-06-29 · unverdicted · novelty 7.0

OmniCoT is a new panoramic reasoning benchmark with 6.7K eval, 1K real, and 14.3K training examples plus a two-stage SFT+GRPO training method to enforce global 360-degree consistency.

MuseBench: Benchmarking Intent-Level Audiovisual Arts Understanding in MLLMs

cs.CV · 2026-06-29 · unverdicted · novelty 7.0

MuseBench shows state-of-the-art MLLMs achieve only 48.29% accuracy on intent-level audiovisual arts understanding versus 87.18% for human experts.

A Diagnostic Framework and Multi-Evaluator Audit of Evaluator-Driven Preference Dynamics in Self-Adapting LLM Agents

cs.LG · 2026-06-29 · unverdicted · novelty 7.0

A diagnostic framework called EPC reveals that proprietary LLM evaluators can exhibit large preference shifts between versions, as evidenced by a GPT-4o May-to-June drift that inverted study conclusions, rendering single-snapshot evaluations unreliable.

GigaSpeechBench: A Real-World Multilingual Speech-to-Text Benchmark

eess.AS · 2026-06-27 · unverdicted · novelty 7.0

GigaSpeechBench is a new 680-hour in-the-wild multilingual ASR/AST benchmark with five modules for low-resource languages, Chinese dialects, English accents, domain terminology, and age-varied speech, showing model performance drops.

HumanMoveVQA: Can Video MLLMs reason about human movement in videos?

cs.CV · 2026-06-26 · unverdicted · novelty 7.0 · 2 refs

HumanMoveVQA is a new benchmark that generates 10K+ QA pairs from 3D-lifted video tracks to evaluate video MLLMs on global human trajectory and orientation reasoning.

PhyEditBench: A Real-World Multi-Stage Benchmark for Physics-Aware Image Editing

cs.CV · 2026-06-25 · unverdicted · novelty 7.0

PhyEditBench is a new benchmark for physics-aware image editing with real and synthetic instances plus a training-free PhyWorld baseline that uses test-time scaling to outperform SOTA models.

citing papers explorer

Showing 50 of 824 citing papers.

Multi-modal user interface control detection using cross-attention cs.CV · 2026-04-08 · unverdicted · none · ref 41 · internal anchor
A multi-modal YOLOv5 model fuses GPT text descriptions via cross-attention and convolutional fusion to achieve better UI control detection than baseline on 16,000+ screenshots.
From Business Events to Auditable Decisions: Ontology-Governed Graph Simulation for Enterprise AI cs.AI · 2026-04-08 · unverdicted · none · ref 8 · internal anchor
LOM-action uses business events to drive ontology-governed graph simulations that generate auditable decisions, reporting 93.82% accuracy and 98.74% tool-chain F1 versus 24-36% F1 for frontier LLMs.
Identifying Influential N-grams in Confidence Calibration via Regression Analysis cs.CL · 2026-04-07 · unverdicted · none · ref 3 · internal anchor
Regression identifies specific n-grams in LLM reasoning that drive overconfidence, enabling calibration via their suppression without performance loss.
AI and Collective Decisions: Strengthening Legitimacy and Losers' Consent cs.HC · 2026-04-07 · unverdicted · none · ref 37 · internal anchor
An AI system that elicits personal experiences and visualizes policy support increased perceived legitimacy and perspective-taking in collective decisions despite unfavorable outcomes.
ID-Sim: An Identity-Focused Similarity Metric cs.CV · 2026-04-06 · unverdicted · none · ref 30 · internal anchor
ID-Sim is a new similarity metric that aims to capture human selective sensitivity to identities by training on curated real and generative synthetic data and validating against human annotations on recognition, retrieval, and generative tasks.
Breakdowns in Conversational AI: Interactional Failures in Emotionally and Ethically Sensitive Contexts cs.CL · 2026-04-03 · unverdicted · none · ref 12 · internal anchor
Mainstream conversational models show escalating affective misalignments and ethical guidance failures during staged emotional trajectories, organized into a taxonomy of interactional breakdowns.
Evaluating AI-Generated Images of Cultural Artifacts with Community-Informed Rubrics cs.CY · 2026-04-02 · unverdicted · none · ref 87 · 2 links · internal anchor
Case studies with blind UK residents and people from Kerala and Tamil Nadu demonstrate that community input at the systematization stage produces culturally grounded definitions of appropriateness for text-to-image model outputs.
SUMMIR: A Hallucination-Aware Framework for Ranking Sports Insights from LLMs cs.IR · 2026-03-30 · conditional · none · ref 20 · internal anchor
SUMMIR is a multimetric ranking model that orders LLM-generated sports insights by importance while incorporating hallucination detection to improve factual reliability across cricket, soccer, basketball, and baseball articles.
Learning to Focus and Precise Cropping: A Reinforcement Learning Framework with Information Gaps and Grounding Loss for MLLMs cs.CV · 2026-03-29 · unverdicted · none · ref 11 · internal anchor
A two-stage RL method with information gaps and grounding loss trains MLLMs to focus on and precisely crop relevant image regions, yielding SOTA results on high-resolution VQA benchmarks.
Contrastive Reasoning Alignment: Reinforcement Learning from Hidden Representations cs.AI · 2026-03-18 · unverdicted · none · ref 13 · internal anchor
CRAFT uses contrastive representation learning and RL on hidden states to align reasoning models for improved safety against jailbreaks, reporting 79% and 87.7% gains over base models.
Evaluating the Practical Effectiveness of LLM-Driven Index Tuning with Microsoft Database Tuning Advisor cs.DB · 2026-03-10 · unverdicted · none · ref 32 · internal anchor
LLMs can outperform DTA on index recommendations for some workloads but remain less reliable with practical adoption challenges.
Towards Explainable Industrial Anomaly Detection via Knowledge-Guided Latent Reasoning cs.CV · 2026-02-10 · unverdicted · none · ref 10 · internal anchor
Reason-IAD improves explainable industrial anomaly detection by combining retrieval-augmented category knowledge with entropy-guided latent reasoning and dynamic visual patch injection in MLLMs.
High-quality generation of dynamic game content via small language models: A proof of concept cs.AI · 2026-01-30 · conditional · none · ref 20 · internal anchor
Proof-of-concept shows fine-tuned small language models achieve adequate quality for real-time game content generation in a scoped RPG loop via retry-until-success and LLM-as-judge evaluation.
Concise Geometric Description as a Bridge: Unleashing the Potential of LLM for Plane Geometry Problem Solving cs.AI · 2026-01-29 · unverdicted · none · ref 12 · internal anchor
An MLLM interpreter generates concise CDL descriptions from diagrams, enabling an off-the-shelf LLM to solve plane geometry problems competitively after training on only 5.5k examples.
Can Textual Reasoning Improve the Performance of MLLMs on Fine-grained Visual Classification? cs.CV · 2026-01-11 · unverdicted · none · ref 14 · internal anchor
Longer textual reasoning chains degrade MLLM accuracy on fine-grained visual tasks; a new normalization and constrained-reward training framework mitigates the effect and sets new SOTA numbers.
Dual-Margin Embedding for Fine-Grained Long-Tailed Plant Taxonomy cs.CV · 2025-12-22 · unverdicted · none · ref 11 · internal anchor
TaxoNet uses a dual-margin objective to reshape decision boundaries in long-tailed fine-grained plant taxonomy, improving rare-class geometry under open-world conditions.
LongCat-Image Technical Report cs.CV · 2025-12-08 · unverdicted · none · ref 27 · internal anchor
LongCat-Image delivers a compact 6B-parameter bilingual image generation model that sets new standards for Chinese character rendering accuracy and photorealism while remaining efficient and fully open-source.
OneThinker: All-in-one Reasoning Model for Image and Video cs.CV · 2025-12-02 · unverdicted · none · ref 52 · internal anchor
OneThinker unifies image and video reasoning in one model across 10 tasks via a 600k corpus, CoT-annotated SFT, and EMA-GRPO reinforcement learning, reporting strong results on 31 benchmarks plus some cross-task transfer.
AgroCoT: A Chain-of-Thought Benchmark for Evaluating Reasoning in Vision-Language Models for Agriculture cs.AI · 2025-11-28 · unverdicted · none · ref 18 · internal anchor
AgroCoT is a new Chain-of-Thought VQA benchmark with 4759 samples to evaluate reasoning capabilities of vision-language models in agriculture.
Multimodal Large Language Models with Adaptive Preference Optimization for Sequential Recommendation cs.IR · 2025-11-24 · unverdicted · none · ref 22 · internal anchor
HaNoRec dynamically weights harder preference samples and applies Gaussian perturbations to output distributions to improve multimodal LLM performance on sequential recommendation tasks.
AVATAAR: Agentic Video Answering via Temporal Adaptive Alignment and Reasoning cs.CV · 2025-11-19 · unverdicted · none · ref 15 · internal anchor
AVATAAR reports relative gains of 5-8% over baseline on CinePile benchmark categories through agentic feedback for long video QA.
Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm cs.CV · 2025-11-06 · unverdicted · none · ref 24 · internal anchor
Video generation models demonstrate competitive multimodal reasoning on a new benchmark, matching or exceeding VLMs on visual puzzles and achieving 92% on MATH and 69.2% on MMMU.
NoisyGRPO: Incentivizing Multimodal CoT Reasoning via Noise Injection and Bayesian Estimation cs.CV · 2025-10-24 · unverdicted · none · ref 14 · internal anchor
NoisyGRPO is an RL framework that perturbs visual inputs with Gaussian noise for exploration and computes trajectory advantages via Bayesian posterior fusion of noise prior and reward likelihood to improve multimodal CoT generalization.
MERIT: Modular Framework for Multimodal Misinformation Detection with Web-Grounded Reasoning cs.AI · 2025-10-20 · unverdicted · none · ref 27 · internal anchor
MERIT achieves 81.65% F1 on MMFakeBench for multimodal misinformation detection via a four-module framework, outperforming zero-shot baselines like GPT-4V with MMD-Agent at 74.0% F1, with gains attributed to architectural design.
Reflection-Based Task Adaptation for Self-Improving VLA cs.RO · 2025-10-14 · unverdicted · none · ref 29 · internal anchor
Reflective Self-Adaptation combines failure-reflective reinforcement learning with success-guided imitation learning to enable faster and more reliable task adaptation for pre-trained Vision-Language-Action models.
Fall into a Pit, Gain in a Wit: Cognitive-Guided Harmful Meme Detection via Misjudgment Risk Pattern Retrieval cs.LG · 2025-10-10 · unverdicted · none · ref 19 · internal anchor
PatMD improves harmful meme detection by retrieving misjudgment risk patterns to guide MLLMs, reporting 8.30% average F1 and 7.71% accuracy gains on 6,626 memes across 5 tasks.
When Silence Matters: The Impact of Irrelevant Audio on Text Reasoning in Large Audio-Language Models cs.SD · 2025-10-01 · unverdicted · none · ref 6 · internal anchor
Irrelevant audio including silence reduces accuracy and increases volatility in text reasoning for large audio-language models, with effects worsening at longer durations, higher amplitudes, and higher temperatures.
Semantic-Aware Logical Reasoning via a Semiotic Framework cs.AI · 2025-09-29 · conditional · none · ref 22 · internal anchor
LogicAgent uses a semiotic-square-guided approach to enhance logical reasoning in LLMs on the new RepublicQA benchmark and others, reporting average gains of 6.25% and 7.05% respectively.
Diamonds in the rough: Transforming SPARCs of imagination into a game concept by leveraging medium sized LLMs cs.HC · 2025-09-29 · unverdicted · none · ref 14 · internal anchor
Medium-sized LLMs can supply useful feedback on game concepts in early design stages, as demonstrated by model comparisons and a positive student pilot study.
Mixture-of-Visual-Thoughts: Exploring Context-Adaptive Reasoning Mode Selection for General Visual Reasoning cs.AI · 2025-09-26 · unverdicted · none · ref 15 · internal anchor
MoVT unifies different visual reasoning modes in a single model and uses the AdaVaR two-stage framework with supervised cold-start and RL via AdaGRPO to enable context-adaptive mode selection, yielding consistent gains on visual reasoning tasks.
RadAgents: Multimodal Agentic Reasoning for Chest X-ray Interpretation with Radiologist-like Workflows cs.MA · 2025-09-24 · unverdicted · none · ref 6 · internal anchor
RadAgents is a multi-agent framework coupling clinical priors with task-aware multimodal reasoning and radiologist-like workflows, plus grounding and retrieval-augmentation for conflict resolution in chest X-ray interpretation.
Failure Modes of Maximum Entropy RLHF cs.LG · 2025-09-24 · unverdicted · none · ref 37 · internal anchor
Derives SimPO from MaxEnt RL and reports that MaxEnt RL in online RLHF exhibits frequent overoptimization and unstable KL dynamics across scales, unlike stable KL-constrained baselines.
Digital Voices of Survival: From Social Media Disclosures to Support Provisions for Domestic Violence Victims cs.SI · 2025-09-15 · unverdicted · none · ref 2 · internal anchor
A four-component computational framework is proposed and tested on social media data to model domestic violence self-disclosure and community support provisions.
Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search cs.CV · 2025-09-09 · unverdicted · none · ref 13 · internal anchor
Mini-o3 scales visual search reasoning to tens of interaction turns via a new probe dataset, iterative trajectory collection, and over-turn masking in RL, claiming SOTA performance while training only up to six turns.
Video Parallel Scaling: Aggregating Diverse Frame Subsets for VideoLLMs cs.CV · 2025-09-09 · unverdicted · none · ref 9 · internal anchor
Video Parallel Scaling improves VideoLLM performance by aggregating outputs from parallel inferences on complementary disjoint frame subsets, effectively contracting the Chinchilla scaling law via uncorrelated visual evidence.
Draw-In-Mind: Rebalancing Designer-Painter Roles in Unified Multimodal Models Benefits Image Editing cs.CV · 2025-09-02 · unverdicted · none · ref 10 · internal anchor
Rebalancing designer-painter roles by assigning design to the understanding module via the new DIM dataset yields SOTA image editing performance with a 4.6B model.
Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning cs.CV · 2025-08-28 · unverdicted · none · ref 9 · internal anchor
Pref-GRPO stabilizes T2I RL training by using pairwise win rates from preference models as rewards instead of normalized pointwise scores, while UniGenBench enables finer-grained model evaluation across themes and criteria.
CraftGraffiti: Exploring Human Identity with Custom Graffiti Art via Facial-Preserving Diffusion Models cs.CV · 2025-08-28 · unverdicted · none · ref 22 · internal anchor
CraftGraffiti applies LoRA-tuned diffusion transformers followed by identity-augmented self-attention and CLIP-guided pose extension to generate graffiti while preserving facial features.
GUARD: Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics for LLMs cs.CL · 2025-08-28 · unverdicted · none · ref 46 · internal anchor
GUARD automates generation of guideline-violating questions and jailbreak diagnostics to test LLM compliance with government ethics guidelines, validated empirically on eight models and extended to vision-language models.
InquireMobile: Teaching VLM-based Mobile Agent to Request Human Assistance via Reinforcement Fine-Tuning cs.AI · 2025-08-27 · unverdicted · none · ref 10 · internal anchor
InquireMobile applies two-stage reinforcement fine-tuning and pre-action reasoning to VLM mobile agents, raising inquiry success rate by 46.8% on the introduced InquireBench benchmark.
Generative Interfaces for Language Models cs.CL · 2025-08-26 · unverdicted · none · ref 14 · internal anchor
Generative interfaces let LLMs create task-specific UIs that users prefer up to 72% more than standard chat responses across tested tasks.
Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey cs.RO · 2025-08-18 · unverdicted · none · ref 176 · internal anchor
This survey organizes large VLM-based VLA models for robotic manipulation into monolithic and hierarchical paradigms, reviews their integrations and datasets, and outlines future directions.
Talk Less, Fly Lighter: Autonomous Semantic Compression for UAV Swarm Communication via LLMs cs.RO · 2025-08-16 · unverdicted · none · ref 31 · internal anchor
LLM-based autonomous semantic compression in four 2D UAV swarm simulations shows potential for efficient collaborative communication under bandwidth constraints.
UAV-VL-R1: Generalizing Vision-Language Models via Supervised Fine-Tuning and Multi-Stage GRPO for UAV Visual Reasoning cs.CV · 2025-08-15 · unverdicted · none · ref 1 · internal anchor
UAV-VL-R1 combines SFT and multi-stage GRPO reinforcement learning on a new 50,019-sample HRVQA-VL dataset to deliver substantially higher zero-shot accuracy on UAV visual reasoning tasks than both its 2B baseline and a 72B-scale model.
gpt-oss-120b & gpt-oss-20b Model Card cs.CL · 2025-08-08 · unverdicted · none · ref 18 · internal anchor
OpenAI releases two open-weight reasoning models, gpt-oss-120b and gpt-oss-20b, trained via distillation and RL with claimed strong results on math, coding, and safety benchmarks.
Difficulty-Based Preference Data Selection by DPO Implicit Reward Gap cs.CL · 2025-08-06 · unverdicted · none · ref 53 · internal anchor
Selecting preference pairs whose DPO implicit reward gap is small yields better LLM alignment than random or baseline selection while using only 10% of the data.
ChemDFM-R: A Chemical Reasoning LLM Enhanced with Atomized Chemical Knowledge cs.CE · 2025-07-29 · unverdicted · none · ref 11 · internal anchor
ChemDFM-R is a chemical reasoning LLM trained via a four-stage pipeline on the ChemFG dataset of functional-group annotations for molecules and reactions, reaching performance comparable to or better than commercial models on chemical benchmarks.
A Survey on Vision-Language-Action Models: An Action Tokenization Perspective cs.RO · 2025-07-02 · unverdicted · none · ref 11 · internal anchor
The survey frames VLA models as pipelines that generate progressively grounded action tokens and classifies those tokens into eight types to guide future development.
Thought Graph Traversal for Test-time Scaling in Chest X-ray VLLMs cs.CV · 2025-06-13 · unverdicted · none · ref 32 · internal anchor
A new prompting framework called Thought Graph Traversal combined with reasoning budget forcing improves test-time performance of frozen chest X-ray VLLMs on report generation benchmarks.
How Far Are We from Generating Missing Modalities with Foundation Models? cs.MM · 2025-06-04 · unverdicted · none · ref 7 · internal anchor
Evaluates 42 variants of foundation models across three formalized paradigms for missing modality reconstruction, identifies shortfalls in semantic extraction and validation, and introduces an agentic framework that reduces FID by at least 14% for images and MER by at least 10% for text.

GPT-4o System Card

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer