super hub Mixed citations

GPT-4o System Card

author=, Gpt-4o system card · 2024 · cs.CL · arXiv 2410.21276

Mixed citation behavior. Most common role is background (53%).

827 Pith papers citing it

Background 53% of classified citations

open full Pith review browse 827 citing papers more from author= arXiv PDF

abstract

GPT-4o is an autoregressive omni model that accepts as input any combination of text, audio, image, and video, and generates any combination of text, audio, and image outputs. It's trained end-to-end across text, vision, and audio, meaning all inputs and outputs are processed by the same neural network. GPT-4o can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time in conversation. It matches GPT-4 Turbo performance on text in English and code, with significant improvement on text in non-English languages, while also being much faster and 50\% cheaper in the API. GPT-4o is especially better at vision and audio understanding compared to existing models. In line with our commitment to building AI safely and consistent with our voluntary commitments to the White House, we are sharing the GPT-4o System Card, which includes our Preparedness Framework evaluations. In this System Card, we provide a detailed look at GPT-4o's capabilities, limitations, and safety evaluations across multiple categories, focusing on speech-to-speech while also evaluating text and image capabilities, and measures we've implemented to ensure the model is safe and aligned. We also include third-party assessments on dangerous capabilities, as well as discussion of potential societal impacts of GPT-4o's text and vision capabilities.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 97 baseline 51 method 23 dataset 3

citation-polarity summary

background 93 baseline 51 use method 22 unclear 4 use dataset 3 support 1

claims ledger

abstract GPT-4o is an autoregressive omni model that accepts as input any combination of text, audio, image, and video, and generates any combination of text, audio, and image outputs. It's trained end-to-end across text, vision, and audio, meaning all inputs and outputs are processed by the same neural network. GPT-4o can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time in conversation. It matches GPT-4 Turbo performance on text in English and code, with significant improvement on text in non-English languages, while

authors

author= Gpt-4o system card

co-cited works

representative citing papers

Where to Look: Can Foundation Models Reach a Target Viewpoint Through Active Exploration?

cs.CV · 2026-05-31 · accept · novelty 8.0

Introduces the TVR active viewpoint-matching task and TVRBench indoor simulation benchmark, where foundation models start at low single-digit success rates but reach 51.4% after visual-action SFT and multi-turn GRPO post-training.

VideoFDB: Evaluating Full-Duplex Vision-Speech Capabilities in Conversational Agents

cs.CV · 2026-05-28 · unverdicted · novelty 8.0

VideoFDB is a new benchmark and LM-as-judge framework for evaluating full-duplex audio-visual-to-audio-visual conversational agents on nonverbal dynamics from real video calls.

Knowledge Poisoning Attacks on Medical Multi-Modal Retrieval-Augmented Generation

cs.CR · 2026-05-11 · unverdicted · novelty 8.0

M³Att poisons medical multimodal RAG by pairing covert textual misinformation with query-agnostic visual perturbations that increase retrieval of the bad content, causing LLMs to generate clinically plausible but incorrect responses.

From Mirage to Grounding: Towards Reliable Multimodal Circuit-to-Verilog Code Generation

cs.SE · 2026-04-30 · unverdicted · novelty 8.0

MLLMs exhibit a Mirage effect by bypassing circuit diagrams in favor of header semantics for Verilog generation; VeriGround with identifier anonymization and D-ORPO training reaches 46% Functional Pass@1 while refusing blank images at >92%.

CHASM: Unveiling Covert Advertisements on Chinese Social Media

cs.LG · 2026-04-22 · unverdicted · novelty 8.0

CHASM is a new benchmark dataset showing that existing multimodal large language models fail to reliably detect covert advertisements on Chinese social media even after fine-tuning.

HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models

cs.SD · 2026-04-21 · unverdicted · novelty 8.0

HalluAudio is the first large-scale benchmark spanning speech, environmental sound, and music that uses human-verified QA pairs, adversarial prompts, and mixed-audio tests to measure hallucinations in large audio-language models.

EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations

cs.CV · 2026-04-20 · unverdicted · novelty 8.0

EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.

HarmfulSkillBench: How Do Harmful Skills Weaponize Your Agents?

cs.CR · 2026-04-16 · unverdicted · novelty 8.0

Harmful skills in open agent ecosystems raise average harm scores from 0.27 to 0.76 across six LLMs by lowering refusal rates when tasks are presented via pre-installed skills.

ReConText3D: Replay-based Continual Text-to-3D Generation

cs.CV · 2026-04-15 · conditional · novelty 8.0

ReConText3D is the first replay-memory framework for continual text-to-3D generation that prevents catastrophic forgetting on new textual categories while preserving quality on previously seen classes.

MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark

cs.CV · 2026-04-12 · unverdicted · novelty 8.0 · 2 refs

MMRareBench provides 1,756 QA pairs and 7,958 images from PMC rare-disease cases to evaluate 23 MLLMs, revealing low treatment-planning scores and medical models underperforming general models on multi-image tasks due to capacity dilution.

DialBGM: A Benchmark for Background Music Recommendation from Everyday Multi-Turn Dialogues

cs.AI · 2026-04-09 · unverdicted · novelty 8.0

DialBGM is a new benchmark dataset revealing that existing AI models fall far short of human performance when recommending fitting background music for open-domain conversations.

EgoSound: Benchmarking Sound Understanding in Egocentric Videos

cs.CV · 2026-02-15 · unverdicted · novelty 8.0

EgoSound is a new benchmark with 7315 QA pairs across seven tasks to evaluate egocentric sound understanding in multimodal large language models.

VLRS-Bench: A Vision-Language Reasoning Benchmark for Remote Sensing

cs.CV · 2026-02-04 · unverdicted · novelty 8.0

VLRS-Bench is the first benchmark dedicated to complex vision-language reasoning in remote sensing, with 2000 QA pairs across 14 tasks in cognition, decision, and prediction dimensions.

SwissGov-RSD: A Human-annotated, Cross-lingual Benchmark for Token-level Recognition of Semantic Differences Between Related Documents

cs.CL · 2025-12-08 · accept · novelty 8.0

SwissGov-RSD is the first naturalistic cross-lingual document-level benchmark with human token-level semantic difference annotations, on which both LLMs and encoders show a large performance gap relative to simpler settings.

Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark

cs.AI · 2025-09-30 · unverdicted · novelty 8.0

CritPt benchmark shows state-of-the-art LLMs reach only 5.7% average accuracy on full-scale unpublished physics research tasks, rising to about 10% with coding tools.

Flow-GRPO: Training Flow Matching Models via Online RL

cs.CV · 2025-05-08 · unverdicted · novelty 8.0

Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

cs.CV · 2024-09-25 · accept · novelty 8.0

Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

LiveBench: A Challenging, Contamination-Limited LLM Benchmark

cs.CL · 2024-06-27 · unverdicted · novelty 8.0

LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.

OmniCoT: A Benchmark for Global and Multi-Step Panoramic Reasoning

cs.CV · 2026-06-29 · unverdicted · novelty 7.0

OmniCoT is a new panoramic reasoning benchmark with 6.7K eval, 1K real, and 14.3K training examples plus a two-stage SFT+GRPO training method to enforce global 360-degree consistency.

MuseBench: Benchmarking Intent-Level Audiovisual Arts Understanding in MLLMs

cs.CV · 2026-06-29 · unverdicted · novelty 7.0

MuseBench shows state-of-the-art MLLMs achieve only 48.29% accuracy on intent-level audiovisual arts understanding versus 87.18% for human experts.

A Diagnostic Framework and Multi-Evaluator Audit of Evaluator-Driven Preference Dynamics in Self-Adapting LLM Agents

cs.LG · 2026-06-29 · unverdicted · novelty 7.0

A diagnostic framework called EPC reveals that proprietary LLM evaluators can exhibit large preference shifts between versions, as evidenced by a GPT-4o May-to-June drift that inverted study conclusions, rendering single-snapshot evaluations unreliable.

GigaSpeechBench: A Real-World Multilingual Speech-to-Text Benchmark

eess.AS · 2026-06-27 · unverdicted · novelty 7.0

GigaSpeechBench is a new 680-hour in-the-wild multilingual ASR/AST benchmark with five modules for low-resource languages, Chinese dialects, English accents, domain terminology, and age-varied speech, showing model performance drops.

HumanMoveVQA: Can Video MLLMs reason about human movement in videos?

cs.CV · 2026-06-26 · unverdicted · novelty 7.0 · 2 refs

HumanMoveVQA is a new benchmark that generates 10K+ QA pairs from 3D-lifted video tracks to evaluate video MLLMs on global human trajectory and orientation reasoning.

PhyEditBench: A Real-World Multi-Stage Benchmark for Physics-Aware Image Editing

cs.CV · 2026-06-25 · unverdicted · novelty 7.0

PhyEditBench is a new benchmark for physics-aware image editing with real and synthetic instances plus a training-free PhyWorld baseline that uses test-time scaling to outperform SOTA models.

citing papers explorer

Showing 50 of 827 citing papers.

ChemDFM-R: A Chemical Reasoning LLM Enhanced with Atomized Chemical Knowledge cs.CE · 2025-07-29 · unverdicted · none · ref 11 · internal anchor
ChemDFM-R is a chemical reasoning LLM trained via a four-stage pipeline on the ChemFG dataset of functional-group annotations for molecules and reactions, reaching performance comparable to or better than commercial models on chemical benchmarks.
A Survey on Vision-Language-Action Models: An Action Tokenization Perspective cs.RO · 2025-07-02 · unverdicted · none · ref 11 · internal anchor
The survey frames VLA models as pipelines that generate progressively grounded action tokens and classifies those tokens into eight types to guide future development.
Thought Graph Traversal for Test-time Scaling in Chest X-ray VLLMs cs.CV · 2025-06-13 · unverdicted · none · ref 32 · internal anchor
A new prompting framework called Thought Graph Traversal combined with reasoning budget forcing improves test-time performance of frozen chest X-ray VLLMs on report generation benchmarks.
How Far Are We from Generating Missing Modalities with Foundation Models? cs.MM · 2025-06-04 · unverdicted · none · ref 7 · internal anchor
Evaluates 42 variants of foundation models across three formalized paradigms for missing modality reconstruction, identifies shortfalls in semantic extraction and validation, and introduces an agentic framework that reduces FID by at least 14% for images and MER by at least 10% for text.
Advancing AI Research Assistants with Expert-Involved Learning cs.AI · 2025-05-03 · unverdicted · none · ref 44 · internal anchor
ARIEL evaluates LLMs and LMMs on full-length biomedical summarization and figure interpretation with blinded expert review, identifies limitations, and demonstrates gains from prompt engineering, fine-tuning, and an integrated agent for hypothesis generation.
Document Retrieval Augmented Fine-Tuning (DRAFT) for safety-critical software assessments cs.SE · 2025-05-02 · unverdicted · none · ref 12 · internal anchor
DRAFT fine-tunes LLMs with a dual-retrieval architecture and semi-automated datasets containing distractors to achieve 7% higher correctness in safety compliance assessments.
If Concept Bottlenecks are the Question, are Foundation Models the Answer? cs.LG · 2025-04-28 · unverdicted · none · ref 33 · internal anchor
Empirical tests of VLM-CBMs show VLM supervision differs from expert annotations depending on task and that concept accuracy correlates weakly with quality metrics.
Kimi-Audio Technical Report eess.AS · 2025-04-25 · unverdicted · none · ref 29 · internal anchor
Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million hours of speech, sound, and music data.
Evaluating Clinical Competencies of Large Language Models with a General Practice Benchmark cs.CL · 2025-03-22 · unverdicted · none · ref 9 · internal anchor
Introduces a competency-based GPBench benchmark and evaluates ten LLMs, concluding they require continuous human supervision for clinical general practice.
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs cs.CL · 2025-03-03 · unverdicted · none · ref 25 · internal anchor
Phi-4-Mini achieves strong math and coding performance with only 3.8B parameters via high-quality synthetic data, while Phi-4-Multimodal uses Mixture-of-LoRAs to integrate modalities and top speech recognition leaderboards.
Large Language Models Can Help Mitigate Barren Plateaus in Quantum Neural Networks quant-ph · 2025-02-17 · unverdicted · none · ref 14 · internal anchor
AdaInit uses LLMs with submartingale properties to iteratively synthesize QNN initial parameters that maintain non-negligible gradient variance and mitigate barren plateaus, with claimed theoretical convergence guarantees and empirical outperformance.
CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models cs.SD · 2024-12-13 · unverdicted · none · ref 36 · internal anchor
CosyVoice 2 delivers human-parity naturalness and near-lossless streaming speech synthesis by combining finite-scalar quantization, a streamlined pre-trained LLM, and chunk-aware causal flow matching on large multilingual data.
In Context Learning and Reasoning for Symbolic Regression with Large Language Models cs.CL · 2024-10-22 · unverdicted · none · ref 93 · internal anchor
GPT-4 models rediscover Langmuir isotherms and produce fits on Nikuradse pipe-flow data via iterative chain-of-thought prompting with scientific context and external code feedback.
Audio-Mind: An Auditable Agentic Framework for Audio Understanding eess.AS · 2026-05-27 · unverdicted · none · ref 15 · internal anchor
Audio-Mind introduces a conditional, auditable agentic framework for audio understanding that preserves frontend judgment and acquires bounded external evidence only when needed, reporting 80.4% on MMAR and 82.8% on MSU-Bench.
Tracing the ongoing emergence of human-like reasoning in Large Language Models cs.CL · 2026-05-20 · unverdicted · none · ref 73 · internal anchor
LLMs function as accurate semantic processors for conditionals but do not replicate the pragmatic inferences that define human reasoning.
Tactile-based Multimodal Fusion in Embodied Intelligence: A Survey of Vision, Language, and Contact-Driven Paradigms cs.RO · 2026-05-17 · unverdicted · none · ref 97 · internal anchor
A survey proposing a hierarchical taxonomy for multimodal tactile fusion datasets and methods across perception, generation, and interaction in embodied intelligence.
LASAR: Towards Spatio-temporal Reasoning with Latent Cognitive Map cs.CV · 2026-05-16 · unverdicted · none · ref 18 · internal anchor
LASAR pairs a dual-memory system with spatio-temporal contrastive learning to induce latent cognitive maps, reporting 2-3.5% zero-shot gains on VLN-CE and VSI-Bench plus high map self-consistency.
Localization Boosting for Growth Markets: Mitigating Cross-Locale Behavioral Bias in Learning-to-Rank cs.LG · 2026-05-11 · unverdicted · none · ref 24 · internal anchor
Multi-objective LTR combining clicks, VLM labels, and locale boosting improves relevance and local content visibility across five growth markets.
Afrispeech Semantics: Evaluating Audio Semantic Reasoning in Spoken Language Models Across Domains and Accents cs.CL · 2026-05-11 · unverdicted · none · ref 214 · internal anchor
Audio language models are benchmarked on five semantic and paralinguistic reasoning tasks to reveal limitations in handling spoken audio evidence, accent variation, and domain shifts.
EduStory: A Unified Framework for Pedagogically-Consistent Multi-Shot STEM Instructional Video Generation cs.CV · 2026-05-10 · unverdicted · none · ref 14 · internal anchor
EduStory combines pedagogical state modeling, structured script control, and new evaluation metrics to generate consistent multi-shot STEM videos while introducing the EduVideoBench diagnostic benchmark.
Operating Within the Operational Design Domain: Zero-Shot Perception with Vision-Language Models cs.CV · 2026-05-08 · unverdicted · none · ref 37 · 2 links · internal anchor
Vision-language models can serve as zero-shot ODD sensors for autonomous driving when using definition-anchored chain-of-thought prompting with persona decomposition.
Caring Without Feeling: Affective Dynamics as the Control Layer of Human-AI Agent Collaboration cs.HC · 2026-05-08 · unverdicted · none · ref 22 · internal anchor
A review synthesizes affective dynamics as a coordination layer in human-AI agent collaboration and proposes a framework for trust calibration, delegation, error correction, and governance.
Lightweight Stylistic Consistency Profiling: Robust Detection of LLM-Generated Textual Content for Multimedia Moderation cs.CL · 2026-05-07 · unverdicted · none · ref 4 · internal anchor
LiSCP detects LLM-generated text via stylistic consistency profiling across paraphrased variants and reports up to 11.79% better cross-domain accuracy plus robustness to adversarial attacks.
Architectural Constraints Alignment in AI-assisted, Platform-based Service Development cs.SE · 2026-05-06 · unverdicted · none · ref 10 · internal anchor
A retrieval-augmented scaffolding method using template retrieval and agentic clarification loops improves architectural consistency and deployability of AI-generated services over standard AI code generation.
OxyGent: Making Multi-Agent Systems Modular, Observable, and Evolvable via Oxy Abstraction cs.AI · 2026-04-28 · unverdicted · none · ref 6 · internal anchor
OxyGent supplies a modular framework for multi-agent systems via the Oxy abstraction for composition and monitoring and the OxyBank engine for continuous automated evolution.
Vision-Language-Action Safety: Threats, Challenges, Evaluations, and Mechanisms cs.RO · 2026-04-26 · accept · none · ref 28 · internal anchor
A literature survey that unifies fragmented work on attacks, defenses, evaluations, and deployment challenges for Vision-Language-Action models in robotics.
Reasoning-Aware AIGC Detection via Alignment and Reinforcement cs.AI · 2026-04-21 · unverdicted · none · ref 1 · internal anchor
REVEAL uses reasoning chains and two-stage SFT-plus-RL training to achieve state-of-the-art performance on AIGC detection across benchmarks with a new dataset.
XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments cs.CV · 2026-04-20 · unverdicted · none · ref 73 · internal anchor
XEmbodied is a foundation model that integrates 3D geometric and physical signals into VLMs using a 3D Adapter and Efficient Image-Embodied Adapter, plus progressive curriculum and RL post-training, to improve spatial reasoning and embodied performance on 18 benchmarks.
Are Multimodal LLMs Ready for Surveillance? A Reality Check on Zero-Shot Anomaly Detection in the Wild cs.CV · 2026-03-05 · unverdicted · none · ref 44 · internal anchor
Zero-shot MLLMs on ShanghaiTech and CHAD exhibit strong conservative bias with high precision but collapsed recall; class-specific prompts raise peak F1 from 0.09 to 0.64 yet recall remains the bottleneck.
OmniFysics: Towards Physical Intelligence Evolution via Omni-Modal Signal Processing and Network Optimization cs.CV · 2026-02-05 · unverdicted · none · ref 5 · internal anchor
OmniFysics is an omni-modal network using a dynamic physical data engine and evolutive tuning to improve performance on multimodal benchmarks and physics-oriented tasks.
Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking cs.CL · 2026-01-08 · unverdicted · none · ref 10 · internal anchor
Qwen3-VL-Embedding-8B achieves state-of-the-art performance with a 77.8 overall score on the MMEB-V2 multimodal embedding benchmark.
Towards Selection of Large Multimodal Models as Engines for Burned-in Protected Health Information Detection in Medical Images cs.CV · 2025-11-03 · unverdicted · none · ref 17 · internal anchor
Empirical benchmark of GPT-4o, Gemini 2.5 Flash, and Qwen 2.5 7B finds superior OCR performance over EasyOCR but inconsistent gains in overall PHI detection accuracy, with strongest improvements on complex imprint patterns.
World Simulation with Video Foundation Models for Physical AI cs.CV · 2025-10-28 · unverdicted · none · ref 35 · internal anchor
Cosmos-Predict2.5 unifies text-to-world, image-to-world, and video-to-world generation in one model trained on 200M clips with RL post-training, delivering improved quality and control for physical AI.
AlphaQuanter: An End-to-End Tool-Augmented Agentic Reinforcement Learning Framework for Stock Trading cs.CE · 2025-10-16 · unverdicted · none · ref 27 · internal anchor
AlphaQuanter introduces a single-agent tool-augmented RL framework for stock trading that learns dynamic policies over a transparent decision workflow and reports state-of-the-art financial metrics.
Human-aligned AI Model Cards with Weighted Hierarchy Architecture cs.SE · 2025-10-08 · unverdicted · none · ref 18 · internal anchor
Introduces CRAI-MCF, an eight-module framework distilling 217 parameters from 240 projects into a quantitative sufficiency criterion for cross-model LLM comparison grounded in Value Sensitive Design.
Aligning Perception, Reasoning, Modeling and Interaction: A Survey on Physical AI cs.AI · 2025-10-06 · unverdicted · none · ref 29 · internal anchor
A survey of physical AI that distinguishes theoretical physics reasoning from applied understanding and synthesizes advances in symbolic reasoning, embodied systems, and generative models to advocate for physics-grounded world models.
Speech-Based Cognitive Screening: A Systematic Evaluation of LLM Adaptation Strategies cs.CL · 2025-08-24 · unverdicted · none · ref 33 · internal anchor
Systematic comparison of nine text-only and three multimodal LLMs using in-context learning, reasoning prompts, fine-tuning, and multimodal fusion on DementiaBank speech data finds class-centroid demonstrations and token-level fine-tuning most effective, with adapted open models matching or beating
Perovskite-R1: a domain-specialized large language model for intelligent discovery of precursor additives and experimental design cs.LG · 2025-07-22 · unverdicted · none · ref 13 · internal anchor
A fine-tuned LLM called Perovskite-R1, built from curated perovskite literature and material libraries, proposes precursor additives and designs with some experimental validation showing improved stability and performance.
What Factors Affect LLMs and RLLMs in Financial Question Answering? cs.CL · 2025-07-11 · unverdicted · none · ref 6 · internal anchor
Prompting and agent methods boost standard LLMs on financial QA by simulating long chain-of-thought reasoning, but reasoning LLMs already have this capability and show limited further gains, while multilingual alignment helps mainly by lengthening reasoning with minimal benefit for reasoning models.
From Ambiguity to Accuracy: The Transformative Effect of Coreference Resolution on Retrieval-Augmented Generation systems cs.CL · 2025-07-10 · unverdicted · none · ref 11 · internal anchor
Coreference resolution improves retrieval relevance and QA performance in RAG systems, with mean pooling performing best and smaller models benefiting more.
Vision-EKIPL: External Knowledge-Infused Policy Learning for Visual Reasoning cs.CV · 2025-06-07 · unverdicted · none · ref 27 · internal anchor
Vision-EKIPL injects high-quality actions from external models into RL training to expand exploration and raise the reasoning ceiling of MLLMs, reporting up to 5% gains on the Reason-RFT-CoT benchmark.
Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models cs.CL · 2025-06-05 · unverdicted · none · ref 4 · internal anchor
Qwen3 Embedding models in 0.6B-8B sizes achieve state-of-the-art results on MTEB and retrieval tasks including code, cross-lingual, and multilingual retrieval through unsupervised pre-training, supervised fine-tuning, and model merging on Qwen3 backbones.
InfantAgent-Next: A Multimodal Generalist Agent for Automated Computer Interaction cs.AI · 2025-05-16 · unverdicted · none · ref 25 · internal anchor
InfantAgent-Next integrates tool-based and vision agents in a modular architecture and reports 7.27% accuracy on OSWorld, exceeding Claude-Computer-Use while also testing on GAIA and SWE-Bench.
Seed1.5-VL Technical Report cs.CV · 2025-05-11 · unverdicted · none · ref 54 · internal anchor
Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.
Phi-4-reasoning Technical Report cs.AI · 2025-04-30 · unverdicted · none · ref 27 · internal anchor
A 14B reasoning model trained via supervised fine-tuning on selected prompts and o3-mini traces, plus outcome RL, outperforms larger open models like DeepSeek-R1-Distill-Llama-70B on math, coding, planning and related benchmarks.
AirVista-II: An Agentic System for Embodied UAVs Toward Dynamic Scene Semantic Understanding cs.RO · 2025-04-13 · unverdicted · none · ref 24 · internal anchor
AirVista-II integrates agent-based task identification and scheduling, multimodal perception, and scenario-tailored keyframe extraction to deliver high-quality zero-shot semantic understanding for embodied UAVs in dynamic environments.
AI Realtor: Towards Grounded Persuasive Language Generation for Automated Copywriting cs.AI · 2025-02-24 · unverdicted · none · ref 23 · internal anchor
An LLM agent with grounding, personalization, and marketing modules generates real estate descriptions that human buyers prefer over expert-written ones while matching factual accuracy.
Large Language Model-Brained GUI Agents: A Survey cs.AI · 2024-11-27 · unverdicted · none · ref 94 · internal anchor
A survey consolidating frameworks, data practices, large action models, benchmarks, applications, and research gaps in LLM-brained GUI agents.
Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE cs.CV · 2026-05-04 · unverdicted · none · ref 77
Mamoda2.5 is a 25B-parameter DiT-MoE unified AR-Diffusion model that reaches top video generation and editing benchmarks with 4-step inference up to 95.9x faster than baselines.
EGI: A Multimodal Emotional AI Framework for Enhancing Scrum Master Real-time Self-Awareness cs.AI · 2026-05-17 · unverdicted · none · ref 34 · internal anchor
EGI integrates four existing AI components for real-time multimodal emotion monitoring and feedback in simulated agile meetings, reporting 10% WER and improved self-awareness for Scrum Masters.

GPT-4o System Card

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer