Deepfake-Eval-2024: A Multi-Modal In-the-Wild Benchmark of Deepfakes Circulated in 2024

Aerin Kim; Arnab Karmakar; Ben Caffee; Changyeon Lee; Emmanuel Tanumihardja; Hannah Lee; Jongwook Choi; Kevin Farhat; Lin Qiu; Nuria Alina Chandra

arxiv: 2503.02857 · v5 · pith:ABM3DDWWnew · submitted 2025-03-04 · cs.CV · cs.AI · cs.CY

Nuria Alina Chandra, Hannah Lee, Ryan Murtfeldt, Lin Qiu, Arnab Karmakar, Emmanuel Tanumihardja, Kevin Farhat, Ben Caffee, Changyeon Lee, Jongwook Choi, Sejin Paik, Aerin Kim, Oren Etzioni

classification: cs.CV, cs.AI, cs.CY

keywords: deepfake, deepfake-eval-2024, detection, models, benchmark, deepfakes, academic, accuracy

In the age of increasingly realistic generative AI, robust deepfake detection is essential for mitigating fraud and disinformation. While many deepfake detectors report high accuracy on academic datasets, we show that these academic benchmarks are out of date and not representative of real-world deepfakes. We introduce Deepfake-Eval-2024, a new deepfake detection benchmark consisting of in-the-wild deepfakes collected from social media and deepfake detection platform users in 2024. Deepfake-Eval-2024 consists of 45 hours of videos, 56.5 hours of audio, and 1,975 images, encompassing the latest manipulation technologies. The benchmark contains diverse media content from 88 different websites in 52 different languages. We find that the performance of open-source state-of-the-art deepfake detection models drops precipitously when evaluated on Deepfake-Eval-2024, with AUC decreasing by 50% for video, 48% for audio, and 45% for image models compared to previous benchmarks. We also evaluate commercial deepfake detection models and models finetuned on Deepfake-Eval-2024, and find that they have superior performance to off-the-shelf open-source models, but do not yet reach the accuracy of deepfake forensic analysts. The dataset is available at https://github.com/nuriachandra/Deepfake-Eval-2024.

Forward citations

Cited by 10 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Detecting Deception, Not Deepfakes: Why Media Forensics Needs Social Theories
cs.CY 2026-05 unverdicted novelty 7.0

Deepfake detection must shift from classifying media realism to detecting communicative deception by applying Speech Act Theory, Grice's Cooperative Principle, and Cialdini's influence principles.

Automated In-the-Wild Data Collection for Continual AI Generated Image Detection
cs.CV 2026-05 unverdicted novelty 7.0

An automated fact-check-based pipeline for in-the-wild AI image data, when mixed with generator data in continual learning, lets detectors adapt to new generators while avoiding forgetting and delivers 8-9% accuracy g...

ICLAD: In-Context Learning with Comparison-Guidance for Audio Deepfake Detection
cs.SD 2026-04 unverdicted novelty 7.0

ICLAD combines in-context learning and comparison guidance in audio language models with a routing detector to boost generalization and explanations for audio deepfake detection, achieving up to 2x F1 gains on wild data.

The Impact of AI-Generated Text on the Internet
cs.CY 2026-04 unverdicted novelty 7.0

By mid-2025 roughly 35% of new websites are AI-generated or AI-assisted, correlating with lower semantic diversity and higher positive sentiment but showing no significant drop in factual accuracy or stylistic diversity.

A SUPERB-Style Benchmark of Self-Supervised Speech Models for Audio Deepfake Detection
eess.AS 2026-03 unverdicted novelty 7.0

Spoof-SUPERB benchmark shows large-scale discriminative SSL models such as XLS-R, UniSpeech-SAT, and WavLM Large outperform others in audio deepfake detection and maintain robustness under acoustic degradations.

Alethia: A Foundational Encoder for Voice Deepfakes
cs.SD 2026-04 unverdicted novelty 6.0

Alethia is a pretrained audio encoder using continuous embedding prediction and generative flow-matching reconstruction that outperforms existing speech foundation models on voice deepfake tasks with better robustness...

Aletheia: Physics-Conditioned Localized Artifact Attention (PhyLAA-X) for End-to-End Generalizable and Robust Deepfake Video Detection
cs.CV 2026-04 unverdicted novelty 6.0

PhyLAA-X embeds physics-derived feature volumes into localized artifact attention for improved cross-generator generalization and adversarial robustness in deepfake detection.

Omni-Fake: Benchmarking Unified Multimodal Social Media Deepfake Detection
cs.CV 2026-05 unverdicted novelty 5.0

Omni-Fake delivers a unified multimodal deepfake benchmark dataset and RL-driven detector that reports gains in accuracy, cross-modal generalization, and explainability over prior baselines.

Advancing Reliable Synthetic Video Detection: Insights from the SAFE Challenge
cs.CV 2026-05 unverdicted novelty 4.0

The SAFE challenge shows measurable progress in detecting synthetic videos across different generators but persistent weaknesses against post-processing operations.

From AI-Generated Content to Agentic Action: Security and Safety Threats in Generative AI
cs.CR 2026-05 unverdicted novelty 3.0

The paper analyzes evolving security and safety threats in generative AI from content generation to agentic actions, noting that attack surfaces expand faster than defenses and that many safeguards require institution...