UNITE:

Towards a Universal Synthetic Video Detector: From Face or Background Manipulations to Fully AI-Generated Content

¹Google (YouTube), ²University of California, Riverside

Problem Overview: Existing DeepFake detection methods focus on identifying face-manipulated videos, and most cannot run inference unless a face is detected in the video. However, with advances such as seamless background modifications (e.g., AVID), hyper-realistic content from game engines like GTA-V, and T2V/I2V generative models, a more comprehensive approach is needed. A model trained with only cross-entropy (CE) loss on full frames automatically focuses on the face; its transformer architecture captures temporal discontinuities, so it performs better than chance on T2V/I2V content but struggles with background manipulations. UNITE, with its attention-diversity (AD) loss, effectively detects both face/background manipulations and fully synthetic content.

Abstract

Existing DeepFake detection techniques primarily focus on facial manipulations, such as face-swapping or lip-syncing. However, advancements in text-to-video (T2V) and image-to-video (I2V) generative models now allow fully AI-generated synthetic content and seamless background alterations, challenging face-centric detection methods and demanding more versatile approaches. To address this, we introduce the Universal Network for Identifying Tampered and synthEtic videos (UNITE) model, which, unlike traditional detectors, captures full-frame manipulations. UNITE extends detection capabilities to scenarios without faces, non-human subjects, and complex background modifications. It leverages a transformer-based architecture that processes domain-agnostic features extracted from videos via the SigLIP-So400m foundation model. Given limited datasets encompassing both facial/background alterations and T2V/I2V content, we integrate task-irrelevant data alongside standard DeepFake datasets in training. We further mitigate the model's tendency to over-focus on faces by incorporating an attention-diversity (AD) loss, which promotes diverse spatial attention across video frames. Combining AD loss with cross-entropy improves detection performance across varied contexts. Comparative evaluations demonstrate that UNITE outperforms state-of-the-art detectors on datasets featuring face/background manipulations and fully synthetic T2V/I2V videos, showcasing its adaptability and generalizable detection capabilities.

Method


UNITE architecture overview: We extract domain-agnostic features (\(\xi\)) using the SigLIP-So400m foundation model to mitigate domain gaps between DeepFake datasets. These embeddings, combined with positional encodings, are input to a transformer with multi-head attention and MLP layers, culminating in a classifier for the final prediction. The AD-loss encourages the model's attention to span diverse spatial regions.
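For intuition, the sketch below shows one way the pipeline described above could be assembled in PyTorch. It is a minimal illustration under stated assumptions, not the released implementation: the feature dimension (1152 for SigLIP-So400m), the use of one embedding per sampled frame, learnable positional encodings, the default depth of 4, and all module names are assumptions inferred from this description.

import torch
import torch.nn as nn

class UNITESketch(nn.Module):
    """Minimal sketch: a transformer classifier over per-frame foundation-model features."""

    def __init__(self, feat_dim=1152, depth=4, num_heads=8, num_classes=2, max_frames=64):
        super().__init__()
        # Learnable positional encodings added to the sequence of frame embeddings.
        self.pos_embed = nn.Parameter(torch.zeros(1, max_frames, feat_dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads, dim_feedforward=4 * feat_dim,
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, frame_feats):
        # frame_feats: (B, T, feat_dim) features, e.g. one SigLIP-So400m vector per sampled frame.
        x = frame_feats + self.pos_embed[:, :frame_feats.size(1)]
        x = self.encoder(x)
        # Simple mean pooling over time before the classifier head.
        return self.classifier(x.mean(dim=1))

# Example: logits = UNITESketch()(torch.randn(2, 64, 1152))

Mean pooling over time is only one simple readout; the actual model may aggregate tokens differently.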

 

Results


Quantitative Results: Results from the UNITE model trained with (1) FF++ only and (2) FF++ combined with GTA-V. All other results reflect cross-dataset evaluations, except for FF++ and GTA-V when they are included in training. Performance gains are highlighted in green.


SOTA Comparison on Face-Manipulated Data: We compare the detection accuracy of UNITE with recent DeepFake detectors on various face-manipulated datasets; UNITE outperforms the existing methods. Bold indicates the current best result, and the previous best and second-best results are highlighted in red and blue, respectively.


SOTA Comparison on Synthetic Data: On the DeMamba dataset (validation split), we compare UNITE, which was NOT trained on the DeMamba train split, against state-of-the-art detectors that were trained on it (results taken from Chen et al.). We report precision (\(P\)) and recall (\(R\)) for the individual T2V/I2V generators and the average performance across the entire validation set (\(Avg\), which also includes real videos). Although this direct comparison disadvantages UNITE, which was trained only on FF++ and GTA-V, our method still outperforms these synthetic video detectors. Bold indicates the current best result, and the previous best and second-best results are highlighted in red and blue, respectively.


Results on 3-class Fine-Grained Classification: Results obtained by the UNITE model on three fine-grained classes: detecting whether a video is real, partially manipulated, or fully AI-generated. Performance gains are highlighted in green.


Results on 4-class Fine-Grained Classification: We divide the existing DeepFake datasets into four categories: (1) face-swap, (2) face-reenactment, (3) fully synthetic, and (4) real, and perform 4-class fine-grained classification using UNITE. Performance gains are highlighted in green. The results show that UNITE can effectively classify videos into these four categories.

Ablation Study


Ablation of Loss Functions: Ablation results showing the effect of changing the loss functions used to train the UNITE model. The combination of the cross-entropy (CE) and attention-diversity (AD) losses always performs best, and the contribution of the AD-loss component increases significantly when fully synthetic data is used for training.


No. of Frames vs. Performance: Performance analysis of UNITE as a function of the number of frames sampled per video segment. As the number of frames increases from 1 to 64 (the full context window), detection accuracy improves, illustrating UNITE's ability to effectively capture temporal inconsistencies in fake videos.
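As a concrete illustration of the setting varied in this ablation, the snippet below uniformly samples up to 64 frame indices from a video segment. Whether the paper uses uniform sampling is an assumption; the function is purely illustrative.

import numpy as np

def sample_frame_indices(num_video_frames, num_samples=64):
    """Evenly sample frame indices across a video segment (illustrative only)."""
    if num_video_frames <= num_samples:
        return list(range(num_video_frames))
    # Evenly spaced indices spanning the whole segment.
    return np.linspace(0, num_video_frames - 1, num_samples).round().astype(int).tolist()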


Transformer Depth Evaluation: Performance comparison of UNITE when varying the number of encoder blocks (depth). UNITE performs best in cross-domain settings at a depth of 4; greater depths overfit to the training domains (FF++ and GTA-V), while a depth of 2 is insufficient to capture the complexity of the data.


Ablation of Foundation Model Backbone: Results (accuracy) obtained by UNITE when trained with DINOv2 features of FF++ and GTA-V instead of SigLIP-So400m. The results show that the performance gain indeed comes from the AD-loss implementation and is not dependent on the choice of the foundation model.


t-SNE Analysis: t-SNE plots with (w/) and without (w/o) AD-loss in cross-dataset settings.

The results from the t-SNE plots reveal a significant improvement in class separability when the AD-loss is incorporated into the training process. The features learned with AD-loss exhibit a clearer distinction between real and fake samples, indicating that the AD-loss helps to create a more discriminative feature space. This enhanced separability is particularly notable in cross-dataset settings, where the model is trained on one dataset but evaluated on another. The improved class separation in these scenarios suggests that the AD-loss not only enhances the model's performance on the training dataset but also improves its generalizability across different datasets.
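For completeness, a minimal sketch of how such a visualization can be produced with scikit-learn is shown below. The choice of layer from which the features are taken, as well as the t-SNE perplexity and other settings, are assumptions and not taken from the paper.

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_tsne(features, labels, title="t-SNE of video features"):
    """Project video-level features to 2-D and colour points by real/fake label."""
    # features: (N, D) array of embeddings (e.g. from the layer before the classifier);
    # labels: (N,) array with 0 = real, 1 = fake.
    coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)
    for cls, name in [(0, "real"), (1, "fake")]:
        mask = labels == cls
        plt.scatter(coords[mask, 0], coords[mask, 1], s=5, label=name)
    plt.legend()
    plt.title(title)
    plt.show()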




Ablation of AD-Loss Hyperparameters: Performance comparison of UNITE across varying values of the (a) \(\delta_{within}\) and (b) \(\delta_{between}\) hyperparameters. The results indicate that the model's learning is relatively robust to changes in these hyperparameters. Specifically, in (a) the results are consistent as long as the signs of the first and second parameters of \(\delta_{within}\) are opposite.
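The exact form of the AD-loss is not reproduced on this page. Purely for intuition, the sketch below shows one plausible margin-based formulation: it treats \(\delta_{between}\) as a margin on the pairwise similarity between the spatial attention maps of different heads and \(\delta_{within}\) as a margin on how concentrated a single head's attention may become. Both interpretations, the hinge form, and the default values are assumptions; the actual definition is given in the paper and supplementary material.

import torch
import torch.nn.functional as F

def attention_diversity_loss(attn, delta_within=0.5, delta_between=0.3):
    """Illustrative margin-based diversity penalty on spatial attention maps.

    attn: (B, H, S) attention weights, one distribution over S spatial
    locations per head. This is NOT the paper's exact AD-loss formulation.
    """
    _, num_heads, _ = attn.shape

    # "Within" term: hinge on how peaked each head's attention is, discouraging
    # a head from collapsing onto a single region (e.g. the face).
    peakedness = attn.max(dim=-1).values                  # (B, H)
    loss_within = F.relu(peakedness - delta_within).mean()

    # "Between" term: hinge on pairwise cosine similarity between heads,
    # pushing different heads toward different spatial regions.
    a = F.normalize(attn, dim=-1)                         # (B, H, S)
    sim = torch.einsum("bhs,bgs->bhg", a, a)              # (B, H, H)
    sim = sim - torch.eye(num_heads, device=attn.device)  # drop self-similarity
    loss_between = F.relu(sim - delta_between).mean()

    return loss_within + loss_between

In training, such a term would be combined with cross-entropy, e.g. total = CE + λ·AD, with λ weighting the diversity penalty (again an assumption about the exact objective).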


Evolution of detection performance under different ablation settings: This table highlights the impact of various modifications to the UNITE training pipeline on detection performance (accuracy) across multiple datasets. Starting from a base model that applies simple average pooling to SigLIP-So400m features, we show the effect of switching the architecture to a transformer, incorporating synthetic data into training, and adding the proposed AD-loss. These changes progressively enhance performance, with the addition of AD-loss achieving near-perfect or significantly improved results across all datasets.

NOTE: More results and analysis are available in our supplementary material.

Authors

Rohit Kundu

Hao Xiong

Vishal Mohanty

Athula Balachandran

Amit K. Roy-Chowdhury

BibTeX

@inproceedings{kundu2025towards,
  title={Towards a Universal Synthetic Video Detector: From Face or Background Manipulations to Fully AI-Generated Content},
  author={Kundu, Rohit and Xiong, Hao and Mohanty, Vishal and Balachandran, Athula and Roy-Chowdhury, Amit K},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition (CVPR) Conference},
  pages={28050--28060},
  year={2025}
}

Copyright: CC BY-NC-SA 4.0 © Rohit Kundu | Last updated: 07 June 2025 | Website credits to Nerfies