UNITE:

Towards a Universal Synthetic Video Detector: From Face or Background Manipulations to Fully AI-Generated Content

¹Google (YouTube), ²University of California, Riverside

Problem Overview: Existing DeepFake detection methods focus on identifying face-manipulated videos, and most cannot run inference unless a face is detected in the video. However, with advances such as seamless background modifications (e.g., AVID), hyper-realistic content from game engines like GTA-V, and T2V/I2V generative models, a more comprehensive approach is needed. A model trained with only cross-entropy (CE) loss on full frames automatically focuses on the face; its transformer architecture captures temporal discontinuities, so it performs better than chance on T2V/I2V content but struggles with background manipulations. UNITE, with its attention-diversity (AD) loss, effectively detects both face/background manipulations and fully synthetic content.

Abstract

Existing DeepFake detection techniques primarily focus on facial manipulations, such as face-swapping or lip-syncing. However, advancements in text-to-video (T2V) and image-to-video (I2V) generative models now allow fully AI-generated synthetic content and seamless background alterations, challenging face-centric detection methods and demanding more versatile approaches. To address this, we introduce the Universal Network for Identifying Tampered and synthEtic videos (UNITE) model, which, unlike traditional detectors, captures full-frame manipulations. UNITE extends detection capabilities to scenarios without faces, non-human subjects, and complex background modifications. It leverages a transformer-based architecture that processes domain-agnostic features extracted from videos via the SigLIP-So400m foundation model. Given limited datasets encompassing both facial/background alterations and T2V/I2V content, we integrate task-irrelevant data alongside standard DeepFake datasets in training. We further mitigate the model's tendency to over-focus on faces by incorporating an attention-diversity (AD) loss, which promotes diverse spatial attention across video frames. Combining AD loss with cross-entropy improves detection performance across varied contexts. Comparative evaluations demonstrate that UNITE outperforms state-of-the-art detectors on datasets featuring face/background manipulations and fully synthetic T2V/I2V videos, showcasing its adaptability and generalizable detection capabilities.

Method


UNITE architecture overview: We extract domain-agnostic features (\(\xi\)) using the SigLIP-So400m foundation model to mitigate domain gaps between DeepFake datasets. These embeddings, combined with positional encodings, are input to a transformer with multi-head attention and MLP layers, culminating in a classifier for the final prediction. The AD-loss encourages the model's attention to span diverse spatial regions.
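For intuition, the sketch below shows one way the pipeline described above could be assembled in PyTorch. It is a minimal illustration under stated assumptions, not the released implementation: the feature dimension (1152 for SigLIP-So400m), the use of one embedding per sampled frame, learnable positional encodings, the default depth of 4, and all module names are assumptions inferred from this description.

import torch
import torch.nn as nn

class UNITESketch(nn.Module):
    """Minimal sketch: a transformer classifier over per-frame foundation-model features."""

    def __init__(self, feat_dim=1152, depth=4, num_heads=8, num_classes=2, max_frames=64):
        super().__init__()
        # Learnable positional encodings added to the sequence of frame embeddings.
        self.pos_embed = nn.Parameter(torch.zeros(1, max_frames, feat_dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads, dim_feedforward=4 * feat_dim,
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, frame_feats):
        # frame_feats: (B, T, feat_dim) features, e.g. one SigLIP-So400m vector per sampled frame.
        x = frame_feats + self.pos_embed[:, :frame_feats.size(1)]
        x = self.encoder(x)
        # Simple mean pooling over time before the classifier head.
        return self.classifier(x.mean(dim=1))

# Example: logits = UNITESketch()(torch.randn(2, 64, 1152))

Mean pooling over time is only one simple readout; the actual model may aggregate tokens differently.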

 

Results


Quantitative Results: Results from the UNITE model trained with (1) FF++ only and (2) FF++ combined with GTA-V. All other results reflect cross-dataset evaluations, except for FF++ and GTA-V when they are included in training. Performance gains are highlighted in green.


SOTA Comparison on Face-Manipulated Data: We compare the detection accuracy of UNITE with recent DeepFake detectors on various face-manipulated datasets; UNITE outperforms the existing methods. Bold indicates the current best result, and the previous best and second-best results are highlighted in red and blue, respectively.


SOTA Comparison on Synthetic Data: On the DeMamba dataset (validation split), we compare UNITE, which was NOT trained on the DeMamba train split, against state-of-the-art detectors that were trained on it (results taken from Chen et al.). We report precision (\(P\)) and recall (\(R\)) for the individual T2V/I2V generators and the average performance across the entire validation set (\(Avg\), which also includes real videos). Although this direct comparison disadvantages UNITE, which was trained only on FF++ and GTA-V, our method still outperforms these synthetic video detectors. Bold indicates the current best result, and the previous best and second-best results are highlighted in red and blue, respectively.


Results on 3-class Fine-Grained Classification: Results obtained by the UNITE model on three fine-grained classes: detecting whether a video is real, partially manipulated, or fully AI-generated. Performance gains are highlighted in green.


Results on 4-class Fine-Grained Classification: We divide the existing DeepFake datasets into four categories: (1) face-swap, (2) face-reenactment, (3) fully synthetic, and (4) real, and perform 4-class fine-grained classification using UNITE. Performance gains are highlighted in green. The results show that UNITE can effectively classify videos into these four categories.

Ablation Study


Ablation of Loss Functions: Ablation results showing the effect of changing the loss functions used to train the UNITE model. The combination of the cross-entropy (CE) and attention-diversity (AD) losses always performs best, and the contribution of the AD-loss component increases significantly when fully synthetic data is used for training.


No. of Frames vs. Performance: Performance analysis of UNITE as a function of the number of frames sampled per video segment. As the number of frames increases from 1 to 64 (the full context window), detection accuracy improves, illustrating UNITE's ability to effectively capture temporal inconsistencies in fake videos.
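As a concrete illustration of the setting varied in this ablation, the snippet below uniformly samples up to 64 frame indices from a video segment. Whether the paper uses uniform sampling is an assumption; the function is purely illustrative.

import numpy as np

def sample_frame_indices(num_video_frames, num_samples=64):
    """Evenly sample frame indices across a video segment (illustrative only)."""
    if num_video_frames <= num_samples:
        return list(range(num_video_frames))
    # Evenly spaced indices spanning the whole segment.
    return np.linspace(0, num_video_frames - 1, num_samples).round().astype(int).tolist()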


Transformer Depth Evaluation: Performance comparison of UNITE when varying the number of encoder blocks (depth). UNITE performs best in cross-domain settings at a depth of 4; greater depths overfit to the training domains (FF++ and GTA-V), while a depth of 2 is insufficient to capture the complexity of the data.


Ablation of Foundation Model Backbone: Results (accuracy) obtained by UNITE when trained with DINOv2 features of FF++ and GTA-V instead of SigLIP-So400m. The results show that the performance gain indeed comes from the AD-loss implementation and is not dependent on the choice of the foundation model.


t-SNE Analysis: t-SNE plots with (w/) and without (w/o) AD-loss in cross-dataset settings.

The results from the t-SNE plots reveal a significant improvement in class separability when the AD-loss is incorporated into the training process. The features learned with AD-loss exhibit a clearer distinction between real and fake samples, indicating that the AD-loss helps to create a more discriminative feature space. This enhanced separability is particularly notable in cross-dataset settings, where the model is trained on one dataset but evaluated on another. The improved class separation in these scenarios suggests that the AD-loss not only enhances the model's performance on the training dataset but also improves its generalizability across different datasets.
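For completeness, a minimal sketch of how such a visualization can be produced with scikit-learn is shown below. The choice of layer from which the features are taken, as well as the t-SNE perplexity and other settings, are assumptions and not taken from the paper.

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_tsne(features, labels, title="t-SNE of video features"):
    """Project video-level features to 2-D and colour points by real/fake label."""
    # features: (N, D) array of embeddings (e.g. from the layer before the classifier);
    # labels: (N,) array with 0 = real, 1 = fake.
    coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)
    for cls, name in [(0, "real"), (1, "fake")]:
        mask = labels == cls
        plt.scatter(coords[mask, 0], coords[mask, 1], s=5, label=name)
    plt.legend()
    plt.title(title)
    plt.show()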




Ablation of AD-Loss Hyperparameters: Performance comparison of UNITE across varying values of the (a) \(\delta_{within}\) and (b) \(\delta_{between}\) hyperparameters. The results indicate that the model's learning is relatively robust to changes in these hyperparameters. Specifically, in (a) the results are consistent as long as the signs of the first and second parameters of \(\delta_{within}\) are opposite.
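The exact form of the AD-loss is not reproduced on this page. Purely for intuition, the sketch below shows one plausible margin-based formulation: it treats \(\delta_{between}\) as a margin on the pairwise similarity between the spatial attention maps of different heads and \(\delta_{within}\) as a margin on how concentrated a single head's attention may become. Both interpretations, the hinge form, and the default values are assumptions; the actual definition is given in the paper and supplementary material.

import torch
import torch.nn.functional as F

def attention_diversity_loss(attn, delta_within=0.5, delta_between=0.3):
    """Illustrative margin-based diversity penalty on spatial attention maps.

    attn: (B, H, S) attention weights, one distribution over S spatial
    locations per head. This is NOT the paper's exact AD-loss formulation.
    """
    _, num_heads, _ = attn.shape

    # "Within" term: hinge on how peaked each head's attention is, discouraging
    # a head from collapsing onto a single region (e.g. the face).
    peakedness = attn.max(dim=-1).values                  # (B, H)
    loss_within = F.relu(peakedness - delta_within).mean()

    # "Between" term: hinge on pairwise cosine similarity between heads,
    # pushing different heads toward different spatial regions.
    a = F.normalize(attn, dim=-1)                         # (B, H, S)
    sim = torch.einsum("bhs,bgs->bhg", a, a)              # (B, H, H)
    sim = sim - torch.eye(num_heads, device=attn.device)  # drop self-similarity
    loss_between = F.relu(sim - delta_between).mean()

    return loss_within + loss_between

In training, such a term would be combined with cross-entropy, e.g. total = CE + λ·AD, with λ weighting the diversity penalty (again an assumption about the exact objective).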


Evolution of detection performance under different ablation settings: This table highlights the impact of various modifications to the UNITE training pipeline on detection performance (accuracy) across multiple datasets. Starting from a base model that applies simple average pooling to SigLIP-So400m features, we show the effect of switching the architecture to a transformer, incorporating synthetic data into training, and adding the proposed AD-loss. These changes progressively enhance performance, with the addition of AD-loss achieving near-perfect or significantly improved results across all datasets.

NOTE: More results and analysis are available in our supplementary material.

Authors

Rohit Kundu

Hao Xiong

Vishal Mohanty

Athula Balachandran

Amit K. Roy-Chowdhury

BibTeX

@inproceedings{kundu2025towards,
  title={Towards a Universal Synthetic Video Detector: From Face or Background Manipulations to Fully AI-Generated Content},
  author={Kundu, Rohit and Xiong, Hao and Mohanty, Vishal and Balachandran, Athula and Roy-Chowdhury, Amit K},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition (CVPR) Conference},
  pages={28050--28060},
  year={2025}
}

Copyright: CC BY-NC-SA 4.0 © Rohit Kundu | Last updated: 07 June 2025 | Website credits to Nerfies