
Deepfake Detection Using Spatiotemporal Transformer

Online AM: 23 January 2024

Abstract

Recent advances in generative models and the availability of large-scale benchmarks have made deepfake video generation and manipulation easier. The number of new hyper-realistic deepfake videos used for malicious purposes is increasing dramatically, creating the need for effective deepfake detection methods. Although many existing deepfake detection approaches, particularly CNN-based methods, show promising results, they suffer from several drawbacks. In general, they generalize poorly to unseen or new deepfake generation methods. The crucial reason for this weakness is that CNN-based methods focus on local spatial artifacts, which are unique to each manipulation method. It is therefore hard to learn the general forgery traces of different manipulation methods without considering dependencies that extend beyond the local receptive field. To address this problem, this paper proposes a framework that combines a convolutional neural network (CNN) with a Vision Transformer (ViT) to improve detection accuracy and enhance generalizability. Our method, named HCiT, exploits the advantages of CNNs to extract meaningful local features, as well as the ViT's self-attention mechanism to explicitly learn discriminative global contextual dependencies at the frame level. In this hybrid architecture, the high-level feature maps extracted by the CNN are fed into the ViT model, which determines whether a given video is fake or real. Experiments were performed on the FaceForensics++, DeepFake Detection Challenge preview, and Celeb-DF datasets, and the results show that the proposed method significantly outperforms state-of-the-art methods. In addition, the HCiT method shows a great capacity for generalization across datasets covering various deepfake generation techniques. The source code is available at: https://github.com/KADDAR-Bachir/HCiT
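The abstract describes the hybrid design only at a high level: a CNN backbone produces high-level feature maps for each face frame, the spatial positions of those maps are treated as tokens, and a ViT-style encoder's class token drives the real/fake decision. The PyTorch sketch below illustrates that general pattern only; the backbone choice (ResNet-50), token dimension, encoder depth, head count, and 224×224 input size are illustrative assumptions and not the authors' configuration, which is available in the linked repository.

```python
# Minimal sketch (not the authors' code) of a hybrid CNN + Vision Transformer
# frame classifier in the spirit described in the abstract: a CNN extracts a
# high-level feature map, its spatial positions become tokens for a Transformer
# encoder, and the [CLS] token output is classified as real vs. fake.
# All architectural choices here are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision.models import resnet50


class HybridCNNViT(nn.Module):
    def __init__(self, embed_dim=768, depth=6, heads=12, num_classes=2):
        super().__init__()
        # CNN feature extractor: ResNet-50 up to (but not including) avgpool/fc.
        backbone = resnet50(weights=None)  # random init; no pretrained download
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])  # (B, 2048, 7, 7) for 224x224 input
        self.proj = nn.Conv2d(2048, embed_dim, kernel_size=1)      # map channels to token dimension

        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, 7 * 7 + 1, embed_dim))

        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=heads, dim_feedforward=4 * embed_dim,
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):                         # x: (B, 3, 224, 224) face crops
        feat = self.proj(self.cnn(x))             # (B, D, 7, 7)
        tokens = feat.flatten(2).transpose(1, 2)  # (B, 49, D) spatial positions as tokens
        cls = self.cls_token.expand(x.size(0), -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        out = self.encoder(tokens)
        return self.head(out[:, 0])               # real/fake logits from the [CLS] token


if __name__ == "__main__":
    model = HybridCNNViT()
    logits = model(torch.randn(2, 3, 224, 224))
    print(logits.shape)  # torch.Size([2, 2])
```

In practice, frames are typically face-cropped by a detector before being passed to such a model, and frame-level predictions are aggregated into a video-level decision.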




Published in

          ACM Transactions on Multimedia Computing, Communications, and Applications Just Accepted
ISSN: 1551-6857
EISSN: 1551-6865

          Copyright © 2024 Copyright held by the owner/author(s). Publication rights licensed to ACM.

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Online AM: 23 January 2024
          • Accepted: 8 January 2024
          • Revised: 20 November 2023
          • Received: 20 April 2023
Published in TOMM Just Accepted


          Qualifiers

          • research-article
