Abstract

The rapid evolution of artificial intelligence and generative models has made it possible to produce hyper-realistic fake media, commonly known as deepfakes. Although such technology has potential uses in entertainment and education, it poses a major threat when used for misinformation, identity theft, or defamation. This research addresses the critical need for robust and scalable detection mechanisms by investigating multimodal deep learning methods for identifying deepfakes in real-world applications. The work provides an in-depth comparison of deep learning models, including convolutional neural networks (CNNs), long short-term memory (LSTM) networks, and transformer models, for detecting fake video content. The novelty of this work lies in its focus on multimodal information, examining both visual and auditory features for improved detection. The models are trained and evaluated on benchmark datasets such as FaceForensics++ and the Deepfake Detection Challenge (DFDC), ensuring diversity and realism during testing. Experimental results demonstrate that multimodal methods substantially outperform unimodal models, especially in identifying subtle forgeries under adverse conditions such as compression and occlusion. Among the configurations evaluated, a hybrid model integrating ResNet-50 for visual frames and a Bi-LSTM for audio streams achieved 94.6% accuracy on the DFDC test set and exhibited strong generalizability. In addition, this research identifies key challenges in real-world deployment, including adversarial attacks, dataset bias, and computational cost, and introduces methods such as data augmentation, domain adaptation, and model compression to address them without severely degrading performance.
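
To make the hybrid architecture concrete, the following is a minimal sketch in PyTorch (an assumed framework; the abstract does not name one). It pairs a ResNet-50 frame encoder with a bidirectional LSTM over per-step audio features and fuses the two embeddings for binary real-versus-fake classification. All layer sizes, input shapes, and the late-fusion strategy are illustrative assumptions, not the authors' exact configuration.

```python
# Hedged sketch of a ResNet-50 + Bi-LSTM multimodal deepfake detector.
# Dimensions (2048-d visual embedding, 40-d audio features, 128 LSTM units)
# are illustrative assumptions, not values taken from the paper.
import torch
import torch.nn as nn
from torchvision.models import resnet50

class HybridDeepfakeDetector(nn.Module):
    def __init__(self, audio_feat_dim=40, lstm_hidden=128):
        super().__init__()
        # Visual branch: ResNet-50 backbone with its final FC layer removed,
        # producing a 2048-d embedding per video frame.
        backbone = resnet50(weights=None)
        self.visual_encoder = nn.Sequential(*list(backbone.children())[:-1])
        # Audio branch: bidirectional LSTM over per-step audio features
        # (e.g., MFCC vectors extracted from the soundtrack).
        self.audio_encoder = nn.LSTM(
            input_size=audio_feat_dim,
            hidden_size=lstm_hidden,
            batch_first=True,
            bidirectional=True,
        )
        # Late fusion: concatenate the two embeddings, then classify.
        self.classifier = nn.Sequential(
            nn.Linear(2048 + 2 * lstm_hidden, 256),
            nn.ReLU(),
            nn.Linear(256, 1),  # single real-vs-fake logit
        )

    def forward(self, frames, audio):
        # frames: (B, T, 3, H, W) sampled video frames
        # audio:  (B, T_a, audio_feat_dim) audio feature sequence
        b, t = frames.shape[:2]
        v = self.visual_encoder(frames.flatten(0, 1))   # (B*T, 2048, 1, 1)
        v = v.flatten(1).view(b, t, -1).mean(dim=1)     # average over frames
        _, (h, _) = self.audio_encoder(audio)           # h: (2, B, lstm_hidden)
        a = torch.cat([h[0], h[1]], dim=1)              # forward + backward states
        return self.classifier(torch.cat([v, a], dim=1))  # (B, 1) logit
```

In this sketch, frame embeddings are simply averaged over time and fused with the audio embedding by concatenation; attention-based or score-level fusion are equally plausible readings of "integrating" the two streams.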
