An enhanced speech emotion recognition using vision transformer.

Abstract

In human-computer interaction systems, speech emotion recognition (SER) plays a crucial role because it enables computers to understand and react to users’ emotions. Historically, SER has relied heavily on acoustic properties extracted from speech signals. Recent developments in deep learning and computer vision, however, have made it possible to use visual representations of speech signals to enhance SER performance. This work proposes a novel method for improving speech emotion recognition based on a lightweight Vision Transformer (ViT) model. We leverage the ViT model’s ability to capture spatial dependencies and high-level features in images, here the mel spectrograms fed into the model, which are adequate indicators of emotional states. To evaluate the proposed approach, we conduct comprehensive experiments on two benchmark speech emotion datasets, the Toronto Emotional Speech Set (TESS) and the Berlin Emotional Database (EMODB). The results demonstrate a considerable improvement in speech emotion recognition accuracy and attest to the method’s generalizability: it achieved 98% accuracy on TESS, 91% on EMODB, and 93% on the combined TESS-EMODB set. The comparative experiments show that the non-overlapping patch-based feature extraction method substantially improves speech emotion recognition. Our research indicates the potential for integrating vision transformer models into SER systems, opening up fresh opportunities for real-world applications that require more accurate emotion recognition from speech than other state-of-the-art techniques provide.

© 2024. The Author(s).
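The non-overlapping patch-based feature extraction mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The function name `extract_patches`, the 16 x 16 patch size, the 128 mel bands, and the random array standing in for a real mel spectrogram are all assumptions made for illustration, since the abstract does not state these details.

```python
import numpy as np


def extract_patches(spec: np.ndarray, patch: int) -> np.ndarray:
    """Split a 2-D spectrogram into flattened, non-overlapping patch x patch tiles.

    Rows and columns that do not fill a whole tile are cropped away
    (an assumption; padding would be an equally valid choice).
    """
    height, width = spec.shape
    rows, cols = height // patch, width // patch
    if rows == 0 or cols == 0:
        raise ValueError("spectrogram is smaller than one patch")
    cropped = spec[: rows * patch, : cols * patch]
    # (rows, patch, cols, patch) -> (rows, cols, patch, patch)
    tiles = cropped.reshape(rows, patch, cols, patch).swapaxes(1, 2)
    # One flattened vector per tile, in row-major tile order.
    return tiles.reshape(rows * cols, patch * patch)


# Hypothetical stand-in for a mel spectrogram: 128 mel bands x 100 frames.
rng = np.random.default_rng(0)
mel = rng.random((128, 100))

tokens = extract_patches(mel, patch=16)
print(tokens.shape)  # (48, 256): 8 x 6 tiles, each with 16 * 16 values
```

In a ViT-style model, each row of `tokens` would typically be passed through a learned linear projection and combined with a positional embedding before entering the transformer encoder.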
