Actor-Independent Emotion Recognition: A Transformer-Based Approach to Speech and Facial Feature Fusion
Conference
Rojas, F, Madanian, S, Templeton, JM et al. (2025). Actor-Independent Emotion Recognition: A Transformer-Based Approach to Speech and Facial Feature Fusion. 203-209. 10.1109/ICDH67620.2025.00037
This study explores the feasibility of combining speech and facial expression features for emotion detection using the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). We experiment with transfer learning, Transformer-based architectures, 3D convolutional neural networks (3DCNNs), and positional encoders. Our proposed method is an end-to-end late-fusion multimodal training strategy capable of handling heterogeneous sampling rates, achieving an accuracy of 78.10% under 5-fold cross-validation and outperforming unimodal architectures. Key methodological considerations include the use of actor-independent splits to prevent data leakage, together with post-hoc performance analysis at the individual-subject level.
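The actor-independent splitting described in the abstract can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the fold-assignment helper, the seed, and the 24-actors-by-60-clips layout are assumptions chosen to mirror how RAVDESS groups clips by actor. The key property is that every clip from a given actor lands entirely in either the train or the test partition of each fold.

```python
import numpy as np

def actor_independent_folds(actors, n_splits=5, seed=0):
    """Assign whole actors to folds so no actor spans train and test.

    actors: array of per-clip actor IDs. Returns a list of
    (train_indices, test_indices) pairs, one per fold.
    """
    ids = np.unique(actors)
    rng = np.random.default_rng(seed)
    rng.shuffle(ids)
    # Each group of actors is held out exactly once
    held_out_groups = np.array_split(ids, n_splits)
    folds = []
    for held_out in held_out_groups:
        test_mask = np.isin(actors, held_out)
        folds.append((np.where(~test_mask)[0], np.where(test_mask)[0]))
    return folds

# Illustrative layout: 24 actors with 60 clips each (sizes are placeholders)
actors = np.repeat(np.arange(24), 60)
for train_idx, test_idx in actor_independent_folds(actors):
    # No actor appears in both partitions, preventing identity leakage
    assert set(actors[train_idx]).isdisjoint(actors[test_idx])
```

In practice the same grouping behaviour is available off the shelf via scikit-learn's `GroupKFold` with the actor ID passed as `groups`.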