Actor-Independent Emotion Recognition: A Transformer-Based Approach to Speech and Facial Feature Fusion

Rojas, F., Madanian, S., Templeton, J.M. et al. (2025). Actor-Independent Emotion Recognition: A Transformer-Based Approach to Speech and Facial Feature Fusion. 203-209. 10.1109/ICDH67620.2025.00037

cited authors

  • Rojas, F; Madanian, S; Templeton, JM; Poellabauer, C; Schneider, SL; Theadom, A

abstract

  • This study explores the feasibility of combining speech and facial expression features for emotion detection using the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). We experiment with transfer learning, Transformer-based architectures, 3D convolutional neural networks (3DCNNs), and positional encoders. Our proposed method is an end-to-end late-fusion multimodal training strategy capable of handling heterogeneous sampling rates, achieving an accuracy of 78.10% under 5-fold cross-validation (5KfoldCV) and outperforming unimodal architectures. Key methodological considerations include the use of actor-independent splits to prevent data leakage and post-hoc performance analysis at the individual-subject level.
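The actor-independent splits mentioned in the abstract can be illustrated with a short sketch. This is not the authors' code; it is a minimal, hypothetical example using scikit-learn's `GroupKFold` with placeholder features, assuming RAVDESS-style data where each sample carries one of 24 actor IDs. Grouping by actor guarantees that no actor appears in both the training and test folds, which is what prevents identity leakage.

```python
# Hypothetical sketch of actor-independent 5-fold cross-validation.
# Placeholder data: 240 samples from 24 actors (RAVDESS has 24 actors),
# 8 emotion classes. Real features would be fused speech/face embeddings.
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
n_actors = 24
actors = np.repeat(np.arange(n_actors), 10)        # actor ID per sample
X = rng.normal(size=(len(actors), 8))              # placeholder features
y = rng.integers(0, 8, size=len(actors))           # placeholder labels

gkf = GroupKFold(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(gkf.split(X, y, groups=actors)):
    train_actors = set(actors[train_idx])
    test_actors = set(actors[test_idx])
    # Actor-independent: held-out actors never appear in training.
    assert train_actors.isdisjoint(test_actors)
```

With 24 actors and 5 folds, each fold holds out roughly 4-5 actors entirely, so reported accuracy reflects generalization to unseen speakers rather than memorized identities.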

publication date

  • January 1, 2025

Digital Object Identifier (DOI)

  • 10.1109/ICDH67620.2025.00037

start page

  • 203

end page

  • 209