A Speech Emotion Recognition Approach Using Discrete Wavelet Transform and Deep Learning Techniques in a Brazilian Portuguese Corpus

Authors

Greca Vieira, R.; Alaniz Macedo, A.

DOI:

https://doi.org/10.22456/2175-2745.143123

Keywords:

natural language processing, speech emotion recognition, deep learning, convolutional neural network, discrete wavelet transform

Abstract

The study of emotion encompasses the cognitive processes and psychological states of the human mind. With the rapid advancement and declining cost of technology, researchers have become increasingly focused on capturing voice, gestures, facial expressions, and other expressions of emotion. In this study, we combine a deep learning model with the Discrete Wavelet Transform for Speech Emotion Recognition, the task of detecting and identifying emotions in informal, spontaneous speech, here drawn from a Brazilian Portuguese corpus. Our approach achieves a macro F1-score of 0.566 and a ROC-AUC score of 0.7217 on the CORAA database, surpassing by up to 11% in macro F1 a work presented at the International Conference on Computational Processing of Portuguese Language 2022 that uses the same architecture together with transfer learning techniques. Our methodology integrates a deep learning model with advanced signal processing techniques: we leverage a large-scale neural network architecture pre-trained for audio analysis and incorporate Discrete Wavelet Transform and Mel spectrogram features to enhance the model's performance. Additionally, we apply the SpecAugment technique for effective data augmentation. Among the works presented at the event, our approach ranks second overall and first among methods that do not use open-set techniques, such as training on other datasets or applying transfer learning, and it is one of the few that exceeded the proposed baselines.
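To make the feature pipeline concrete, the sketch below illustrates the three ingredients named in the abstract: DWT sub-band features, a log-Mel spectrogram, and SpecAugment-style masking. It is a minimal illustration under stated assumptions, not the authors' released code: the wavelet family (db4), decomposition level, Mel parameters, and mask sizes are example choices, and the PyWavelets and librosa libraries stand in for whatever tooling the paper actually used.

import numpy as np
import librosa  # audio loading and Mel spectrogram
import pywt     # PyWavelets: Discrete Wavelet Transform


def dwt_features(signal, wavelet="db4", level=4):
    # Multilevel DWT; summarize each sub-band [cA_n, cD_n, ..., cD_1]
    # with (mean, standard deviation, energy).
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    stats = [(c.mean(), c.std(), float(np.sum(c ** 2))) for c in coeffs]
    return np.array(stats).flatten()


def log_mel(signal, sr, n_mels=64):
    # Log-Mel spectrogram, the 2-D input a PANN-style CNN consumes.
    mel = librosa.feature.melspectrogram(y=signal, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel)


def spec_augment(spec, n_freq_masks=2, n_time_masks=2, F=8, T=16, rng=None):
    # SpecAugment-style masking (Park et al., 2019): zero out random
    # frequency bands and time spans of the spectrogram.
    if rng is None:
        rng = np.random.default_rng()
    spec = spec.copy()
    n_mels, n_frames = spec.shape
    for _ in range(n_freq_masks):
        f = int(rng.integers(0, F + 1))
        f0 = int(rng.integers(0, max(1, n_mels - f)))
        spec[f0:f0 + f, :] = 0.0
    for _ in range(n_time_masks):
        t = int(rng.integers(0, T + 1))
        t0 = int(rng.integers(0, max(1, n_frames - t)))
        spec[:, t0:t0 + t] = 0.0
    return spec


if __name__ == "__main__":
    y, sr = librosa.load(librosa.ex("trumpet"), sr=16000)  # stand-in clip
    print(dwt_features(y).shape)               # (15,): 3 stats x 5 sub-bands
    print(spec_augment(log_mel(y, sr)).shape)  # (64, n_frames)

In the paper's setting, the masked log-Mel spectrogram and the wavelet features would feed the pre-trained audio network; the stand-in clip here only demonstrates the shapes the pipeline produces.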

Published

2025-08-15

How to Cite

Greca Vieira, R., & Alaniz Macedo, A. (2025). A Speech Emotion Recognition Approach Using Discrete Wavelet Transform and Deep Learning Techniques in a Brazilian Portuguese Corpus. Revista De Informática Teórica E Aplicada, 32(3), 54–65. https://doi.org/10.22456/2175-2745.143123

Issue

Vol. 32 No. 3 (2025)

Section

Regular Papers
