ISSN 1671-3710
CN 11-4766/R
Sponsored by: Institute of Psychology, Chinese Academy of Sciences
Published by: Science Press

Advances in Psychological Science ›› 2024, Vol. 32 ›› Issue (1): 1-13. doi: 10.3724/SP.J.1042.2024.00001

• Conceptual Framework •

Cross-modal analysis of facial EMG in micro-expressions and data annotation algorithm

WANG Su-Jing1,2, WANG Yan1,2, LI Jingting1,2, DONG Zizhao1,2, ZHANG Jianhang3, LIU Ye2,4

  1. CAS Key Laboratory of Behavioral Science, Institute of Psychology, Beijing 100101, China
  2. Department of Psychology, University of Chinese Academy of Sciences, Beijing 100049, China
  3. School of Computer Science, Jiangsu University of Science and Technology, Zhenjiang 212003, China
  4. State Key Laboratory of Brain and Cognitive Science, Institute of Psychology, Chinese Academy of Sciences, Beijing 100039, China
  • Received: 2023-06-25  Online: 2024-01-15  Published: 2023-10-25
  • Contact: WANG Su-Jing  E-mail: wangsujing@psych.ac.cn

Abstract:

Micro-expression analysis combined with deep learning has become a major research trend. However, the small-sample problem has long hindered further progress in deep-learning-based micro-expression analysis. Micro-expressions are brief, subtle facial expressions, so annotating micro-expression data is very costly in both time and labor, and this cost is the root of the small-sample problem. To further improve the performance of micro-expression spotting and recognition, a large number of micro-expression samples is still needed to train deep learning models, so there is an urgent need in this research direction to solve the problem of micro-expression data annotation. To address this issue, our research uses facial electromyography (EMG) as a technical means and proposes a set of solutions covering three aspects: automatic annotation, semi-automatic annotation, and unsupervised annotation of micro-expression data.

First, using methods from physiological psychology, we combine facial EMG signals with behavioral and cognitive psychology experiments to explore the physiological characteristics of micro-expressions. In this study, we record the signal frequency and amplitude during the contraction of facial muscles or muscle groups, and use the relevant EMG metrics to accurately and objectively quantify the three defining features of micro-expressions: short presentation time, small movement amplitude, and asymmetry. These measurements provide a theoretical basis for the subsequent research on micro-expression annotation and intelligent analysis.
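
To make the quantification concrete, the sketch below shows how the three features might be computed from facial EMG channels. It is illustrative only: the sampling rate, activation threshold, and asymmetry index are our assumptions, not the study's published measures.

```python
import numpy as np

FS = 1000  # assumed EMG sampling rate (Hz)

def rms_amplitude(emg: np.ndarray) -> float:
    """Root-mean-square amplitude of an EMG burst (movement amplitude)."""
    return float(np.sqrt(np.mean(emg ** 2)))

def burst_duration(emg: np.ndarray, threshold: float) -> float:
    """Seconds the rectified signal stays above an activation threshold
    (presentation time)."""
    return float(np.sum(np.abs(emg) > threshold)) / FS

def asymmetry_index(left: np.ndarray, right: np.ndarray) -> float:
    """Normalized left-right RMS difference; 0 means symmetric activation."""
    l, r = rms_amplitude(left), rms_amplitude(right)
    return (l - r) / (l + r + 1e-12)

# Example: a 200 ms, low-amplitude burst that is stronger on the left side.
t = np.arange(0, 0.2, 1 / FS)
left = 0.05 * np.sin(2 * np.pi * 80 * t)   # 80 Hz burst, left muscle site
right = 0.04 * np.sin(2 * np.pi * 80 * t)  # weaker mirror-site activity
print(burst_duration(left, 0.01), asymmetry_index(left, right))
```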

Second, for automatic annotation, this study proposes a scheme based on distal facial electromyography. Specifically, EMG electrodes are deployed around the face so that the facial expression being displayed is not obscured, which allows micro-expression data to be annotated automatically by combining the EMG information with the video. Meanwhile, we design a psychological paradigm for inducing facial muscle movements and, based on the EMG signal pattern of micro-expressions, develop an algorithm for automatic micro-expression annotation. Finally, we integrate the automatic annotation pipeline into interactive annotation software, which greatly reduces the time needed for micro-expression annotation, lightens the workload of micro-expression coders, and alleviates the small-sample problem of micro-expression databases to a certain extent.
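
As a rough illustration of the automatic-annotation idea, the following sketch detects muscle-activation intervals in an EMG envelope and maps them onto synchronized video frames. The thresholding rule, window size, sampling rate, and frame rate are all assumptions; the study's actual annotation algorithm is not spelled out in this abstract.

```python
import numpy as np

def moving_rms(emg: np.ndarray, win: int = 50) -> np.ndarray:
    """Smoothed activation envelope: moving RMS over `win` samples."""
    sq = np.convolve(emg ** 2, np.ones(win) / win, mode="same")
    return np.sqrt(sq)

def detect_intervals(emg: np.ndarray, k: float = 3.0):
    """Return (onset, offset) sample-index pairs where the envelope exceeds
    mean + k * std; assumes the signal starts and ends below threshold."""
    env = moving_rms(emg)
    thr = env.mean() + k * env.std()
    active = env > thr
    edges = np.diff(active.astype(int))
    onsets = np.where(edges == 1)[0] + 1
    offsets = np.where(edges == -1)[0] + 1
    return list(zip(onsets, offsets))

def sample_to_frame(sample_idx: int, fs: int = 1000, fps: int = 200) -> int:
    """Map an EMG sample index to the time-synchronized video frame index."""
    return int(sample_idx * fps / fs)

# Example: label each detected EMG burst with video-frame boundaries.
emg = np.random.randn(5000) * 0.01
emg[2000:2150] += np.sin(2 * np.pi * 80 * np.arange(150) / 1000)  # injected burst
for onset, offset in detect_intervals(emg):
    print("frames", sample_to_frame(onset), "to", sample_to_frame(offset))
```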

Third, for semi-automatic annotation, we focus on micro-expression temporal localization (METL), i.e., inferring the onset and offset frames of a micro-expression segment from the manual annotation of a single frame within that segment. In particular, we propose a Micro-Expression Contrastive Identification Annotation (MECIA) method as a solution to METL. The backbone of MECIA is a deep learning network containing three modules, namely a contrastive module, an identification module, and an annotation module, corresponding to the three steps of manual annotation. The network's outputs yield the temporal localization of micro-expression clips. Experiments show that the inferred micro-expression intervals correspond well to the ground-truth intervals, demonstrating the potential of this approach to improve the efficiency of vision-based micro-expression annotation.
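
The PyTorch skeleton below illustrates one plausible arrangement of the three named MECIA modules. Only the module names come from the text above; every layer choice, shape, and the single-frame query mechanism here is an assumption made for illustration.

```python
import torch
import torch.nn as nn

class MECIA(nn.Module):
    """Schematic sketch of the three named modules (not the authors' code)."""
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        # Contrastive module: embeds frames so that frames inside the same
        # micro-expression lie close to the manually annotated query frame.
        self.contrastive = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.ReLU(), nn.Linear(feat_dim, 128)
        )
        # Identification module: scores each frame as belonging to the
        # micro-expression that contains the annotated frame.
        self.identification = nn.Linear(128, 1)
        # Annotation module: predicts per-frame onset/offset boundary logits.
        self.annotation = nn.Linear(128, 2)

    def forward(self, frame_feats: torch.Tensor, query_idx: int):
        # frame_feats: (T, feat_dim) features of a T-frame video clip
        z = self.contrastive(frame_feats)                  # (T, 128)
        sim = z @ z[query_idx]                             # similarity to query frame
        member = self.identification(z).squeeze(-1) + sim  # (T,) membership scores
        boundaries = self.annotation(z)                    # (T, 2) onset/offset logits
        return member, boundaries

model = MECIA()
feats = torch.randn(64, 256)                 # 64-frame clip, annotated at frame 30
member, boundaries = model(feats, query_idx=30)
onset = int(boundaries[:, 0].argmax()); offset = int(boundaries[:, 1].argmax())
```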

Fourth, for unsupervised annotation, given the limited number of annotated micro-expression samples, we propose a self-supervised micro-expression analysis algorithm trained on massive unannotated face and expression videos. Specifically, we derive time-domain supervision for unannotated face videos from the correspondence between facial EMG and facial expressions, and design a Transformer-based self-supervised model for cross-modal contrastive learning, which uses EMG signals to steer the network toward features that capture the action change patterns of micro-expressions. The introduction of EMG signals helps the contrastive learning model capture weak dynamic facial changes in the time domain, and this EMG-informed self-supervised model strengthens the model's understanding of visual features. In addition, cross-modal learning allows the model to learn more generalized features and enhances the robustness of the system.
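
A minimal sketch of the cross-modal contrastive objective described above: time-aligned EMG and video segments form positive pairs, and all other pairings in the batch serve as negatives. The embedding dimension and temperature are assumptions, and the encoders producing the embeddings are omitted; the abstract specifies only that the model is Transformer-based.

```python
import torch
import torch.nn.functional as F

def info_nce(video_emb: torch.Tensor, emg_emb: torch.Tensor,
             tau: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of time-aligned (video, EMG) pairs."""
    v = F.normalize(video_emb, dim=-1)   # (B, D) video-segment embeddings
    e = F.normalize(emg_emb, dim=-1)     # (B, D) EMG-segment embeddings
    logits = v @ e.t() / tau             # (B, B) pairwise similarities
    targets = torch.arange(v.size(0), device=v.device)
    # Diagonal entries are the aligned (positive) pairs.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Example with random stand-in embeddings from hypothetical encoders.
loss = info_nce(torch.randn(32, 128), torch.randn(32, 128))
loss.backward() if loss.requires_grad else print(float(loss))
```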

Key words: image annotation, micro-expression analysis, distal facial electromyography, micro-expression data annotation
