1. College of Automation, Qingdao University, Qingdao 266071, Shandong, China
2. Institute for Future, Qingdao University, Qingdao 266071, Shandong, China
3. Shandong Key Laboratory of Industrial Control Technology, Qingdao 266071, Shandong, China
WU Xiao, male, research interest: multimodal emotion recognition, 3208602731@qq.com.
LIU Xiaorui, male, Ph.D., research interests: robotics and human-robot social interaction, liuxiaorui@qdu.edu.cn.
Published in print: 2024-04-25
Received: 2023-10-13
WU Xiao, MOU Xuan, LIU Yinhua, et al. A multimodal emotion recognition algorithm based on speech, text and facial expression[J]. Journal of Northwest University (Natural Science Edition), 2024,54(2):177-187. DOI: 10.16152/j.cnki.xdxbzr.2024-02-004.
Aiming at the problems of low recognition accuracy and poor generalization ability of current multimodal emotion recognition algorithms in modal feature extraction and inter-modal information fusion, a multimodal emotion recognition algorithm based on speech, text and facial expression is proposed. First, a shallow feature extraction network (Sfen) combined with a parallel convolution module (Pconv) is designed to extract emotional features from speech and text, and a modified Inception-ResnetV2 model is adopted to capture expression features from video sequences. Second, to strengthen the correlation among modalities, a cross-attention module is designed to optimize the fusion of the speech and text modalities. Finally, a bidirectional long short-term memory module based on the attention mechanism (BiLSTM-Attention) is used to focus on key information and preserve the temporal correlation between modalities. By comparing different combinations of the three modalities, it is found that fusing speech and text features in advance significantly improves recognition accuracy. Experimental results on the public emotion datasets CH-SIMS and CMU-MOSI show that the proposed model achieves higher recognition accuracy than the baseline models, with three-class and binary accuracy reaching 97.82% and 98.18% respectively, demonstrating the effectiveness of the model.
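The cross-attention fusion described above can be illustrated with a minimal NumPy sketch. This is an assumption-laden simplification, not the authors' exact module: the paper's design presumably includes learned query/key/value projections and operates inside a trained network, while the sketch below shows only the core scaled dot-product step, with text features as queries attending to speech features as keys and values. All names (`cross_attention`, `text_feats`, `speech_feats`) are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_feats, speech_feats):
    """Scaled dot-product cross-attention:
    text time steps (queries) attend over speech time steps (keys/values)."""
    d_k = text_feats.shape[-1]
    scores = text_feats @ speech_feats.T / np.sqrt(d_k)   # (T_text, T_speech)
    weights = softmax(scores, axis=-1)                    # rows sum to 1
    return weights @ speech_feats                         # (T_text, d) fused features

# toy example: 4 text steps and 6 speech steps, 8-dim features
rng = np.random.default_rng(0)
text = rng.standard_normal((4, 8))
speech = rng.standard_normal((6, 8))
fused = cross_attention(text, speech)
print(fused.shape)  # (4, 8): one speech-conditioned vector per text step
```

The fused sequence keeps the text-side time axis, which is what lets a downstream recurrent module (such as the BiLSTM-Attention block described above) consume it while preserving temporal order.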
Keywords: multimodal emotion recognition; parallel convolution; cross attention