1. School of Computer Science and Technology, Xidian University, Xi'an 710071, Shaanxi, China
2. The 54th Research Institute of China Electronics Technology Group Corporation, Shijiazhuang 050081, Hebei, China
TIAN Changning, male, research interests: multimodal affective computing; e-mail: cntian@stu.xidian.edu.cn.
WANG Di, female, associate professor and doctoral supervisor, research interests: affective computing and multimodal machine learning; e-mail: wangdi@xidian.edu.cn.
Print publication date: 2024-04-25
Received: 2023-12-09
TIAN Changning, HE Yuzheng, WANG Di, et al. Multi-subspace multimodal sentiment analysis method based on Transformer[J]. Journal of Northwest University (Natural Science Edition), 2024,54(2):156-167. DOI: 10.16152/j.cnki.xdxbzr.2024-02-002.
Multimodal sentiment analysis aims to recognize the emotions expressed by people in a video from textual, visual, and acoustic information. Most existing methods learn multimodal consistency information by designing complex fusion schemes while ignoring the differentiated information within and across modalities, so the multimodal fusion representation lacks complementary information. To address this, we propose a multi-subspace Transformer fusion network for multimodal sentiment analysis (MSTFN). The method maps each modality into private and shared subspaces to obtain modality-private and modality-shared representations, thereby learning both the differentiated and the unified information of each modality. Specifically, the initial feature representation of each modality is first mapped into its own private and shared subspaces, learning a private representation that carries the modality's unique information and a shared representation that carries the unified information. Second, with the roles of the text and audio modalities strengthened, a binary collaborative-attention cross-modal Transformer module is designed to obtain text-based and audio-based tri-modal representations. Then, the final representation of each modality is generated from its private and shared representations, and these final representations are fused pairwise into bimodal representations to further complement the multimodal fusion representation. Finally, the unimodal, bimodal, and tri-modal representations are concatenated as the final multimodal feature for sentiment prediction. Experimental results on two benchmark multimodal sentiment analysis datasets show that, compared with the best baseline method, the proposed method improves binary classification accuracy by 0.0256/0.0143 and 0.0007/0.0023, respectively.
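To make the pipeline described above concrete, the following is a minimal PyTorch sketch of the stated steps: private/shared subspace projection per modality, text- and audio-guided co-attention over the other modalities, pairwise bimodal fusion, and concatenation of unimodal, bimodal, and tri-modal features for prediction. All module choices, feature dimensions, pooling, and layer settings are illustrative assumptions; the paper's actual feature extractors, layer counts, and training losses are not specified in this abstract.

```python
# Minimal sketch of the MSTFN pipeline (assumptions: dimensions, pooling,
# and use of nn.MultiheadAttention are illustrative, not the paper's exact design).
import torch
import torch.nn as nn


class MSTFNSketch(nn.Module):
    def __init__(self, d_t=768, d_a=74, d_v=35, d=128):
        super().__init__()
        dims = [("t", d_t), ("a", d_a), ("v", d_v)]
        # Step 1: per-modality private and shared subspace projections.
        self.priv = nn.ModuleDict({m: nn.Linear(dim, d) for m, dim in dims})
        self.shared = nn.ModuleDict({m: nn.Linear(dim, d) for m, dim in dims})
        # Step 2: co-attention blocks where text and audio act as queries.
        self.coatt_t = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.coatt_a = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        # Step 4: prediction head over concatenated uni-/bi-/tri-modal features.
        self.head = nn.Sequential(nn.Linear(d * 8, d), nn.ReLU(), nn.Linear(d, 1))

    def forward(self, x_t, x_a, x_v):  # inputs: (batch, seq_len, feat_dim)
        # Unimodal final representation = private part + shared part (mean-pooled).
        uni = {}
        for m, x in [("t", x_t), ("a", x_a), ("v", x_v)]:
            uni[m] = self.priv[m](x).mean(dim=1) + self.shared[m](x).mean(dim=1)
        # Step 3: pairwise (bimodal) fusion, here by element-wise interaction.
        bi_ta, bi_tv, bi_av = uni["t"] * uni["a"], uni["t"] * uni["v"], uni["a"] * uni["v"]
        # Text- and audio-guided tri-modal representations via cross-modal attention.
        seq = {m: self.shared[m](x) for m, x in [("t", x_t), ("a", x_a), ("v", x_v)]}
        ctx_t = torch.cat([seq["a"], seq["v"]], dim=1)          # text queries audio+video
        tri_t, _ = self.coatt_t(seq["t"], ctx_t, ctx_t)
        ctx_a = torch.cat([seq["t"], seq["v"]], dim=1)          # audio queries text+video
        tri_a, _ = self.coatt_a(seq["a"], ctx_a, ctx_a)
        # Concatenate unimodal, bimodal, and tri-modal features for sentiment regression.
        feats = torch.cat([uni["t"], uni["a"], uni["v"], bi_ta, bi_tv, bi_av,
                           tri_t.mean(dim=1), tri_a.mean(dim=1)], dim=-1)
        return self.head(feats)


if __name__ == "__main__":
    # Example with CMU-MOSI-like feature dimensions (assumed).
    model = MSTFNSketch()
    out = model(torch.randn(2, 50, 768), torch.randn(2, 50, 74), torch.randn(2, 50, 35))
    print(out.shape)  # torch.Size([2, 1])
```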
Keywords: multimodal sentiment analysis; Transformer structure; multiple subspaces; multi-head attention mechanism