1. School of Electronics and Information Engineering, Anhui Jianzhu University, Hefei 230601, Anhui, China
2. Anhui International Joint Research Center for Intelligent Perception and High-Dimensional Modeling of Ancient Architecture, Hefei 230601, Anhui, China
WANG Kunxia, female, Ph.D., associate professor; research interest: affective computing in artificial intelligence; kxwang@ahjzu.edu.cn.
Print publication date: 2024-04-25
Received: 2023-10-18
王坤侠, 余万成, 胡玉霞. 嵌入混合注意力机制的Swin Transformer人脸表情识别[J]. 西北大学学报(自然科学版), 2024,54(2):168-176. DOI: 10.16152/j.cnki.xdxbzr.2024-02-003.
WANG Kunxia, YU Wancheng, HU Yuxia. Facial expression recognition in Swin Transformer by embedding hybrid attention mechanism[J]. Journal of Northwest University (Natural Science Edition), 2024,54(2):168-176. DOI: 10.16152/j.cnki.xdxbzr.2024-02-003.
Facial expression recognition is an important research direction in psychology, with applications in fields such as transportation, medical care, security, and criminal investigation. To address the limitations of convolutional neural networks (CNNs) in extracting global features of facial expressions, this paper proposes a facial expression recognition method based on a Swin Transformer embedded with a hybrid attention mechanism. With the Swin Transformer as the backbone network, a hybrid attention module is embedded in the fusion layer (Patch Merging) of Stage 3 of the model, enabling effective extraction of both global and local features of facial expressions. First, the hierarchical Swin Transformer model effectively captures deep global feature information. Second, the embedded hybrid attention module combines channel and spatial attention mechanisms, extracting features along both the channel and spatial dimensions so that the model better captures local feature information. In addition, transfer learning is used to initialize the network weights, improving the model's accuracy and generalization ability. The proposed method achieves recognition accuracies of 73.63%, 87.01%, and 98.28% on the three public datasets FER2013, RAF-DB, and JAFFE, respectively, demonstrating strong recognition performance.
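The abstract does not include implementation details, but the hybrid attention module it describes (channel attention plus spatial attention, applied to feature maps before Patch Merging) can be sketched along the lines of a CBAM-style design. The class names, the reduction ratio, and the channel-then-spatial ordering below are assumptions for illustration, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Weight each channel by a score pooled over the spatial dims (SE/CBAM-style)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        avg = x.mean(dim=(2, 3))                         # (B, C) average pooling
        mx = x.amax(dim=(2, 3))                          # (B, C) max pooling
        w = torch.sigmoid(self.mlp(avg) + self.mlp(mx))  # shared MLP, fused scores
        return x * w[:, :, None, None]

class SpatialAttention(nn.Module):
    """Weight each spatial position using a conv over channel-pooled maps."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        avg = x.mean(dim=1, keepdim=True)                # (B, 1, H, W)
        mx = x.amax(dim=1, keepdim=True)                 # (B, 1, H, W)
        w = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * w

class HybridAttention(nn.Module):
    """Channel attention followed by spatial attention; output shape equals input shape."""
    def __init__(self, channels: int):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.sa(self.ca(x))
```

Because Swin Transformer stages operate on token sequences of shape (B, H×W, C), embedding such a module in the Stage 3 Patch Merging layer would require reshaping tokens to (B, C, H, W) before applying it and flattening back afterward; since the module preserves the input shape, it can be inserted without altering the rest of the network.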
Keywords: expression recognition; Transformer; attention mechanism; transfer learning