Transformer-based Meme-sensitive Cross-modal Sentiment Analysis Using Visual-Textual Data in Social Media
Subject Areas : Neural networks and deep learning
Zahra Pakdaman 1 , Abbas Koochari 2 , Arash Sharifi 3
1 - Computer Engineering, Science and Research Branch, Islamic Azad University, Tehran, Iran
2 -
3 - Department of Computer Engineering, Mechanics, electricity and computers, Islamic Azad University - Science and Research Branch, Tehran, Iran
Keywords: Visual Sentiment Analysis, Textual Sentiment Analysis, Vision Transformer, LDA, SBERT Bi-encoder
Abstract :
Analyzing the sentiment of social media data helps to understand users’ main purpose. Because users publish data in many modalities (image, text, audio, and video), this variety can be exploited to achieve more accurate sentiment analysis. This study introduces a meme-sensitive cross-modal architecture that analyzes users’ feelings from both visual and textual data. The pipeline first distinguishes memes from regular images. For memes, the embedded text is extracted and concatenated with the user’s caption. Text and image features are then extracted with transformers: Vision Transformer (ViT) for visual features and an SBERT bi-encoder for textual features. Next, a Linear Discriminant Analysis (LDA) transformation is applied to reduce dimensionality and improve classification. Finally, two fully connected layers process the resulting vector to predict the sentiment class. In experiments, the proposed method reaches up to 90% on the MVSA-Single dataset and 80% on the MVSA-Multiple dataset, outperforming the state-of-the-art methods it is compared with.
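The inference-time data flow described in the abstract can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: the ViT and SBERT encoders are replaced by random stand-in vectors, the LDA step is shown as applying an already fitted linear projection (fitting is omitted), and all weights are random toy values. The function names, the tiny dimensions, the three class labels, and the choice to concatenate image and text features before the projection are assumptions made for illustration.

```python
import math
import random

LABELS = ["negative", "neutral", "positive"]  # assumed label set


def merge_text(caption, meme_text, is_meme):
    """Append the text embedded in a meme to the user's caption."""
    if is_meme and meme_text:
        return f"{caption} {meme_text}".strip()
    return caption


def project(x, mean, w):
    """Apply a fitted linear projection such as LDA: (x - mean) @ w."""
    centered = [xi - mi for xi, mi in zip(x, mean)]
    return [sum(c * row[j] for c, row in zip(centered, w))
            for j in range(len(w[0]))]


def dense(x, weights, bias):
    """Fully connected layer: x @ weights + bias."""
    return [sum(xi * row[j] for xi, row in zip(x, weights)) + bias[j]
            for j in range(len(bias))]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(x):
    m = max(x)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def predict(img_feat, txt_feat, params):
    """Concatenate features, reduce them, then apply two dense layers."""
    fused = img_feat + txt_feat  # assumption: fusion by concatenation
    reduced = project(fused, params["mean"], params["lda_w"])
    hidden = relu(dense(reduced, params["w1"], params["b1"]))
    return softmax(dense(hidden, params["w2"], params["b2"]))


rng = random.Random(0)
# Toy sizes; real ViT and SBERT embeddings have hundreds of dimensions.
# LDA yields at most (number of classes - 1) components, hence LDA_DIM = 2.
IMG_DIM, TXT_DIM, LDA_DIM, HIDDEN, N_CLASSES = 6, 4, 2, 5, len(LABELS)
params = {
    "mean": [0.5] * (IMG_DIM + TXT_DIM),
    "lda_w": random_matrix(IMG_DIM + TXT_DIM, LDA_DIM, rng),
    "w1": random_matrix(LDA_DIM, HIDDEN, rng),
    "b1": [0.0] * HIDDEN,
    "w2": random_matrix(HIDDEN, N_CLASSES, rng),
    "b2": [0.0] * N_CLASSES,
}

# Hypothetical post: a meme whose embedded text was already extracted.
text = merge_text("monday again", "THIS IS FINE", is_meme=True)
img_feat = [rng.random() for _ in range(IMG_DIM)]  # stand-in for ViT output
txt_feat = [rng.random() for _ in range(TXT_DIM)]  # stand-in for SBERT output
probs = predict(img_feat, txt_feat, params)
label = LABELS[probs.index(max(probs))]
print(text, probs, label)
```

In a real system the stand-in vectors would come from pretrained ViT and SBERT models, and the projection matrix and layer weights would be learned from labeled training data.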