Transformer-based Meme-sensitive Cross-modal Sentiment Analysis Using Visual-Textual Data in Social Media
Subject Areas: Neural networks and deep learning
Zahra Pakdaman 1, Abbas Koochari 2*, Arash Sharifi 3
1 - Department of Computer Engineering, Islamic Azad University, Science and Research Branch, Tehran, Iran
2 - Department of Computer Engineering, Islamic Azad University, Science and Research Branch, Tehran, Iran
3 - Department of Computer Engineering, Islamic Azad University, Science and Research Branch, Tehran, Iran
Keywords: Visual Sentiment Analysis, Textual Sentiment Analysis, Vision Transformer, LDA, SBERT Bi-encoder
Abstract:
Analyzing the sentiment of social media data plays a crucial role in understanding users’ intentions, opinions, and behaviors. Given the extensive diversity of published content (i.e., images, text, audio, and video), leveraging this variety can significantly enhance the accuracy of sentiment analysis models. This study introduces a novel meme-sensitive cross-modal architecture designed to analyze users’ emotions by integrating visual and textual data. The proposed approach is distinguished by its ability to identify memes within image datasets, an essential step in recognizing context-rich and sentiment-driven visual content. The research methodology involves detecting memes and separating them from standard images. From memes, embedded text is extracted and combined with user-generated captions, forming a unified textual input. Advanced feature extraction techniques are then applied: a Vision Transformer (ViT) is employed to extract visual features, while an SBERT bi-encoder is used to obtain meaningful textual embeddings. To address the challenges posed by high-dimensional data, Linear Discriminant Analysis (LDA) is used to reduce feature dimensionality while preserving critical classification information. A carefully designed neural network, consisting of two fully connected layers, processes the fused feature vector to predict sentiment classes. Experimental evaluation demonstrates the effectiveness of the proposed method, achieving up to 90% accuracy on the MVSA-Single dataset and 80% accuracy on the MVSA-Multiple dataset. These results underscore the model’s ability to outperform existing state-of-the-art approaches in cross-modal sentiment analysis. This study highlights the importance of integrating meme recognition and multi-modal feature extraction for improving sentiment analysis, paving the way for future research in this domain.
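To make the described pipeline concrete, the sketch below assembles its main stages (ViT visual features, SBERT bi-encoder text embeddings, LDA reduction, and a two-layer classification head) from off-the-shelf tools. It is a minimal illustration rather than the authors' implementation: the checkpoints (google/vit-base-patch16-224, all-MiniLM-L6-v2), the three-class label set, the hidden size, and applying LDA to the concatenated visual-textual vector are all assumptions made for the example.

# Minimal sketch of the cross-modal pipeline; configuration details are assumed,
# not taken from the paper.
import numpy as np
import torch
import torch.nn as nn
from PIL import Image
from transformers import ViTImageProcessor, ViTModel
from sentence_transformers import SentenceTransformer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

vit_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
vit = ViTModel.from_pretrained("google/vit-base-patch16-224").eval()
sbert = SentenceTransformer("all-MiniLM-L6-v2")  # assumed SBERT bi-encoder checkpoint

@torch.no_grad()
def visual_features(image_paths):
    # One 768-d [CLS] embedding per image from the ViT encoder.
    feats = []
    for path in image_paths:
        pixels = vit_processor(images=Image.open(path).convert("RGB"), return_tensors="pt")
        feats.append(vit(**pixels).last_hidden_state[:, 0].squeeze(0).numpy())
    return np.stack(feats)

def textual_features(texts):
    # Sentence embeddings of the unified text (caption plus any text extracted from a meme).
    return sbert.encode(texts, convert_to_numpy=True)

def build_training_features(image_paths, texts, labels):
    # Fuse the two modalities by concatenation, then reduce dimensionality with supervised LDA.
    fused = np.concatenate([visual_features(image_paths), textual_features(texts)], axis=1)
    lda = LinearDiscriminantAnalysis(n_components=2)  # at most (num_classes - 1) components
    reduced = lda.fit_transform(fused, labels)
    return torch.tensor(reduced, dtype=torch.float32), lda

class SentimentHead(nn.Module):
    # Two fully connected layers over the fused, LDA-reduced feature vector.
    def __init__(self, in_dim=2, hidden_dim=64, num_classes=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, x):
        return self.net(x)

Note that supervised LDA caps the number of retained components at one less than the number of classes, so a three-class sentiment setup (positive, neutral, negative) yields at most a two-dimensional fused representation before the classification head; the paper's exact reduction strategy may differ.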