A Joint Learning Approach Based on Attention-Based Neural Networks for Semantic Textual Similarity

Subject areas: Multimedia processing, communication systems, intelligent systems

Ebrahim Ganjalipour ¹, Amir Hossein Refahi Sheikhani ²*, Sohrab Kordrostami ³, Ali Asghar Hosseinzadeh ⁴

1. PhD student, Faculty of Applied Mathematics and Computer Science, Lahijan Branch, Islamic Azad University, Lahijan, Iran
2. Associate Professor, Faculty of Applied Mathematics and Computer Science, Lahijan Branch, Islamic Azad University, Lahijan, Iran
3. Professor, Faculty of Applied Mathematics and Computer Science, Lahijan Branch, Islamic Azad University, Lahijan, Iran
4. Assistant Professor, Faculty of Applied Mathematics and Computer Science, Lahijan Branch, Islamic Azad University, Lahijan, Iran

Received: 1402/08/10  Accepted: 1402/09/20  Published: 1402/10/01

Keywords: Natural language processing, semantic textual similarity, attention-based neural networks, Transformer, pointwise mutual information

Abstract:

Semantic textual similarity (STS) is a challenging task in languages with limited digital resources; the main difficulties stem from the scarcity of labeled training sets and the problems of training effective models. Here we introduce a joint learning approach that uses an improved self-attention model to address the STS challenge in Subject-Object-Verb (SOV) and Subject-Verb-Object (SVO) language structures. We first build a comprehensive multilingual dataset with parallel data for SOV and SVO languages, ensuring broad linguistic diversity. We then introduce an improved self-attention model with a new weighted relative positional encoding, enriched by injecting co-occurrence information through pointwise mutual information (PMI) factors. In addition, we use a joint learning framework that exploits samples shared between languages to improve cross-lingual STS. By training simultaneously on several language pairs, our model acquires the ability to transfer knowledge and effectively bridges languages with different SOV and SVO structures. The proposed model was evaluated on the Persian-English and Persian-Persian STS-Benchmark datasets and achieved Pearson correlation coefficients of 88.29% and 91.65%, respectively. The experiments show that the proposed model outperforms other models. An ablation study also shows that our system converges faster and is less prone to overfitting.
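The abstract names pointwise mutual information (PMI) as the source of co-occurrence information but does not say how it is estimated. The sketch below shows one common way to compute it, assuming sentence-level co-occurrence counts and PMI(a, b) = log2(p(a, b) / (p(a) p(b))). The function name `pmi_table` and the toy corpus are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_table(sentences):
    """Return {(w1, w2): PMI} for word pairs that co-occur in a sentence.

    Assumption: probabilities are estimated as the fraction of sentences
    containing a word (or both words of a pair). Pair keys are sorted
    alphabetically, so look up ("cat", "the"), not ("the", "cat").
    """
    word_counts = Counter()
    pair_counts = Counter()
    for tokens in sentences:
        unique = sorted(set(tokens))
        word_counts.update(unique)
        pair_counts.update(combinations(unique, 2))

    n = len(sentences)
    table = {}
    for (a, b), joint in pair_counts.items():
        p_ab = joint / n
        p_a = word_counts[a] / n
        p_b = word_counts[b] / n
        table[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return table

# Toy corpus, invented for illustration only.
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
    ["a", "dog", "ran"],
]
pmi = pmi_table(corpus)
```

Pairs that co-occur more often than chance get a positive score, and pairs that co-occur less often get a negative one. Pairs that never co-occur are simply absent from the table.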

English abstract:

Introduction: Semantic Textual Similarity (STS) across languages is a pivotal challenge in natural language processing, with applications ranging from plagiarism detection to machine translation. Despite significant strides in STS, it remains a formidable task in languages with distinct syntactic structures and limited digital resources. Linguistic diversity, especially in word order variation, poses unique challenges, exemplified by languages adhering to Subject-Object-Verb (SOV) or Subject-Verb-Object (SVO) patterns, compounded by complexities like pronoun-dropping. This paper addresses the intricate task of measuring STS in Persian, characterized by SOV word order and distinctive linguistic features.

Method: We propose a novel joint learning approach, harnessing an enhanced self-attention model, to tackle the STS challenge in both SOV and SVO language structures. Our methodology involves establishing a comprehensive multilingual corpus with parallel data for SOV and SVO languages, ensuring a diverse representation of linguistic structures. An improved self-attention model is introduced, featuring weighted relative positional encoding and enriched context representations infused with co-occurrence information through pointwise mutual information (PMI) factors. A joint learning framework leverages shared representations across languages, facilitating effective knowledge transfer and bridging the linguistic gap between SOV and SVO languages.
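The abstract does not give the exact form of the weighted relative positional encoding or of the PMI factors. As a rough illustration of the general idea only, the sketch below adds two bias terms to ordinary scaled dot-product attention logits: a penalty proportional to token distance and a bonus proportional to a PMI score. The additive form, the identity query/key/value projections, and the names `biased_self_attention`, `pos_weight`, and `pmi_weight` are all assumptions made for this example, not the authors' formulation.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def biased_self_attention(x, pmi_bias, pos_weight=0.1, pmi_weight=0.5):
    """Single-head self-attention with additive position and PMI biases.

    Assumed (hypothetical) logit for tokens i and j:
        x_i . x_j / sqrt(d) - pos_weight * |i - j| + pmi_weight * pmi_bias[i][j]
    Queries, keys, and values are the inputs themselves (no learned projections).
    """
    n, d = len(x), len(x[0])
    weights = []
    for i in range(n):
        logits = []
        for j in range(n):
            content = sum(a * b for a, b in zip(x[i], x[j])) / math.sqrt(d)
            position = -pos_weight * abs(i - j)
            cooccurrence = pmi_weight * pmi_bias[i][j]
            logits.append(content + position + cooccurrence)
        weights.append(softmax(logits))
    # Each output vector is the attention-weighted average of the inputs.
    output = [
        [sum(weights[i][j] * x[j][k] for j in range(n)) for k in range(d)]
        for i in range(n)
    ]
    return weights, output

# Toy token vectors and a toy PMI matrix, invented for illustration.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
no_pmi = [[0.0] * 3 for _ in range(3)]
with_pmi = [row[:] for row in no_pmi]
with_pmi[0][2] = with_pmi[2][0] = 2.0

plain_weights, _ = biased_self_attention(tokens, no_pmi)
pmi_weights, pmi_output = biased_self_attention(tokens, with_pmi)
```

In this toy run, raising the PMI score between tokens 0 and 2 shifts attention mass toward that pair, which is the kind of effect a co-occurrence bias is meant to have.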

Results: Our model, trained simultaneously on Persian-English and Persian-Persian language pairs, extracts informative features while explicitly accounting for differences in word order and pronoun-dropping. During training, each batch is sampled from the STS Benchmark, which pairs English texts with their Persian translations, and is fed into the customized encoder to obtain the attention matrix and output embeddings. The similarity module then predicts the STS score, which is used to compute the mean squared error (MSE) loss. Evaluation on the Persian-English and Persian-Persian STS-Benchmarks yields Pearson correlation coefficients of 89.51% and 92.47%, respectively. Comparative experiments show that the model outperforms existing models, supporting the effectiveness of the proposed approach.
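The scoring and evaluation steps described above can be sketched in a few lines. The abstract does not specify the similarity module, so cosine similarity between two sentence embeddings is assumed here. The embeddings and gold scores are toy values invented for the example, with gold scores assumed to be rescaled to the range 0 to 1.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors (assumed similarity module)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def mse(predicted, gold):
    """Mean squared error between predicted and gold similarity scores."""
    return sum((p - g) ** 2 for p, g in zip(predicted, gold)) / len(predicted)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

# Toy sentence-pair embeddings and gold scores, invented for illustration.
pairs = [
    ([1.0, 0.0], [1.0, 0.1]),
    ([1.0, 1.0], [1.0, 0.0]),
    ([0.0, 1.0], [1.0, 0.0]),
]
gold = [1.0, 0.6, 0.0]

predicted = [cosine(u, v) for u, v in pairs]
loss = mse(predicted, gold)   # training objective
r = pearson(predicted, gold)  # evaluation metric
```

MSE penalizes the absolute gap between predicted and gold scores, while the Pearson coefficient only measures how well the two vary together linearly, so a model can score well on one and less well on the other.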

Discussion: The ablation study further substantiates the robustness of our system, showcasing faster convergence and reduced susceptibility to overfitting. The results underscore the significance of our enhanced model in addressing the complexities of measuring semantic similarity in languages with diverse linguistic structures and limited digital resources. The approach not only advances cross-lingual STS capabilities but also provides insights into handling syntactic variations, such as SOV and SVO word orders, and pronoun-dropping. This research opens avenues for future investigations into enhancing STS in languages with unique structural characteristics.


