Enhanced Self-Attention Model for Cross-Lingual Semantic Textual Similarity in SOV and SVO Languages: Persian and English Case Study
Subject Areas : Journal of Computer & Robotics
Ebrahim Ganjalipour 1 , Amir Hossein Refahi Sheikhani 2 * , Sohrab Kordrostami 3 , Ali Asghar Hosseinzadeh 4
1 - Department of Applied Mathematics and Computer Science, Lahijan Branch, Islamic Azad University, Lahijan, Iran
2 - Department of Applied Mathematics and Computer Science, Lahijan Branch, Islamic Azad University, Lahijan, Iran
3 - Department of Applied Mathematics and Computer Science, Lahijan Branch, Islamic Azad University, Lahijan, Iran
4 - Department of Applied Mathematics and Computer Science, Lahijan Branch, Islamic Azad University, Lahijan, Iran
Keywords: Transformer, Semantic Textual Similarity, English-Persian Semantic Similarity, SOV Word Order Language, Pointwise Mutual Information
Abstract :
Semantic Textual Similarity (STS) is a subfield of natural language processing that has attracted extensive research attention in recent years. Measuring the semantic similarity between words, phrases, sentences, paragraphs, and documents plays a significant role in natural language processing and computational linguistics, with applications in plagiarism detection, machine translation, information retrieval, and similar areas. STS aims to develop computational methods that capture nuanced degrees of resemblance in meaning between such text units, which is a challenging task for low-resource languages. Word order variation is one of the most important aspects of linguistic diversity: some languages follow Subject-Object-Verb (SOV) word order, while others follow Subject-Verb-Object (SVO) order. These structural differences, compounded by factors such as pronoun-dropping, make measuring cross-lingual STS especially difficult for a pronoun-dropping SOV language such as Persian. For Persian, a low-resource language, this study proposes a model customized to its linguistic properties. Leveraging the pronoun-dropping and SOV word order of Persian, we introduce a novel weighted relative positional encoding integrated into the self-attention mechanism. Moreover, we enrich context representations by infusing co-occurrence information through pointwise mutual information (PMI) factors. The resulting cross-lingual model analyzes semantic similarity between Persian and English texts using parallel corpora. Experiments show that the proposed model outperforms the other models it was compared with. An ablation study also shows that our system converges faster and is less prone to overfitting. Evaluated on the Persian-Persian (monolingual) and Persian-English (cross-lingual) STS-Benchmarks, the proposed model achieved Pearson correlation coefficients of 88.29% and 91.65%, respectively.
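The abstract does not specify how the PMI factors are computed or how they enter the attention mechanism. As a minimal sketch of the underlying quantity only, the following Python snippet computes the standard pointwise mutual information, PMI(x, y) = log2( p(x, y) / (p(x) p(y)) ), with probabilities estimated from sentence-level co-occurrence. The function name `pmi_table`, the sentence-level counting scheme, and the toy corpus are illustrative assumptions and are not taken from the paper.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_table(sentences):
    """Return sentence-level PMI for every unordered word pair that co-occurs.

    Assumption (not from the paper): p(x) is the fraction of sentences
    containing x, and p(x, y) is the fraction containing both x and y.
    """
    word_counts = Counter()
    pair_counts = Counter()
    for tokens in sentences:
        vocab = set(tokens)  # count each word at most once per sentence
        word_counts.update(vocab)
        pair_counts.update(frozenset(p) for p in combinations(sorted(vocab), 2))

    n = len(sentences)
    pmi = {}
    for pair, c_xy in pair_counts.items():
        x, y = tuple(pair)
        p_xy = c_xy / n
        p_x = word_counts[x] / n
        p_y = word_counts[y] / n
        pmi[pair] = math.log2(p_xy / (p_x * p_y))
    return pmi


# Hypothetical toy corpus of tokenized sentences.
corpus = [["a", "b"], ["a", "b"], ["c", "d"], ["a", "c"]]
scores = pmi_table(corpus)
for pair, value in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(sorted(pair), round(value, 3))
```

Pairs that co-occur more often than their individual frequencies would predict receive positive scores, and pairs that co-occur less often receive negative scores. Pairs that never co-occur are simply absent from the table in this sketch.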