Enhanced Self-Attention Model for Cross-Lingual Semantic Textual Similarity in SOV and SVO Languages: Persian and English Case Study
Journal of Computer & Robotics
Ebrahim Ganjalipour 1, Amir Hossein Refahi Sheikhani 2, Sohrab Kordrostami 3, Ali Asghar Hosseinzadeh 4
1 - Department of Applied Mathematics and Computer Science, Lahijan Branch, Islamic Azad University, Lahijan, Iran
2 - Department of Applied Mathematics and Computer Science, Lahijan Branch, Islamic Azad University, Lahijan, Iran
3 - Department of Applied Mathematics and Computer Science, Lahijan Branch, Islamic Azad University, Lahijan, Iran
4 - Department of Applied Mathematics and Computer Science, Lahijan Branch, Islamic Azad University, Lahijan, Iran
Keywords: Transformer, Semantic Textual Similarity, English-Persian Semantic Similarity, SOV Word Order Language, Pointwise Mutual Information
Abstract:
Semantic Textual Similarity (STS), a subfield of natural language processing, has attracted extensive research attention in recent years. Measuring the semantic similarity between words, phrases, sentences, paragraphs, and documents underpins applications such as plagiarism detection, machine translation, and information retrieval. STS aims to develop computational methods that capture nuanced degrees of resemblance in meaning, a challenging task for languages with scarce digital resources. The task is further complicated in pronoun-dropping languages with Subject-Object-Verb (SOV) word order, such as Persian, because of their distinctive syntactic structures. Word order variation is one of the most important dimensions of linguistic diversity: some languages follow SOV order while others follow Subject-Verb-Object (SVO) patterns, and these structural disparities, compounded by pronoun-dropping, make cross-lingual STS between such languages exceptionally difficult. For the low-resource Persian language, this study proposes a model customized to its linguistic properties. Exploiting Persian's pronoun-dropping and SOV word order, we introduce a novel weighted relative positional encoding integrated into the self-attention mechanism, and we enrich context representations by infusing co-occurrence information through pointwise mutual information (PMI) factors. Building on these components, we present a cross-lingual model for semantic similarity analysis between Persian and English texts trained on parallel corpora. Experiments show that the proposed model outperforms competing models, and an ablation study shows that it converges faster and is less prone to overfitting. Evaluated on Persian-Persian and Persian-English STS benchmarks, the model achieves Pearson correlation coefficients of 88.29% on the monolingual and 91.65% on the cross-lingual STS-B, respectively.
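To make the two components named in the abstract concrete, the sketch below shows single-head self-attention whose logits receive two additive biases: a relative-position term and a PMI term computed from co-occurrence counts. This is a minimal illustration under stated assumptions, not the paper's implementation: the abstract does not specify the weighted relative positional encoding scheme, so a simple distance-decay bias stands in for it, and the function names (pmi_matrix, attention_with_biases), the additive-bias formulation, and the scalars alpha and beta are assumptions made for the example.

```python
import numpy as np

def pmi_matrix(cooc, eps=1e-8):
    """Positive PMI from a word-pair co-occurrence count matrix.

    PMI(i, j) = log( p(i, j) / (p(i) * p(j)) ), clipped at 0 (PPMI).
    """
    total = cooc.sum()
    p_ij = cooc / total                       # joint probabilities
    p_i = p_ij.sum(axis=1, keepdims=True)     # row marginals
    p_j = p_ij.sum(axis=0, keepdims=True)     # column marginals
    pmi = np.log((p_ij + eps) / (p_i * p_j + eps))
    return np.maximum(pmi, 0.0)

def attention_with_biases(X, Wq, Wk, Wv, rel_bias, pmi_bias, alpha=1.0, beta=1.0):
    """Single-head self-attention with additive relative-position and PMI biases.

    rel_bias[i, j] encodes the relative distance between positions i and j;
    pmi_bias[i, j] is the PMI score for the token pair at those positions.
    Both are added to the scaled dot-product logits before the softmax.
    alpha and beta are illustrative mixing scalars, not from the paper.
    """
    d = Wq.shape[1]
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    logits = (Q @ K.T) / np.sqrt(d) + alpha * rel_bias + beta * pmi_bias
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d_model, d_head = 6, 16, 8
    X = rng.normal(size=(n, d_model))
    Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))

    # Distance-decay stand-in for the paper's weighted relative positional
    # encoding (hypothetical decay rate 0.1).
    pos = np.arange(n)
    rel_bias = -0.1 * np.abs(pos[None, :] - pos[:, None])

    # Symmetric toy co-occurrence counts standing in for corpus statistics.
    counts = rng.integers(1, 20, size=(n, n))
    pmi_bias = pmi_matrix((counts + counts.T).astype(float))

    out = attention_with_biases(X, Wq, Wk, Wv, rel_bias, pmi_bias)
    print(out.shape)  # (6, 8)
```

In this formulation, increasing beta pushes attention toward token pairs that co-occur more often than chance, while the relative-position term lets attention reflect word-order regularities such as SOV patterns; how the paper actually weights relative positions for SOV structure goes beyond what the abstract states.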