A Review of Anonymity Algorithms in Big Data

shamsinejad, Elham; Pedram, Mir Mohsen; Rahamni, Amir Masoud; BaniRostam, Touraj

Manuscript ID : JACET-2110-1483 (R1) Visit : 306 Page: 187 - 196

Article Type: Original Research

A Review of Anonymity Algorithms in Big Data

Subject Areas : Network Security

Elham shamsinejad ¹ , Mir Mohsen Pedram ² , Amir Masoud Rahamni ³ , Touraj BaniRostam ^{4
*}

1 - IAUCTB
2 - Associate Professor of Electrical and Computer Engineering Department, Faculty of Engineering, Kharazmi University
3 - Department of Computer Engineering, Science and Research Branch, Islamic Azad University, Tehran, Iran
4 - IAUCTB

Received: 2021-10-02 Accepted : 2022-03-09 Published : 2021-08-01

Keywords: Confidentiality Disclosure, Big Data, Anonymity,

Abstract :

By increasing access to high amounts of data through internet-based technologies such as social networks and mobile phones and electronic devices, many companies have considered the issues of accessing large, random and fast data along with maintaining data confidentiality. Therefore, confidentiality concerns and protection of specific data disclosure are one of the most challenging topics. In this paper, a variety of data anonymity methods, anonymity operators, the attacks that can endanger data anonymity and lead to the disclosure of sensitive data in the big data have been investigated. Also, different aspects of big data such as data sources, content format, data preparation, data processing and common data repositories will be discussed. Privacy attacks and contrastive techniques like k anonymity, neighborhood t and L diversity have been investigated and two main challenges to use k anonymity on big data will be identified, as well. Two main challenges to use k anonymity on big data will be identified. The first challenge of confidential attributes can also be as pseudo-identifier attributes, which increases the number of pseudo-identifier elements, and it may lead to the loss of great information to achieve k anonymity. The second challenge in big data is the unlimited number of data controllers are likely to lead to the disclosure of sensitive data through the independent publication of k anonymity. Then different anonymity algorithms will be presented and finally, the different parameters of time order and the consumable space of big data anonymity algorithms will be compared.

References:

Full-Text:

Preparation of Papers for IJCET

Journal of Advances in Computer Engineering and Technology

Received (Day Month Year)

Revised (Day Month Year)

Accepted (Day Month Year)

A Review of Anonymity Algorithms in Big Data

Abstract—By increasing access to high amounts of data through internet-based technologies such as social networks and mobile phones and electronic devices, many companies have considered the issues of accessing large, random and fast data along with maintaining data confidentiality. Therefore, confidentiality concerns and protection of specific data disclosure are one of the most challenging topics. In this paper, a variety of data anonymity methods, anonymity operators, the attacks that can endanger data anonymity and lead to the disclosure of sensitive data in the big data have been investigated. Also, different aspects of big data such as data sources, content format, data preparation, data processing and common data repositories will be discussed. Privacy attacks and contrastive techniques like k anonymity, neighborhood t and L diversity have been investigated and two main challenges to use k anonymity on big data will be identified, as well. Two main challenges to use k anonymity on big data will be identified. The first challenge of confidential attributes can also be as pseudo-identifier attributes, which increases the number of pseudo-identifier elements, and it may lead to the loss of great information to achieve k anonymity. The second challenge in big data is the unlimited number of data controllers are likely to lead to the disclosure of sensitive data through the independent publication of k anonymity. Then different anonymity algorithms will be presented and finally, the different parameters of time order and the consumable space of big data anonymity algorithms will be compared.

Index Terms—Big Data, Anonymity, Confidentiality Disclosure.

I. Introduction

ata is produced in various forms by different sources in the heterogeneous environment of the Internet. The volume of information stored and being produced is increasing every day and it has caused limitations in data storage and processing. Information sources, including social networks, information sensors, medical images and satellite imagery, produce structured, semi-structured and unstructured data continuously and uninterruptedly, with volumes exceeding several terabytes[1]-[3].

E.Shamsinejad is Ph.D. student in Department of Computer Engineering, Central Tehran Branch, Islamic Azad University, Tehran, Iran. (e-mail: elham.shamsinejad@yahoo.com).

T.Banirostam is Assistant Professor of Computer Engineering Department, Islamic Azad University, Central Tehran Branch, Tehran, Iran. (e-mail: banirostam@iauctb.ac.ir).

M.M.Pedram is Associate Professor of Electrical and Computer Engineering Department, Faculty of Engineering, Kharazmi University, Tehran, Iran. (e-mail: pedram@khu.ac.ir).

A.M.Rahmani is Full Professor of Computer Engineering Department, Science and Research Branch, Islamic Azad University, Tehran, Iran. (e-mail: rahmani@srbiau.ac.ir).

Collecting, maintaining and processing data with the fore-listed attributes in the world required the use of supercomputers and large data society that was not usable and cost-effective for many companies to provide such infrastructure [4]. This huge amount of data was not managed by relational database management systems. Analyzing and processing a large amount of data can provide valuable information to business owners that adds to their competitive advantage and in medical sciences can help to discover and diagnose diseases [5]. The collection of digital information about governments, organizations, and individuals has provided many opportunities for knowledge-based decision-making. Considering the benefits of data interchange between different people and the existence of laws requiring the propagation of some different data, data interchanges are made between different individuals and parties. Also, by analyzing such data in social networks, the behavioral pattern of different users can be detected and accordingly it can offer some recommendations to the users, suitable for their taste. With the importance of finding this issue and expanding the use of this technology, along with the many benefits that come with the operation of the big data, there has also been a ground for the abuse of attackers and created serious security threats [6], [58].

Data security is discussed in any space. In big data, because sensitive data is distributed among different computational resources, unauthorized access is provided in a simpler way than centralized data structures. The expansion of distributed computing infrastructure, as well as the wide range of mobile devices, has raised concerns about the processing and sharing of users' personal and sensitive data. In order to maintain data confidentiality, various mechanisms such as encryption, access control, audit have been considered in this framework [7]. These mechanisms, despite their proper performance, face many questions about the big data and the distributed computing space [8]. For example, all data in big data is not limited to a data store and is usually distributed between multiple data warehouses and applications. The data is converted to encrypted form, many applications may not work with them. In addition, data holders and data users are usually separate from each other, so it is not effectively possible to use encryption and access control methods. Here, using anonymization methods can be useful[9]. Violation of people's privacy is one of the most important threats arising from the production, collection of data from various sources, and their aggregation and analysis.

The significant increase in data volume, their production rate and also the high variety in the data structure have caused traditional methods of data processing and management to be unable to have efficiency. Therefore, companies have taken steps to manage and process their data to use a new technology known as big data. Data stream is one of the most important types of data that their exploration discovers hidden patterns and provides valuable information to different sciences[7]. Besides these advantages, because of the aggregation of data from different sources and exploring this data, protecting people's privacy and keeping corporate business secrets reaches particular importance. Various researches have been conducted to solve this problem[9]. Each of these methods has weaknesses that make it impossible or non-optimal to use them for anonymity of a big data stream. To prevent the disclosure of personal information, individual personal IDs, such as national code, insurance numbers, and other attributes that distinguish one person directly from others, are removed from several relevant person tuples to publication [10], [11], [59], [60]. Sometimes, despite deleting these identifiers, the attackers access people's personal information using publicly available databases. For example, Netflix is the largest online movie server with millions of customers. The company tries to offer the best film based on customer interest by rating videos. This company was a sponsor for a competition in which the team that offered the best engine for the film would win the prize. During the competition, a particular person can be identified by collecting data on suggested films and public tables related to personal information of individuals [12], [13]. AOL also published a data set of phrases searched by customers, leading to violation of customers' privacy. To solve this problem, many researches have been conducted to preserve the anonymity of individuals with the least changes in the data set. In this context, method such as k-anonymity is presented. These methods change the data, similar to at least k-1 of the other person in the data set, the way they prevent a particular person from identifying a particular person in the data set [14].

In some applications, a large amount of data applies to the system in a stream of data that requires anonymity in real-time. For example, wireless sensor networks were used in smart homes as the elderly population grew and needed to be taken care of at home. These sensors are used to monitor the condition of patients and the elderly [15]. Sensors transfer patients' information to medical centers, and this information should be anonymous in real-time. If users' information is disclosed, they may be abused by individuals or advertising companies. It has been shown in [16] that using existing algorithms to anonymize these types of data streams is a difficult type of problem. Therefore, it seems necessary to provide methods to anonymize the big data stream[17], [18]. Various anonymity algorithms have been proposed by scientists and researchers, some of which are theoretically proven to provide acceptable privacy. However, like many other areas of information security, such as encryption and decryption, intrusion and detection, malware and anti-malware, the conflicts between attackers and defenses are endless, this conflict will never end, as well[19], [20].

In the rest of the article, sections are explained. In the second part, the subject literature will be presented. In the third part, solutions and researches related to the privacy of data streams and big data will be introduced and described. These solutions will be studied in five sections based on perturbation solutions, tree structure, artificial data addition, fuzzy and cluster methods. In the fourth part, different parameters of the previous algorithms will be compared and at the end of the fifth part, the conclusions and future works will be expressed.

II. Subject Literature

In the data collection phase, the trusted data publisher receives data streams from different sources. The data publisher anonymizes this data online before propagating it to the public or data mining. In healthcare systems, where real-time decision-making is one of the essential requirements of the system, data anonymizer must act in such a way as not to have a negative impact on the real-time dissemination of data [18]. Typically, in privacy data publishing methods, although t tuples is considered as Equation (1):

t (Explicit identifier, Quasi identifier(QI), Sensitive attributes, Non-sensitive attributes) (1)

The Explicit identifier includes a set of attributes such as an employee code and a national code, which can specifically distinguish the owner of tuple from the others. Quasi identifier might identify tuple owners if linked to other data sets. Sensitive attributes represent attributes such as the type of disease and the amount of income that a person does not wish to disclose. Non-sensitive attributes include all attributes that do not fall into the previous three categories. Data may be processed as data streams or as pre-saved tables, in which anonymity on this data is defined as follows. Anonymity is a way in which people's identifier or values of sensitive attributes are tried to hide from other people's eyes [20].

The information that the attacker gets by accessing public databases, his background knowledge, and publishing a new database after anonymity should not be greater than when he does not have the new database. Of course, it is known [21] it is not possible to observe such a definition entirely, and the existence of the attacker's background knowledge is partly causing the propagating of information. In anonymity of data streams, linkage attack is introduced in three categories. Like the Netflix site, in such attacks current data streams because of linkage to existing databases can lead to a violation of people's privacy.

K-anonymity

K-Anonymity, formulated by Latanya Sweeney in 2002 [26], is proposed to guarantee that the protected target cannot be distinguished from k - 1 objects. k-anonymity [27] is a privacy model that requires each record to be indistinguishable from at least k-1 records within published data even an attacker knows the values of QID’s of the victim. It provides a solution for record linkage attack. In this method, the aim is to get the out of reach original identifier and the quasi identifier data is grouped in such a way that it cannot be extracted uniquely by combining the information of a particular item [27]. In the process of data anonymity, explicit identifiers are removed and quasi identifiers are generalized. In Table I [7], information about patients is observed that explicit identifier such as name, last name, and national code have been deleted. The aim is not to expose a person's illness as a sensitive identifier in a unique way. For this purpose, in Table II, the data are categorized and anonymized in four similar clusters.

In Table III, raw data is viewed with no deletions or changes. After removing the explicit identifiers, the result is seen in Table IV and after applying 2-anonymity algorithm, a table similar to Table V, will be obtained.

Table I

Patients' Main Information

Sensitive	Non-Sensitive
Condition	Nationality	Age	Zip Code
Heart Disease	Russian	28	13053	1
Heart Disease	American	29	13068	2
Viral infection	Japanese	21	13068	3
Viral infection	American	23	13053	4
Cancer	Indian	50	14853	5
Heart Disease	Russian	55	14853	6
Viral infection	American	47	14850	7
Viral infection	American	49	14850	8
Cancer	American	31	13053	9
Cancer	Indian	37	13053	10
Cancer	Japanese	36	13068	11

Table II

4-anonymity Information of patients

Sensitive	Non-Sensitive
Condition	Nationality	Age	Zip Code
Heart Disease	*	<30	130**	1
Heart Disease	*	<30	130**	2
Viral infection	*	<30	130**	3
Viral infection	*	<30	130**	4
Cancer	*	≥40	1485*	5
Heart Disease	*	≥40	1485*	6
Viral infection	*	≥40	1485*	7
Viral infection	*	≥40	1485*	8
Cancer	*	3*	130**	9
Cancer	*	3*	130**	10
Cancer	*	3*	130**	11
Cancer	*	3*	130**	12

Table III

Sample data for privacy process

Diagnosis	Job	Gender	Ago	Sl.no.
Hepatitis	Dancer	F	21	1
Influenza	Singer	M	32	2
Malaria	Keyboardist	F	23	3
Malaria	Dancer	M	34	4
Hepatitis	Dancer	M	38	5
Influenza	Singer	F	27	6
Hepatitis	Keyboardist	F	29	7
Influenza	Music Director	M	39	8
Malaria	Engineer	M	42	9
Influenza	Doctor	M	46	10
Hepatitis	Lawyer	M	47	11
Malaria	Engineer	M	43	12

Table IV

Sample data after deleting explicit identifier

Diagnosis	Job	Gender	Ago	Sl.no.
Hepatitis	Artist	F	20-25	1
Malaria	Artist	F	20-25	2
Influenza	Artist	F	25-30	3
Hepatitis	Artist	F	25-30	4
Influenza	Artist	M	30-35	5
Malaria	Artist	M	30-35	6
Hepatitis	Artist	M	35-40	7
Influenza	Artist	M	35-40	8
Malaria	Professional	M	40-45	9
Malaria	Professional	M	40-45	10
Influenza	Professional	M	45-50	11
Hepatitis	Professional	M	45-50	12

Table V

2-anonymity

Diagnosis	Job	Gender	Age	Name	IP no.	Sl.no.
Hepatitis	Dancer	F	21	Ann	140010	1
Influenza	Singer	M	32	Emil	140011	2
Malaria	Keyboardist	F	23	Susanne	140012	3
Malaria	Dancer	M	34	David	140013	4
Hepatitis	Dancer	M	38	Jacob	140014	5
Influenza	Singer	F	27	Carolina	140015	6
Hepatitis	Keyboardist	F	29	Diana	140016	7
Influenza	Music Director	M	39	Nathaniel	140017	8
Malaria	Engineer	M	42	John	140018	9
Influenza	Doctor	M	46	Mathew	140019	10
Hepatitis	Lawyer	M	47	Paul	140020	11
Malaria	Engineer	M	43	Robert	140021	12

The natural solution for big data clustering is to develop existing clustering algorithms such as hierarchical clustering, k-means and fuzzy clustering so that they can overcome huge data volumes [28]-[30], [61].

Common clustering methods cannot be used in k-anonymity. Because of these methods, the need for each cluster to be at least k, is not considered a record. The k-anonymity problem can naturally become a clustering problem in which the goal is to find a set of clusters as an equivalence class, each of which contains at least k Records. In order to maximize the quality of data, records within each cluster should be more similar than records of other clusters.

L- Diversity

There are basically two indicators for violating the privacy of published information, an indicator of lack of diversity is a sensitive attribute in the table and the other is when the attacker has good background information about the person [63]. Positive disclosure and negative disclosure are two methods that lead to disclosure from the table. If the attacker can identify the person with a high probability, positive disclosure occurs, and if the person can correctly remove the values of sensitive attributes, negative disclosure has occurred. It can be seen 3-Diversity Table in Table VI.

Table VI

Table 3-Diversity

Diagnosis	Job	Gender	Age	Sl.no
Influenza	Artist	F	<30	1
Hepatitis	Artist	F	<30	2
Malaria	Artist	F	<30	3
Hepatitis	Artist	F	<30	4
Malaria	Professional	M	40≤	5
Malaria	Professional	M	40≤	6
Influenza	Professional	M	40≤	7
Hepatitis	Professional	M	40≤	8
Influenza	Artist	M	3*	9
Malaria	Artist	M	3*	10
Hepatitis	Artist	M	3*	11

The L-Diversity method complements k-anonymity in anonymizing of the network graph. The L-Diversity method rules that in each k-anonymity graph, each k of the same generalized record must have a different l-attribute number. The L-Diversity algorithm replaces the sensitive attribute with the number of l sensitive values. In order to run L-Diversity, changes to the network dataset must be at least log n. The attacker needs to have at least 1-l background knowledge so that he can guess the sensitive 1-l and obtain the exact sensitive attributes [25], [26].

The complexity of L-Diversity depends on the number of sensitive attributes, so by increasing the sensitive attributes, the complexity of L-Diversity will also increase. So that large data is needed to perform diversity and each sensitive attribute must be represented by other sensitive values. Increasing the l value plays an important role in reducing access to sensitive attributes using background knowledge, because in that case, the attacker needs more information to reach the sensitive attributes directly.

The results of k-anonymity algorithm show positive or negative information about the attacker's background knowledge. Positive disclosure is when the attacker directly obtains the sensitive attribute from the k-anonymity result. Negative disclosure occurs when the sensitive attribute is easily protected from attacker access. The information disclosed should not be more than the attacker's foreground knowledge [26].

The important role of L-Diversity is on sensitive attributes. This means that the L-Diversity algorithm is not implemented without sensitive attributes with any activity, unlike the k-anonymity algorithm works on non-sensitive attributes. Its sensitive attributes are those attributes whose values must hide in order to prevent the attacker from accessing users. The main difference between k-anonymity and L-Diversity algorithms is in clustering [27]. Clustering is just one step in k-anonymity, at which point records are sorted based on similarity of quasi identifier attributes. Anonymity degree is the main success criterion in the k-anonymity algorithm [29], [30].

Types of Attacks

The types of attacks are divided into three categories according to figure 1.

Fig. 1. Types of Attacks

1- Record Linkage Attack

In a linkage attack record, a small number of records are distinguished based on quasi-identifier values. These numbers of records make up a group. If the quasi identifier related to the victim is mapped to this group, the attacker can identify his victim with a high probability according to his background knowledge. To deal with these types of attacks, k-anonymity was the first model offered [22], [23]. Other models presented to contrast the record linkage attack are (x,y)-anonymity and multi-relational k-anonymity[66]. These models contrast a linkage attack record by hiding the victim's report in a group with the same QI; However, if most of the reports placed in a group with the same QI have the same value for the sensitive attributes, but without accurately identifying the victim's report, the amount of sensitive attributes (e.g. the type of disease) can be got. This mode is placed in the category of linkage attributes attacks [24].

2-Attribute Linkage attacks

In attribute linkage attacks, the attacker may not be able to accurately determine the victim's tuple, but by mapping the victim to a group of tuple-QI with the same QI and the same amount in sensitive attributes, it can get the amount of sensitive attributes of the victim with a high probability. The main idea for solving this problem is to eliminate the relationship between quasi-ID and sensitive adjective values. To solve this problem, the L-Diversity method is provided in [25]. In this method, in each group of QIs, the values of sensitive attributes should get at least l different values. In this model, if the value l is considered being k=l, k-anonymity is also guaranteed.

Other models presented to contrast attributes linkage attack are (x,y)-Privacy and (a,k)-anonymity [26] models that largely act like previous methods. If sensitive attributes are not properly distributed in the data set, the introduced models cannot contrast with the attribute linkage attack.

Suppose, in a data set, 95% of people have colds and 5% have AIDS. Now, if they have 50% AIDS and 50% cold in a QI group, a Diversity-2 condition is established. Here, the attacker can be informed of a particular person's AIDS with a 50% confidence. In the initial case, an attacker can guess a particular person's illness with a 5% confidence. T-Closeness method was presented to solve this problem in [26]. In this method, in each group of QIs, data must have a distribution close to the original data.

3-Table Linkage attack

In the attacks of the record linkage and the link of attributes, it is assumed that the attacker is aware of the existence of the victim in the published table. While sometimes the existence or absence of a person in the table can disclose sensitive information. For example, when a hospital publishes a table for AIDS patients, the knowledge of the existence or absence of a person on the table can be equal to exposing a sensitive attributes. In order to contrast this attack, a δ-present method was presented [25]. In this method, the probability of a person's presence in the published table must be limited between the two 𝛿 =(𝛿𝑚𝑖𝑛, 𝛿𝑚𝑎𝑥). This model implicitly deals with record linkage attacks and linkage attributes. In the attachment, the aggregate table compares the influential parameters of methods and algorithms. It may not imagine an end for this conflict.

Anonymity Operators

Basically, datasets do not meet privacy requirements without making changes before publication. For privacy, a sequence of anonymity operators such as generalization, suppression, permutation, anatomization and perturbation are required to apply to the dataset. In the figure 2 shows each of the anonymous operators and will be briefly described below [31]-[33].

Fig. 2. Anonymity Operators

1- Generalization and Suppression

Each generalization and suppression operator partially hides details of quasi identifier attributes. Generalization is required on quasi identifier attributes in order to anonymize the required values.

Generalization is called replacing attributes values with other values in their field [19]. In the generalization operator, the value of an attribute is extended to larger bases by the tree of classification, for example, in the address attribute, the value of Iran is replaced by the Middle East, and in the next step, the Middle East can be replaced with Asia. The generalization aims to reduce the accuracy of data in order to make confusion for unauthorized people's access to users' data. In the anonymity process, unauthorized extreme generalizations should be prevented because it reduces the value of data and even makes data useless. The problem with this method is the requirement to form a classification tree for each quasi identifier, which in most cases is done manually by an expert. In addition, there may be a difference between the classification trees, formed by different people, and it brings different results. Also, when a cellular generalization operator is used in a dataset, the set is no longer suitable for classification [24]. For example, in the previous example, if the value of the address in some records is converted from Iran to the Middle East, the range of values for the address attribute will be changed and the result of applying the classification on the new data may vary. Some classification algorithms consider the values of Iran and the Middle East separately, while the Middle East may be the same as the generalized value of Iran. This problem is called data discovery phenomenon [25].

K-anonymity records are at risk by two types of background knowledge attacks and data homogeneity. Suppose that in a cluster with record K = 4, all quasi identifier attributes are identical by generalization and there are two hobbies of smoking and painting in sensitive attributes. This is an example of background knowledge, If Alice knows Bob is in this cluster based on quasi identifier attributes (data homogeneity) and Alice has seen Bob when he's buying cigarettes (background knowledge), he can directly get all the information about Bob [12].

Usually, when real values are replaced with generalized values, the k-anonymity prerequisite (requiring each vertex to exist the same vertex for k-anonymity) is implemented on social networks. Depending on the domain, there are different ways to generalize values in it. For example, the ZIP Code 47907 can be generalized to *4790. It should be noted that we only consider quasi identifiers from a table as the generalization goal; therefore, sensitive attributes cannot be changed or deleted in the published table [1], [12], [34], [57]. For instance, in figure 3, in the business classification tree, father's professional node is more general than the engineer and lawyer child nodes, and this replacement can be made for generalization. Also, the root node of ANY-JOB is the most common node in the classification tree related to jobs [34].

Fig. 3. Business Generalization Tree

For numerical attributes, any exact value can be replaced by a numeric interval covering that value. An example of the generalization tree for numerical data (age-based) is displayed in figure 4.

Fig. 4. Age Generalization Tree

In the generalization method, a value is replaced with the value of the father node in the classification tree. In the suppression method, each value is replaced with a specific value or token, stating that the replaced value should not be disclosed. Using the suppression operator should be done carefully, as the large number of quasi identifiers with missing values reduces the quality of the dataset [12].

2- Repression

Multi-repression is another important technique often used for k-anonymity. Repressing a record means that the entire record will be removed from the table. It is important to note that although record repression reduces the number of records in the table (thus resulting in relatively high data loss), but this method can often increase the overall data quality [35]. Suppose, for example, that we want to make table 'n' to k-anonymity to the quasi identifier. To explain, suppose that the n=k+1 record with the same quasi identifier exists a remote point compared to Q. Without repressing the record, all n records are forced to generalize until the k-anonymity prerequisite of the table is met. But if the remote Q point is removed from the table, no record needs to be changed, therefore, a much less generalized k-anonymity table can be achieved by eliminating a limited number of remote points [36].

A comprehensive search for an optimal answer to the k member clustering problem, like many clustering problems, is exponential. Clustering solution is to sort records based on similarity of quasi identifier attributes. For example, when there are two sensitive attributes, diversity should be applied to both of them, which increases the lost information. The innovative solution is to combine and unite two sensitive values [37], [38].

The next step is to group the combined rows. The idea is to group rows based on the composition of sensitive values, and then select the number of L rows with differently combined sensitive values from each group to establish the L-Diversity condition. In this step, the clustering begins and the rows will be sorted again based on the similarity of the value of the quasi identifier attributes [39].

In the next step, in order to reduce the lost data, the rows are placed in a cluster with each of the rows that are most similar. For example, if the program requires anonymity of k=10 and a variation of L=5, First, sensitive attributes are divided into groups, so that the number of sensitive attributes in each group is equal to L=5. Now we select 5 rows of the first group and put it in a cluster in such a way to achieve the maximum similarity of quasi identifiers in this cluster. To make a k-anonymity condition, the cluster needs 5 more rows. To supply these 5 rows, we select one row from each group and add them to the cluster. We create all the clusters the same way. Finally, if the remaining record number was less than the number of k, we would add each record to the cluster that had the most similarity to its quasi identifier attributes [40]. Many clustering algorithms and applications require the concept of cluster center. The center of the cluster is usually a point of the domain of the dataset that represents the cluster. When the domain is a vector space of the actual number of coordinates, the most natural choice for the center of the cluster is the average vector of all points in the cluster [15].

3- Permutation and anatomization

Contrary to generalization and suppression, the permutation method does not change the quasi value of identifiers or sensitive attributes. This method tries to eliminate the connection between the two. More precisely, this method also emits quasi identifier attributes in a table and sensitive attributes in another table. Both tables have the same group ID attribute, and all recorders in the same group have the same group ID value in both tables [9], [19].

4- Perturbation

The main idea in the perturbation method is to add synthetic data to the original ones, so that the statistical information obtained from the new data is not much different from the statistical attributes of the original data. This method has two main problems compared to the introduced methods. When data is changed in this way, they become meaningless to humans and become practically useless in many data mining applications. Another problem with this method is the high overhead of synthetic data, which cannot be tolerated in big data applications that deal with large amounts of data themselves. In the following, some methods of data perturbation are briefly described [41].

A- Additive Noise

Additive noise methods are often used for numerical data such as salary. The main idea of this solution is to add the random value "r" extracted from a specific distribution to the sensitive attribute "s" and finally to publish r+s [42], [43].

B- Data Swapping

The main idea of the data swapping method is to anonymize the data set by moving the values of sensitive attributes between different records. This swapping should be done in such a way that the statistical properties of the dataset remain unchanged [19].

C- Artificial Data Production

In artificial data production method, a statistical model is extracted from the data and then points are selected as examples of this model. To publish data, sampled points are published instead of original data [19].

Anonymity Algorithms

Considering the attack models and anonymity operators, various algorithms are created in data anonymity, figure 5 introduces these algorithms and will be described below.

Fig. 5. Anonymity Algorithms

1- Anonymity of data streams based on Perturbation

Many privacy algorithms use data deformation. For this purpose, they try to reduce the granularity level of data, which reduces the usefulness of data mining algorithms to some extent; therefore, there is always a Trade-off between privacy and loss of value for information. A set of algorithms was introduced for anonymity of data streams in [25], [64]. In these algorithms, data is mixed with a random noise extracted from a statistical distribution. Random noise selection should have the least effect on the statistical attributes of the original data. The two main categories of this strategy are reviewed below [65]- [68].

Additive Perturbation

In the additive perturbation method, a private dataset is considered as Equation (2).

𝐷 = 𝑑1, d2,…,dn (2)

And for each di ∈ D, random ri noise selected from well-known statistical distributions, such as uniform distribution and Gaussian distribution, is added to the data. Finally, the D' data set is provided to the data miners as Equation (3) [69].

𝐷' = 𝑑1+𝑟1, d2+𝑟2,…,+𝑟𝑛 (3)

In order to obtain the di value, data miners use di+ ri from an Expectation Maximization Algorithm. This randomization method applies to many data mining applications, such as classification and Association Rule mining.

Multiplicative Perturbation

One of the alternative methods proposed for additive perturbation is the multiplicative perturbation method [70]. In this method, two common solutions are derived from statistics. In the first method, all D (di) components are multiplied by a random number extracted from a Gaussian distribution with an average of μ (usually one) and a variance of 𝜎2.

In the second method, the D data set is first converted with a natural logarithm function, so that the converted components are as zi=ln (di). Then, each converted zi component is added to a random ri noise extracted from a distribution equal to zero and multivariate Gaussian μ. This Gaussian distribution is considered with an average of 𝜎2= 𝑐Σ𝑧. In this regard, 0<c<1 and Σ𝑧 are equal to the covariance of the converted components, i.e., zi. The data published for the data miners will be as Equation (4).

𝐷′ = exp (𝑧1+𝑟1), exp (𝑧2+𝑟2), exp (𝑧𝑛+𝑟𝑛) (4)

These multiplicative conversions maintain the mean and variance of the original data, but cannot maintain Euclidean distance and internal multiplication of the original data. A suitable solution is provided to solve this problem in [71]-[73].

The advantage of methods based on data privacy perturbation is in the process of data collection, so the amount of noise added to each record is independent of subsequent observations, but serious problems of these methods can be noted as follows:

- High noise levels make it very difficult to analyze anonymized data.

- These methods are only used to anonymize numerical data, and it is not possible to use them for other types of data.

2- Anonymity of Data Streams based on Tree Structure

Another category of anonymity algorithms for data streams is algorithms based on tree structure. In this context, algorithms such as SWAF SKY and KIDS [74] are presented that have almost the same structure. In the following, SWAF algorithm is introduced.

SWAF Anonymity Algorithm

In figure 6, the overall structure of SWAF algorithm is mentioned. An important part of this framework is a specialization tree composed of a set of generalized nodes. In the initial phase of this framework, the slider window is considered as a static data set and obtains the specialization tree by implementing a procedure.

Fig. 6. SWAF Algorithm

At the time new tuples appear on the system, the following steps are taken to update the slider window:

A. New tuple tnews, will be added to the window and a tuple old ones will be removed from the window. For tuple tnew generalizations, the specialization tree is searched to find the most specific g-generalization node for that tuple. This node must be chosen in such a way that none of its offspring is a generalization for tnew. Hence, these tuples are added to frequency set of the g node[75]-[76].

B. The tree changes in such a way that this new node is added to the tree. If the number of tuple reaches "k" in a node, tuples will be published at the specific time.

C. When old tuple is removed from the slider window, this tuple should also be removed from the frequency set of generalized nodes in which he was a member.

There are two solutions for nodes with published tuples: On the first way, the nodes whose tuples have been published will be removed, and the generalization tree will be updated. Here, the time complexity of this algorithm is O (|S|δ logδ) and its spatial complexity is O (δ). Despite the relatively low complexity of this algorithm, because of the removal of nodes, the data loss rate is very high, which cannot be used in practice.

In the second solution, nodes whose tuples have been published are kept for reuse, which reduces the data loss rate; but its temporal and spatial complexity is proportional to the S data stream rate and equals O (S2 logS). This level of complexity is very high and unacceptable for anonymity algorithms of big data streams.

3-Anonymity of Data Streams based on the Addition of Synthetic data

Despite the introduced methods, in this group of methods, in order to anonymize the stream of data, quasi identifier attributes remain unchanged. In this method, data privacy is maintained by adding synthetically generated data to the original data.

4-Delay-Free Anonymization Framework

In delay-free anonymization method [46], [62] Qi is considered as a tuple quasi identifier attributes in Q= {qi1, qi2,...,qii}. Si is also considered as a sensitive attribute in a tuple. Data stream is also referred to as a set of tuples as (Qi, si).

The main purpose of this method is to build an L-Diversity data stream from the original data stream in real-time. This method ensures that the probability of guessing sensitive attribute related to a particular person in the data stream is less than. In order to protect data privacy, the main idea of this method for anonymity is the use of Anatomy Technic in which sensitive attributes are separated from quasi identifier attributes. Each si is then converted to an L-Diversity set by adding several synthetic tuples.

Considering figure 7, suppose that tuples t (QIt, SIt) has reached the system. The DF framework produces and publishes a record QIt that represents t quasi identifier attributes, as well as tuples ST's that includes SIt and synthetically generated data. ST and QIT are linked by group ID linkage key. For a better understanding of this method, an example is given below[78]-[79].

Fig. 7. DF Executive Framework[65].

Suppose that the data stream (age, sex and diagnosis) is generated by a hospital and the data stream anonymity by DF must satisfy the 2-Diversity requirement. At the moment, t1 ((24, male), Diag. A) as of tuple appears on the system. DF produces tuples QIT (1, (24, male) in which 1 is a random number showing group ID. Also, diag. A sensitive attribute is converted to ST set i.e., two tuples (Diag. A, 1, 1) and (Diag. B, 1, 1), thus meeting the 2-Diversity requirement and the attacker cannot guess the amount of a sensitive attribute with a probability of over 50%.

The method of delay-free anonymization framework are summarized as follows:

A. Immediate release: DF facilitates tuple-by-tuple release with the guarantee of l-diversity. DF is characterized by no accumulation delay and a low computation cost, since it immediately releases the data streams and requires only simple operations. The immediate release is a big advantage for applications that use real-time data streams[80].

B. High-level data utility: To anonymized data streams, DF artificially generates-diverse SI values instead of generalizing QIs. Thus, DF does not generate the typical information loss regarding QIs. In addition, the counterfeits in the sensitive table are also minimized by late validation[81]-[83].

Despite the good efficiency of this method in terms of delay-free data publication, there are two major drawbacks in this method.

A. Adding synthetic data to the data causes, once using the data, the usefulness of the data to be reduced and the results of data mining from this data are inappropriate.

B. In environments that deal with big data streams, the volume of the main data is inherently high, so it is impossible to tolerate excessive overhead of synthetic data[84].

5-Anonymity of data streams based on fuzzy algorithms

Fuzzy Logic uses a method to protect the privacy of data streams [46]. In this method, the values of sensitive attributes in the data stream are converted to fuzzy values. This conversion protects the privacy of sensitive data in the data stream. The fuzzy values of sensitive attributes are added as a column to the structure of the recorders related to that data stream. For example, in the data stream of patients' information about a hospital and the disease related to each one, its fuzzy structure in the T-time window can be considered as Table VII. In this table, columns related to the sex and age of people who are non-sensitive are displayed as normal, and the column related to diseases is displayed in fuzzy format. Now to know each person's illness, it is needed to have a fuzzy matrix.

Table VII

Fuzzy Anonymity Data Sample [82]

Overall fuzzy value	Sensitive data			Non sensitive data
Overall fuzzy value	Diagnosis	Name of patient	Diseases name	Age	Gender	Pid
0.5	0.6	0.4	0.6	23	M	120
0.4	0.4	0.5	0.3	30	F	121
0.5	0.5	0.2	0.8	25	M	122

6- Anonymity of Clustering Data Streams

In data clustering anonymity approach, commom is placed in a cluster so that each cluster has at least a tuple k. Then these tuples are published using cluster generalization. In the following, the algorithms in this field are discussed.

Anonymity Algorithms

In the following, in figure 8 five anonymity algorithms are investigated.

Fig. 8. Anonymity Algorithms

1-CASTLE Anonymity Algorithm

CASTLE is the first data stream anonymity algorithm with both k-anonymity and L-Diversity properties [39] [45], [51]. The main mechanism of this algorithm for data anonymity is the use of the clustering process continuously.

In this algorithm, two sets of clusters are kept.

k-anonymity cluster

k-anonymity Non-clusters

Disadvantages of CASTLE Anonymity Algorithm

k-anonymity clusters reuse in order to publish data and clusters within the non-k-anonymity set, depending on the quantity of tuples reached the system, are merged or separated to create a new k-anonymity cluster and use for publishing tuples. By using the k-anonymity cluster set, CASTLE has been able to significantly reduce the data loss rate while maintaining data anonymity, though this algorithm has the following fundamental disadvantages[53].

- This algorithm does not limit the size of the k-anonymity cluster set, causing the size of this set to grow linearly with the main data stream that means for every tuple of the main data streams, it is checked whether there is a suitable cluster or not. Therefore, the overall complexity of this algorithm is O (|S|2), which is very high for data streams.

- In CASTLE Algorithm, the size of each cluster is not limited, consequently, the size of each cluster can be proportional to the size of the data stream | 𝑆 | grow. Since CASTLE selects clusters with the lowest increase in data loss rate, the largeness of clusters increases calculations and its temporal and location complexity from O (|S|2) order.

- In this algorithm, L-Diversity property is not strong enough. Castle's algorithm only examines that l there is a different value than the sensitive attribute in a cluster, but as studied in [45], besides l different values of sensitive attributes per cluster, the number of tuples with the same sensitive attribute value in each cluster should not be more. In this regard, | 𝐶 | Indicates size.

As mentioned above, CASTLE forms a few of very large size clusters in which all a tuple of the same data streams are placed, So intermittently these clusters need to be converted nto smaller clusters, which is time-consuming operations, and to prevent the over-enlargement of clusters, in [45], provide a simple solution called B-CASTLE. In this algorithm, α threshold limit is considered for cluster size. This threshold limit only allows tuples to be added to the cluster if the size of the cluster is less than α. Considering this threshold limit, the size of clusters in B-CASTLE is much more appropriate than the CASTLE algorithm; However, because of the unlimited size of the k-anonymity cluster set, the complexity of this algorithm is like CASTLE algorithm and O (|S|2).

2-FAANST Anonymity Algorithm

Despite Castle Algorithm, which publishes data continuously, FAANST Algorithm [46] publishes data in certain intervals that are as acceptable as the delay in the system. For this purpose, the algorithm has considered a buffer to keep the data before publication. For the publication of tuples, first, every tuples cluster with existing clusters is investigated in order to be published by generalizing the cluster if appropriate. The rest of the tuples that do not fit in any of the existing clusters is divided into different clusters, using clustering algorithms such as k-mean. Tuples clusters with tuple clusters greater than k are immediately published, and if the cluster has a small loss rate, it will be added to the existing cluster collection. The rest of the clusters remain for the next periods of emissions, as well. Because this method uses a clustering algorithm and cluster sizes are O (k), this algorithm works faster than castle method.

Disadvantages of FAANST Anonymity Algorithm

- In this algorithm, such as CASTLE, there is no limit to the set of clusters, so its temporal and spatial complexity is O (|S|2) which is not appropriate for data streams.

- This method does not support deductive data, and only numerical data have been investigated in this study.

3-FADS Anonymity Algorithm

FADS [46],[51] is a cluster-based algorithm, designed to anonymize data streams. This algorithm reads a maximum of δ tuple inputs and stores it in a buffer. Then, for each tuples k, a cluster is formed and keeps the formed clusters in one set for later use. The main difference between FAANST and CASTLE algorithms is considering a time limit (TKC). This time limit is allocated to each cluster at the time of formation, and when it passes, the cluster will be removed from the existing clusters.

This solution prevents the rapid growth of existing cluster sets and reduces the complexity of this algorithm compared to previous methods. The temporal complexity of this algorithm is O (| S|), that is proportional to the size of the original linear data stream. Also, the spatial complexity of the FADS algorithm is the order O (C) that C is considered a constant. Although the algorithm has a good efficiency for anonymity of data streams, but it also has disadvantages.

Disadvantages of FADS Anonymity Algorithm

- This algorithm does not check the time left for tuples in each round of the algorithm implementation, which can lead to the release of data after the permitted time, which increases the data loss rate and removes the system from real-time mode.

- This algorithm is designed and implemented in a centralized mode. Therefore, it is not appropriate for anonymity of big data streams.

4- TPTDS Algorithm

The studied algorithms are not applicable due to lack of efficiency for big data and computing environments such as cloud computing or Internet. In Two-Phase Top-Down Specialization (TPTDS) method, in order to anonymize the data, a top-down specialization algorithm is used based on the map program in the cloud platform [47]-[49]. For optimal use of parallel processing power in cloud computing, the specialization process is divided into two separate sections. In the first part, the original dataset is divided into smaller data sets. Then this dataset is anonymized in parallel and produces the middle results. In the second part, the middle results are integrated and processed in order to reach k-anonymity. Map tools have been used in both mentioned phases. A group of map operators is designed in the framework of Hadoop, which they handle data anonymity. By using this method, TPTDS algorithm has a considerable advantage over other tasks in this field in terms of scalability and efficiency.

Disadvantages of TPTDS Anonymity Algorithm

- The execution cost of the algorithm is high.

- In the initial phases, the response rate and anonymity management of the algorithm is low.

- Algorithm is slow.

5- FAST Algorithm

FAST [46], [51] is a cluster-based algorithm, like the FADS algorithm. It reads as tuple inputs as δ and selects the first tuple with k-1. It generalizes the formed cluster and compares its data loss with another set of tuples - K clusters that have kept the algorithm up to this point. If the amount of data loss is lower for the cluster itself, the first tuples are published by generalization along with the other k-1. Otherwise,

The first tuples will be sent with the best generalization observed, and the algorithm will be re-implemented for the other tuples k-1. Also, in order to maintain the system real-time, an expiration time parameter is calculated in the algorithm, and if this time is finished for a tuple, tuples will be published immediately by the suppression method.

Disadvantages of FAST Anonymity Algorithm

- It is noteworthy that considerable time is wasted in the algorithm phases, because of the use of KNN algorithm for categorizing tuples inputs and re-implementation of the algorithm for the returned tuples k-1.

- Data loss is increased for tuple expiring data by the use of generalization method and suppression.

According to the Table VIII, proposed algorithms such as SWAF, DF is not inherently appropriate and applicable in the big data domain and only acts on numerical data, which has a high data loss rate. However, good algorithms such as FAST and TPTDS can be implemented in the big data environment and have good conditions in terms of latency rate and data loss rate [54]-[56].

[1]

Table VIII

Comparison Table of Anonymity Algorithms

III. Related Works

In the following, ten related works have been introduced and reviewed.

In the following, we will introduce and review 10 related works that have recently been investigated and generally have used K-anonymity method to anonymized data and maintain data confidentiality for publication. Most of these researches have used the implementation of their model and architecture in data macro datasets on adult datasets. This dataset contains 48842 samples, 14 features, numerical data type, and 6465 missing values [50].

1- Macwan, et al. [28], they have discussed simple anonymity on social networks data where nodes are renamed and the graph structure is not changed. In this way, the attacker, with some background knowledge, can identify the person and understand his communications. Because of this weakness arising from background knowledge, other researchers presented various derivatives of k-anonymity, which will be investigated further.

- k-candidate anonymity of Macwan, et al. [28], A generalization was presented on k-anonymity to ensure that the attacker has at least one level of uncertainty about re-identification in each node in the graph. The method states that not any person can be identified from the other person's k in that graph, so the probability that a person will be identified is not higher than .

- k-degree anonymity of Macwan, et al. [28], in this study, it was found that each node has a common degree with at least 1-k of the other node in the graph. This prevents the attacker from identifying people with basic knowledge of the nodes in the graph. They defined an anonymization problem, which is followed by a G-given graph with an anonymization of k_degree obtained from the G graph and containing the least possible changes of the graph.

2- Kiabod, et al. [50], they provided an algorithm that deals with attacks in which the attacker confronts user privacy by having some knowledge about a person's neighbors and friends. This algorithm tries to add additional edges to the input graph using two data generalization techniques and also to anonymize it by using some extra methods. In summary, this algorithm tries to create vertices to components by scrolling all the vertices of the input graph and then pouring them into components from the perspective of the isomorphic neighboring groups.

Two important drawbacks in this algorithm is that first, a special generalization techniques have been used that can only be applied in specific environments and samples and are not relatively general, as well. Second, during implementing of the algorithm for decision making, whether generalization is used or a new edge is added to the graph. Priority allocation between these two operations is a function used to estimate the cost of each of these two operations, which is ambiguous and random.

3- Zheng, et al. [12], an algorithm use greedy algorithms to anonymize the size of the vertices degree. The input of this algorithm is the G graph with n vertex and the k value as anonymization. The output of this algorithm is a G' graph with anonymized degree k. This algorithm first sorts all the vertexes of the G graph descending in terms of their degree. Then, in each round, by calling the greedy routine, it calculates the number of edges needed to add to the main graph and this method anonymizes the graph.

The advantage of this method is that this algorithm can always produce a valid graph with anonymized degree k, because in the worst case, the graph will produce the equivalent of the G graph, but the problem with this algorithm is that it only deals with one type of privacy attacks and has certain applications; Because of social media, privacy attacks of a degree are rare compared to other attacks, including neighborhood attacks. The attacker may also disclose the added invalid edges with little background knowledge with n times of network attack. The above five algorithms have been investigated for anonymity of social network graph, which will not be done much in the present study, but the algorithms were investigated because of thematic proximity.

4- Victor, et al. [7], According to figure 9, they suggested the idea of publishing the results of a request by adding noise to the data, so the attacker cannot get information with 100% confidence. The results obtained for each individual result from the total database data, not a specific tuple published in the table. A random function K (Equation (5)) generates ε value for all D and D' data sets that differ in a row at most.

(5)

The value of ε depends on the type of data as well as the request. This model can protect privacy even when the attacker has background information. Even if the person does not give the correct information for publication, it does not affect the anonymity algorithm; it is compatible with interactive and non-interactive requests.

Fig. 9. Privacy architecture of limited sensitive property

There are many limitations to RSA-SA privacy.

One of the most important limitations failing to provide assurances about the communication among record and attribute

5- Tekli, et al. [39], Besides applying k-anonymity, they also pay attention to L-Diversity applying on data. This model is based on anonymity by information clustering. For each new data entry, this data is compared with the existing clusters, and it will be done if it can be placed in any of the existing clusters. Sometimes, input data cannot be placed in any cluster. Therefore, it is necessary to do the process of cluster enlargement so that this data can be placed in the most appropriate cluster. The most appropriate cluster is the one that loses the least amount of information by applying the enlargement process. The enlargement process of a cluster is done by increasing the range of variables in each cluster. Since the enlargement of clusters leads to an increase in data loss, the most suitable cluster is the one that needs the least amount of enlargement. In this algorithm, it is necessary not to exceed the defined value of the σ from the time interval between the data entry to its propagation by the cluster. In order to meet this threshold, this algorithm examines whether there is data in any of the existing unpublished clusters that have reached its threshold time. It is necessary to publish the corresponding cluster, if such data exists. If the number of data in this cluster is greater than or equal to k, it can be published immediately. If the number of cluster data is less than k, it uses the combined strategy. Here, the cluster is combined with its nearest neighbor cluster to create a new cluster whose data number is greater than or equal to k. In the following, in order to increase the security level of their model, the authors incorporate the L-Diversity anonymity technique into their model.

This model does not consider any buffer. Therefore, the data will be placed in one of the existing clusters upon arrival. While considering an input buffer, data can be stored in the buffer. Here, in case of arrival before the time threshold, in order to lessen the amount of data loss than the previous one, σ more data-related data can be clustered by defining a new cluster.

6- Otgonbayar, et al. [51], In their proposed model, they first try to present a model for the rapid anonymity of data streams by focusing on numerical data. In the following, they also try to support non-numerical data by expanding their work. Despite the model presented at CASTLE, where data is processed and clustered immediately after arrival, a processing window is defined in this method. The three main variables in this model are K, MU and DELTA, which are used to refer to anonymity variable in k-anonymity method, size intended for processing window and threshold defined for data loss in each cluster, respectively. When the number of input information in the processing window is equal to MU, the first clustering phase is run. At this point, much information in the window may remain in it and not be included in any clusters. New data entry will be possible after several processing window Cells have been emptied. After the formation of information clusters, only clusters are accepted, which besides the number of data in it is greater than or equal to k, the amount of data loss is lower than the defined DELTA threshold. Finally, they suggest the use of Medoid-based clustering algorithms, because of the impossibility of using k-means algorithm for non-numerical attributes. One problem in this model is the lack of considering the threshold limit for keeping a data in the processing window. In this case, a data may remain in the processing window for a very long period, preventing the real-time processing of data that is considered to be the core requirements of the data streams.

7- Kaur, et al. [52], They presented a model to reduce the temporal complexity of processing data streams. In this model, the initial buffer is considered in which the input data is located. Any data stored in this buffer can only remain in this buffer within a time frame as much as δ. During this time period, δ if the data can be placed in any of the existing clusters, this will be done and then the data will be published. If the data cannot be placed in any of the existing clusters, it will be placed in a new cluster with k-1 data in the buffer that they have the closest neighborhood to the data. In case the number of data in the buffer is less than k-1, the suppression method can be used, in which case the value of each variable or part of it is replaced with the phrase *, thus placing any value instead of *. In this model, in addition to the time limit intended to hold each data in the buffer, each cluster that is created can also be used only in limited time Tkc to assign new data to it. It prevents the over-sizing of each cluster by applying this Tkc time limit. In order to examining this method, it should be noted that although this method tries to put the nearest k of data in the buffer into a cluster, but this data is not necessarily very relevant to each other, Therefore, the amount of data loss resulting from this generalization seems to be high.

8- Silva, et al. [44], In the proposed method, tree structure is used for data anonymity. In this tree, each node refers to a cluster of similar data, and the father node for each node is the generalized shape of the child node. At the highest level, the root node is located, which is the most general form defined for quasi-sensitive attributes. After entering each new data into the input buffer, it is checked which nodes of this tree are the most appropriate clusters for this data. In this model, a concept called frozen data is defined as data that meets at least one of the following conditions:

1- Data in nodes that have data numbers less than k.

2- Data that the amount of data loss resulting from their assignment to each node is higher than the δ defined threshold.

Frozen data cannot be anonymized and published by techniques such as generalization, and it is necessary to remain in the same situation until data is entered that can help data out of the frozen data category.

By examining this model, it is determined that if we encounter with the situation where the data is in the frozen category, publishing or not publishing it within an acceptable timeframe depends on the new input data, Therefore, if no appropriate data is entered, this data will still not be allowed to be published. This has damaged the real-time data processing debate.

9- Wang, et al. [32], In the proposed model, they used encryption method to maintain confidentiality in big data. In this model, data are clustered first using rule-based methods. Rule-based methods can be used in large amounts of data, so they can be effective for using big data. In this model, asymmetric encryption of Public key method is used to control access to data. In this model, three levels of security are considered. First, the main task of cryptography is raw data. For this purpose, the RSA encryption method, which is an asymmetric encryption method, is used. Then the signature for the main database manager is added to the encrypted database. In the second level, each middle user ensures the accuracy of the information after verifying the signature added to the database. Then, only a portion of customer information that is needed for data mining is accessed and performs their desired operations. With the help of rule-based methods, information clustering operations are performed. In the third level, which is referred to as the Public Layer, all users have the possibility to access the rules extracted at the second level, but the original data is kept hidden from users

As mentioned above, in this model, efforts have been made to preserve the confidentiality of users' information by using encryption methods. While in practice, cryptographic methods are not appropriate for this volume of data because of computational overhead and in situations requiring real-time processing.

10- Mehta, et al. [11], They try to provide a model for maintaining the confidentiality of big data. This model comprises three main components: information anonymity, update component, anonymity management component.

Anonymity operations are carried out in the information's anonymity. By the same token, the generalization method is used to anonymize information and thus, each data is mapped to an appropriate generalized level. By considering the entry of new

Information the update component maps them to appropriate levels. All data is mapped to the most appropriate level of anonymity after entering the information. In the meantime, we may encounter a situation where updates need to be made at the anonymity level of the database when new information arrives. In this case, the entire database will be mapped to a new level. In order to avoid the cost of recalculating the anonymized information, the anonymity information management component keeps the anonymized information. It can be observed that in this method, it has not been paid attention to all big dimensions of data. For example, considering the dimension of "variation" in the big data, no mechanism for assigning the appropriate level of anonymity according to the type of input data is considered in this method. It should be noted that in this method, if any updates are required, this kind of change applies to the entire database. Besides the high computational cost, this operation also increases the execution time and is considered as an obstacle to real-time processing. Considering the big dimensions of the data, it has been tried to achieve the lowest amount of data wasting by maintaining the confidentiality of information. For this purpose, it is tried to place the related data into a subgroup by dividing the data into appropriate subgroups. Changes will only apply to a part of the database if anonymity level is needed, because of the arrival of new information. Considering the time limit in determining the appropriate anonymity level, the grounds for real-time processing are provided.

IV. Comparison of Different Parameters of Previous Algorithms

In Table IX, different parameters of the algorithms

presented in related tasks such as the rate of loss of original data, time order, types of data that have had an anonymized approach, usability and favorability for big data technology, and the delay in responding algorithms to access the desired data were all investigated.

According to the comparison of different parameters, the results evidently show that the tools used in big data suffer from significant delays.

Table IX

Comparison Table of Previous Parameters and Algorithms

Delay Rate	Appropriate for Big Data	Generate Additional Data	Types of Data	Type Complexity	Data Loss Rate	Parameters	Algorithms
Low	Rather	Yes	Numerical	O(S)	Very High	Additive Perturbation	Disorder-Based
Low	Rather	Yes	Numerical	O(S)	Very High	Multiplicative Perturbation	Disorder-Based
High	inappropriate	No	Numerical and Categorical	O(S2 logS)	Very High	SWAF	Based on Tree Structure
Very Low	inappropriate	Yes	Numerical and Categorical		Very High	DF	Based on Artificial Data and Noise
High	inappropriate	No	Numerical		Medium	CASTEL	Based on Artificial Data and Noise
High	Inappropriate	No	Numerical and Categorical		Medium	FAANEST
High	Inappropriate	No	Numerical and Categorical	O(S)	Medium	FADS
No Important	Appropriate	No	Numerical and Categorical	O(S)	Low	TPTDS
Low	Appropriate	No	Numerical and Categorical	O(S)	Low	FAST

V. CONCLUSION

Big data are produced in a structured, semi-structured, quasi-structured and unstructured manner, which about 80% of the data are unstructured. Therefore, existing algorithms must be changed differently to protect the privacy of unlimited data. Big data with their unique properties have special requirements that need to be updated in order to use existing methods in the field of anonymity and to maintain the confidentiality of information on these data. To ensure the privacy of user data is recognized as a major research challenge. Cryptographic and stochastic algorithms are designed for smaller datasets. Thus, a new privacy framework is needed to manage the volume of information distributed among clusters.

Anonymity is not just deleting some attributes and replacing values with other values, etc. In fact, finding an optimal compromise between privacy and usability along with reducing the rate of data loss is one challenge ahead. In this paper, first, the importance of extracting and managing knowledge to gain value from the obtained knowledge was discussed. Then, different operators and anonymity methods such as anonymity of disturbance-based data streams, tree structure, real-time anonymity, based on artificial data addition, fuzzy, cluster-based methods and conventional anonymity algorithms were presented along with comparing their different parameters. Finally, the proposed architecture and model of several researches and related works that dealt with anonymity on big data were presented and factors such as data loss rate, amount of additional data production, proportionality for the big data environment and their delay were compared with each other. But some challenges, including the development of efficient and scalable algorithms appropriate for anonymity, along with the preservation of real data for big data along with time consumable space and efficient memory, can be considered as future research.

References

[1] Zhao, P., Jiang, H., Wang, C., Huang, H. , Liu, G. and Yang, Y.,2019. On the Performance of k-Anonymity Against Inference Attacks with Background Information. IEEE Internet of Things Journal,6,(pp.808 –819).

[2] Dehghantanha , A. and Choo, K.K.,2019. Handbook of Big Data and IoT Security. Springer Nature Switzerland AG.

[3]

Patnaik, S.,2020. New Paradigm of Industry 4.0 Internet of Things, Big Data & Cyber Physical Systems. Springer International Publishing.

[4] Chaudhary, A., Shaudhary, Ch., gupta, M.K., Lal, Ch. and Badal, T.,2020. Microservices in Big Data Analytics. Springer Nature Singapore Pte Ltd.

[5] Zhang, X., Liu, Ch., Nepal, S., Yang, Ch. and Chen, J.,2013. Privacy Preserving Over Big Data in Cloud System, Security, Privacy and Trust in Cloud Systems. Springer, Verlag Berlin, Heidelberg.

[6]

Salas a, J., Domingo-Ferrer, J.,2018. Some Basics on Privacy Techniques, Anonymization and their Big Data Challenges, Mathematics in Computer Science. Springer International Publishing AG,12, pp.263–274.

[7]

Victor, N., Lopez, D. and Abawajy, J.,2016. Privacy Models for Big Data: a Survey. International Journal of Big Data Intelligence,3,pp.61–75.

[8] Choo, K.K. and Dehghantanha, A., 2020. Handbook of Big Data Privacy. Springer Nature Switzerland AG.

[9]

Al-Zobbi, M., Shahrestani, S. and Ruan, Ch., 2017. Improving MapReduce privacy by implementing multi-dimensional sensitivty-based anonmization. Content courtesy of Springer Nature,4, pp.1-23.

[10] Dean, J. and Ghemawat, S.,2010. MapReduce: A Flexible Data Processing Tool. Communications of the ACM,53,pp.72-77.

[11] Mehta, B. B. and Rao U. P.,2018. Toward Scalable Anonymization for Privacy-Preserving Big Data Publishing. Recent Findings in Intelligent Computing Techniques, Advances in Intelligent Systems and Computing, Springer,2, pp.297-304.

[12] Zheng, W., Wang, Z.,Lv, T., Ma, Y. and Jia, C.,2018. K-Anonymity Algorithm Based on Improved Clustering. International Conference on Algorithms and Architectures for Parallel Processing,11335, (pp.462-476).

[13] Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M.J., Shenker, S. and Stoica, I.,2012. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. University of California, Berkeley(pp. 1-14).

[14] Mohan Rao, P. R., Krishna, S. M. and Siva Kumar, A. P. 2018. Privacy Preservation Techniques in Big Data Analytics: A Survey. Journal of Big Data, Springer Open,5, pp.1-12.

[15] Khan, S., Iqbal, Kh., Faizullah, S., Fahad, M., Ali, J. and Ahmed, W.,2019. Clustering Based Privacy Preserving of Big Data Using Fuzzification and Anonymization Operation. International Journal of Advanced Computer Science and Applications,10,pp.282–289.

[16] Dobson, A., Roy, k., Yuan, X. and Xuan, J., 2018. Performance Evaluation of Machine Learning Algorithms in Apache Spark for Intrusion Detection. International Telecommunication Networks and Applications Conference(ITNAC) (pp.1-6).

[17] Ullah Bazai, S. and Jang-Jaccard, J.,2019. SparkDA: RDD-Based High-Performance Data Anonymization Technique for Spark Platform. International Conference on Network and System Security. Springer(pp.646-662).

[18] Canbay, Y. and Sagiroglu, S.,2017. Big Data Anonymization with Spark. International Conference on Computer Science and Engineering(pp.833-838).

[19] Al-Zobbi, M., Shahrestani, S. and Ruan, Ch., 2018. Experimenting Sensitivity-Based Anonymization Framework in Apache Spark. Springer Journal of Big Data,5,pp.1-26.

[20] Mittal , M., Balas, V.E., Goyal, L.Mohan and Kumar, R., 2019. Big Data Processing Using Spark in Cloud. Springer Nature Singapore Pte Ltd.

[21] He, Z., Cai, H.,2018. Latent-Data Privacy Preserving With Customized Data Utility for Social Network Data. IEEE Transactions on Vehicular Technology,67, (pp.665 – 673).

[22] Sangeetha, S. and Sadasivam, G.S.,2019. Privacy of Big Data: A Review. Springer Cham.

[23]

Ouazzani, Z., Bakkali and H.,2018. A New Technique Ensuring Privacy in Big Data: K-Anonymity Without Prior Value of the Threshold k. Elsevier-The First International Conference On Intelligent Computing in Data Sciences,127, (pp.52-59).

[24] Fei, F., lih. sh., Dai, Hu, Ch., Dou, W. and Ni, Q., 2017, A K-Anonymity Based Schema for Location Privacy Preservation, IEEE Transactions on Sustainable Computing ,4,(pp.156 – 167).

[25] Canbay, Y., Vural, Y. Sagiroglu, S.,2018. Privacy Preserving Big Data Publishing. International Congress on Big Data, Deep Learning and Fighting Cyber Terrorism(pp. 24-29). IEEE

[26] Kayem, A., Vester, C.T., Meinel, Ch.,2016. Automated K-Anonymization and l-Diversity for Shared Data Privacy. International Conference on Database and Expert Systems Applications, 9827,(pp.105-120).

[27] Alabdullah, B., Beloff, N. and White, M.,2018. Rise of Big Data–Issues and Challenges. Saudi Computer Society National Computer Conference (pp.1-6).

[28] Macwan, K.R., patel, S.J.,2018. k-NMF Anonymization in Social Network Data Publishing. Seurity IN Computer Systems And Networks The Computer Journal,61,,pp.601–613.

[29] Reiza, A., Armengol de la Hoz, M.A. and Garcíaa, M.S.,2019. Big Data Analysis and Machine Learning in Intensive Care Units. Elsevier Espana, ˜ S.L.U. and SEMICYUC, 43, pp.416-426.

[30]

Novotny, J., Bilokon, P.A., Galiotos, A. and Deleze, F.,2019.Machine Learning and Big Data with kdb+/q, Great Britain by TJ International Ltd, Padstow, Cornwall, UK Wiley.

[31]

Bowles, M., 2019. Machine Learning with Spark and Python: Essential Techniques for Predictive Analytics. John Wiley & Sons, Inc., Indianapolis, Indiana.

[32]

Wang, J., Cai, Zh., li, Y., yang, D., li, l. and Hao, H.,2018. Protecting Query Privacy with Differentially Private K-Anonymity in Location-Based Services. Springer-Verlag London Ltd.

[33] Arbuckle, L. and Emam, K.E., 2020. Building an Anonymization Pipeline. O'Reilly Media.

[34] Prasad Reddy, S., Raju, K.V.S.V.N. and Valli Kumari, V., 2018. Personalized Privacy Preserving Incremental Data Dissemination Through Optimal Generalization. Journal of Engineering and Applied Sciences,13.pp.4205 – 4216.

[35] Domingo-Ferrer, J.,2018. Big Data Anonymization Requirements vs Privacy Models. International Joint Conference on e-Business and Telecommunications, 2,(pp.305-312).

[36] Abdelhameed, S.A.,Moussa, Sh.M. and Khalifa, M.E.,2019. Restricted Sensitive Attributes-based Sequential Anonymization (RSA-SA) Approach for Privacy-Preserving Data Stream Publishing. Elsevier Knowledge-Based Systems,164,(pp.1-20).

[37]

Canbay, Y., Kalyoncu, A., Ercimen, M. , Dogan, A. and Sagiroglu, S.,2019. A Clustering Based Anonymization Model for Big Data. International Conference on Computer Science and Engineering Applications (pp.720-725).

[38]

Ni, S., Xie, M. and Qian,Q.,2017. Clustering Based K-Anonymity Algorithm for Privacy Preservation. International Journal of Network Security,19, pp.1062-1071.

[39] Tekli, J., Al Bouna, B., Bou Issa, Y., Kamradt, M., Haraty R.,2018. (k,l)-Clustering for Transactional Data Streams Anonymization. International Conference on Information Security Practice and Experience(pp.544–556).

[40] Guo, K., Zhang, Q.,2013. Fast Clustering-Based Anonymization Approaches with Time Constraints for Data Streams. Knowledge-Based Systems,46, pp.95–108.

[41] Wang, Y., Chi, Zh., Tong, x., Li, L.,2018. A differentially k-anonymity-based location privacy-preserving for mobile crowdsourcing systems. Procardia Computer Science,129, pp.28-34.

[42]

Eyupoglu, C., Aydin, M., Zaim, A. and Sertbas, A.,2018. An Efficient Big Data Anonymization Algorithm Based on Chaos and Perturbation Techniques. Entropy,20,pp.1-18.

[43] Nezarat, A. and Yavari, Kh.,2019. A Distributed Method Based on Mondrian Algorithm for Big Data Anonymization. Springer International Publishing,pp. 84-97.

[44] Silva, H., Basso, T. , Moraes, R., Elia, D. and Fiore, S., 2018. A Re-Identification Risk-Based Anonymization Framework for Data Analytics Platforms. European Dependable Computing Conference (pp.101-106).

[45] Abouelmehdi, K.,Hessane, A.B. and Khaloufi, H.,2018. Big healthcare data: preserving security and privacy. Springer Jornal of Big Data,113, pp.1-18.

[46] Domingo-Ferrer, J., Peji´c-Bach, M. ,2016. Anonymization in the Time of Big Data. Springer International Publishing Switzerland.

[47] lee, R., 2019. Big Data, Cloud Computing and Data Science Engineering. Springer International Publishing.

[48] Zgurovsky, M.Z. and Zaychenko,Y.P.,2020. Big Data: Conceptual Analysis and Applications. Springer, 2020.

[49] Kumar Mishra, D.,SH Yang and X., Unal, A. 2019. Data Science and Big Data Analytics, Springer Singapore.

[50] Kiabod, M., Dehkordi, M.N., Barekatain, B.,2019.TSRAM: A time-saving k-degree anonymization method in social network. Elsevier Expert Systems with Applications.125,pp.378-396.

[51] Otgonbayar, A., Pervez, Z., Dahal, K. and Eager, S.,2018. K -VARP: K -anonymity for varied data streams via partitioning. Elsevier Information Sciences,467,pp. 238–255.

[52] Kaur, G. and Agrawal, S.,2019. Differential Privacy Framework: Impact of Quasi-identifiers on Anonymization. Springer Nature Singapore Pte Ltd.

[53]

Khan,R., Tao,X., Anjum,A., Kanwal,T., Rehman Malik,S.U.,Khan ,A., Rehman ,W.U. and Maple,C.,2020. θ-Sensitive k-Anonymity: An Anonymization Model for IoT based Electronic Health Records. Electronics, 9,pp.1-24.

[54]

yang, Ch.n., peng, Sh.l. and jain, L.c.,2020. Security With Intelligent Computing and Big-Data Services. International Conference on Security with Intelligent Computing and Big-data Services.

[55] Oneto, L., Navarin, N., Sperduti, A. and Anguita, D.,2020. Recent advances in big data and deep learning. Springer Proceedings of the INNS big data and deep learning.

[56] Jain, P., Gyanchandani, M. and Khare, N., 2019. Improved k-Anonymity Privacy-Preserving Algorithm Using Madhya Pradesh State Election Commission Big Data, Springer Nature Singapore Pte Ltd.

[57]

Zhao, P. , Jiang, H. , Wang, C. and Huang, H.,2018. Non-asymptotic Bound on the Performance of k-Anonymity against Inference Attacks. International Conference on High Performance Computing and Communications (pp. 570-577). IEEE.

[58]

Gholami, A. and Laure,E.,2016. Big Data Security and Privacy Issues in the Cloud. International Journal of Network Security & Its Applications,8, pp.59-79, 2016.

[59] MehtaUdai, B.B. and Rao, P.,2019. Improved l-diversity: Scalable anonymization approach for Privacy Preserving Big Data Publishing. Journal of King Saud University - Computer and Information Sciences, pp.1-8.

[60] Mehta, B.B. and Rao, U.P.,2017. Privacy Preserving Big Data Publishing: A Scalable k-Anonymization Approach using MapReduce. The Institutions of Engineering and Technology,11, 271-23.

[61]

Li, N., Huang, L., Li, Y., Sun, M.,2021. Efficient and Privacy-Preserving Multi-User Outsourced K-Means Clustering. Computer and Information Science - Published by Canadian Center of Science and Education, 14, pp.26-41.

[62] Torra, V. and Navarro-Arribas, G.,2016. Big Data Privacy and Anonymization. IFIP Advances in Information and Communication Technology ,pp.15-26.

[63] Jain, P., Gyanchandani, M., Khare, N.,2016. Big Data Privacy: A Technological Perspective and Review. Springer Journal of Big Data,3,pp.3-25.

[64] Otgonbayar, A., Pervez, Z. and Dahal, K.,2019. Partitioning Based Incremental Marginalization Algorithm for Anonymizing Missing Data Streams. International Conference on Software, Knowledge, Information Management and Applications(pp.1-7). IEEE.

[65] Kima, S., Kyoung Sungb, M. and Dohn Chung, Y.,2014. A Framework to Preserve the Privacy of Electronic Health Data Streams. Journal of Biomedical Informatics, Elsevier Inc. All rights reserved,10,pp.95-106.

[66] Tortikar,P., 2019. K-Anonymization Implementation Using Apache Spark. In Partial Fulfillment of the Requirements for the Degree of Master of science Fargo, North Dakota, pp.14-25

[67] Khakata.E N, Omwenga.V O, Msanjila. S S, 2020. Prediction of Student Learning Styles using Data Mining Techniques. Journal of Advances in Computer Engineering and Technology, 6, pp.107-118.

[68] R.J, 2019. A Hybrid Data Clustering Algorithm Using Modified Krill Herd Algorithm and K-MEANS. Journal of Advances in Computer Engineering and Technology,5, pp. 93-106.

[69] Conrado Faria Gomes,V., Ribeiro de Queiroz,G. and Reis Ferreira,K.,2020. An Overview of Platforms for Big Earth Observation Data Management and Analysis. Remote Sens,12,pp.1-25.

[70] Mohammadi,M., Al-Fuqaha, A., Sorour, S. and Guizani, M.,2018. Deep Learning for IoT Big Data and Streaming Analytics: A Survey. IEEE Communications Surveys&Tutorials,20, (pp.2923–2960).

[71] Mazumder,S., Singh Bhadoria, R. and Chandra Deka, G.,2017. Distributed Computing in Big Data Analytics Concepts. Springer International Publishing AG.

[72] Karle,T. and Vora,D.,2017. PRIVACY Preservation in Big Data Using Anonymization Techniques. International Conference on Data Management, Analytics and Innovation (pp.1-8).

[73] Jha,N., Favale,T., Vassio,L., Trevisan,M. and Mellia,M., 2021. z-anonymity: Zero-Delay Anonymization for Data Streams. International Conference on Big Data (pp. 3996-4005). IEEE.

[74] Ferreira Marques,J. and Bernardino,J.,2020, Analysis of Data Anonymization Techniques. International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (pp. 235-241).

[75] N. Vokinger,K., J. Stekhoven, D. and Krauthammer,M.,2021. Lost in Anonymization - A Data Anonymization Reference Classification Merging Legal and Technical Considerations. journal-of-law-medicine-and-ethics,48, pp. 228-231.

[76] E. Olatunji,I., Rauch,J., Katzensteiner,M. and Khosla,M.,2021. A Review of Anonymization for Healthcare Data. Oxford University Press,pp.1-14.

[77] Chevrier,R.,Foufi ,V., Gaudet-Blavignac ,Ch., Robert ,A. and Lovis ,Ch.,2019. Use and Understanding of Anonymization and De-Identification in the Biomedical Literature: Scoping Review. Journal of Medical Internet Research,21,pp.1-5.

[78] Liang ,Y.,Samavi,R.,2020. Optimization-Based K-Anonymity Algorithms. Computers & Security,93, pp.1-22.

[79] Li,Y., Chen,J., Li,Q. and Liu,A.,2020. Differential Privacy Algorithm Based on Personalized Anonymity. International Conference on Big Data Analytics (pp.226-233). IEEE.

[80] Waranya,M., W. Art and Natwichai, J.,2020. Data Anonymization: A Novel Optimal K-Anonymity Algorithm for Identical Generalization Hierarchy Data in IoT. Springer-Verlag London Ltd.,14, pp. pages 89–100.

[81] Cunha,M., Mendes ,R.,P.Vilela,J.,2021. A Survey of Privacy-Preserving Mechanisms for Heterogeneous Data Types. Computer-Science-Review,41,pp.1-10.

[82] Lin,G.,Zhao,H.,Zhao,L.,Gan,X.,Yao,Zh.,2020, Differential Privacy Information Publishing Algorithm based on Cluster Anonymity. International Conference on Big Data, Artificial Intelligence and Internet of Things Engineering(pp.226-233).

[83] Chuang,H., Geng,Y., Yun-lu,B.,2019. Clustering Algorithm in Differential Privacy Preserving. Computer Science, 46, pp. 120-126.

[84] Kiranmayi,M., Maheswari,N.,2021. A Review on Privacy Preservation of Social Networks Using Graphs. Journal of Applied Security Research,16,pp. 190-223.

Elham Shamsinejad is a Ph.D. student in the Department of Computer Engineering, Central Tehran Branch, Islamic Azad University, Tehran, Iran. Her research interests include Machine learning, Deep learning, Big Data, Data Analytics, Python Programming.

(elham.shamsinejad@yahoo.com).

تکمیل ظرفیت تحصیلات تکمیلی سال ۹۵

Touraj Banirostam is an Assistant Professor in the Department of Computer Engineering, Central Tehran Branch, Islamic Azad University, Tehran, Iran. His research interests include Cognitive Science Engineering, Artificial Intelligence, Learning, Self-Management Systems.

(banirostam@iauctb.ac.ir).

Mir Mohsen Pedram received the B.Sc. degree in electrical engineering from the Isfahan University of Technology, Isfahan, Iran, 1990, and the M.Sc. and Ph.D. degrees in electrical engineering from Tarbiat Modares University, Tehran, Iran, in 1994 and 2003, respectively. He is currently an Associate Professor with the Department of Electrical and Computer Engineering, Kharazmi University. His main areas of research are intelligent systems, machine learning, data mining, and cognitive science.

(pedram@khu.ac.ir).

چهارمین کنفرانس ملی محاسبات نرم در مهندسی برق و کامپیوتر – راهکار های موقعیت یابی افراد و اشیا در لحظه

Amir Masoud Rahmani is currently working as a Professor for Islamic Azad University, science and research branch, Tehran. He is the author/co-author of more than 220 publications in technical journals and conferences. His research interests are in the areas of distributed systems, wireless sensor networks, Internet of Things and evolutionary computing. Address: Amir Masoud Rahmani, Computer Engineering dept, Islamic Azad University, Science and Research branch, Hesarak, Ashrafi Esfahani, Poonak Square, Tehran, IRAN.

(rahmani@srbiau.ac.ir).

HF-Blocker: Detection of Distributed Denial of Service Attacks Based On Botnets
Print Date : 2015-08-01
A Review of Fraud Detection Algorithms for Electronic Payment Card Transactions
Print Date : 2021-08-01
IMNTV-Identifying Malicious Nodes using Trust Value in Wireless Sensor Networks
Print Date : 2018-05-01
The Idea Of Using The Steganography As Encryption Tool
Print Date : 2018-02-01
Detecting Bot Networks Based On HTTP And TLS Traffic Analysis
Print Date : 2020-05-01
Detecting Active Bot Networks Based on DNS Traffic Analysis
Print Date : 2019-08-01

Delay Rate	Appropriate for Big Data	Generate Additional Data	Types of Data	Type Complexity	Data Loss Rate	Parameters	Algorithms
Very Low	Rather	No	Numerical and Categorical	--	Medium	Victor, et al. [7]	Based on Artificial Data and Noise
High	Rather	No	Numerical and Categorical	O((V+E)logk)	Low	Macwan, et al.[28] k-anonymity	Graph-Based Social Networks
High	Rather	No	Numerical and Categorical	O((V+E)logk(	Low	Macwan, et al. [28] k-candidate
High	Rather	No	Numerical and Categorical	O((V+E)logk(	Low	Macwan, et al. [28] k-degree
High	Rather	No	Numerical and Categorical	O((V+E)logk(	Low	Kiabod, et al. [50]
Very Low	Rather	No	Numerical and Categorical	O((V+E)logk	Low	Zheng, et al. [12]
Very Low	Appropriate	No	Numerical and Categorical	--	Medium	Tekli, et al. [39]	Based on Clustering
High	Appropriate	No	Numerical		Medium	Otgonbayar, et al. [51]
High	Appropriate	No	Numerical and Categorical		High	Kaur, et al. [52]
Very Low	Appropriate	No	Numerical and Categorical	--	Medium	Silva, et al. [44]
High	Appropriate	No	Numerical and Categorical		Low	Wang, et al. [32]
High	Appropriate	No	Numerical and Categorical		Low	Mahta, et al. [11]

Share To

Article Url

A Review of Anonymity Algorithms in Big Data