A Review of Anonymity Algorithms in Big Data
Elham Shamsinejad (1), Mir Mohsen Pedram (2), Amir Masoud Rahmani (3), Touraj BaniRostam (4)
1 - Department of Computer Engineering, Islamic Azad University, Central Tehran Branch (IAUCTB), Tehran, Iran
2 - Associate Professor, Electrical and Computer Engineering Department, Faculty of Engineering, Kharazmi University, Tehran, Iran
3 - Department of Computer Engineering, Science and Research Branch, Islamic Azad University, Tehran, Iran
4 - Department of Computer Engineering, Islamic Azad University, Central Tehran Branch (IAUCTB), Tehran, Iran
Keywords: Confidentiality Disclosure, Big Data, Anonymity
Journal of Advances in Computer Engineering and Technology
Abstract—With increasing access to high volumes of data through internet-based technologies such as social networks, mobile phones and electronic devices, many companies have faced the problem of handling large, varied and fast-arriving data while maintaining data confidentiality. Confidentiality concerns and protection against the disclosure of specific data are therefore among the most challenging topics. In this paper, a variety of data anonymity methods, anonymity operators, and the attacks that can endanger data anonymity and lead to the disclosure of sensitive data in big data are investigated. Different aspects of big data, such as data sources, content format, data preparation, data processing and common data repositories, are also discussed. Privacy attacks and countermeasure techniques such as k-anonymity, t-closeness and l-diversity are examined, and two main challenges of applying k-anonymity to big data are identified. The first challenge is that confidential attributes can also act as quasi-identifier attributes, which increases the number of quasi-identifier elements and may lead to the loss of a great deal of information in order to achieve k-anonymity. The second challenge is that, in big data, the unlimited number of data controllers is likely to lead to the disclosure of sensitive data through independent k-anonymous publications. Different anonymity algorithms are then presented, and finally the time complexity and space consumption of big data anonymity algorithms are compared.
Index Terms—Big Data, Anonymity, Confidentiality Disclosure.
I. Introduction
Data is produced in various forms by different sources in the heterogeneous environment of the Internet. The volume of information stored and being produced is increasing every day, which has caused limitations in data storage and processing. Information sources, including social networks, information sensors, medical images and satellite imagery, produce structured, semi-structured and unstructured data continuously and uninterruptedly, with volumes exceeding several terabytes [1]-[3].
T. Banirostam is Assistant Professor of the Computer Engineering Department, Islamic Azad University, Central Tehran Branch, Tehran, Iran (e-mail: banirostam@iauctb.ac.ir).
M. M. Pedram is Associate Professor of the Electrical and Computer Engineering Department, Faculty of Engineering, Kharazmi University, Tehran, Iran (e-mail: pedram@khu.ac.ir).
A. M. Rahmani is Full Professor of the Computer Engineering Department, Science and Research Branch, Islamic Azad University, Tehran, Iran (e-mail: rahmani@srbiau.ac.ir).
Collecting, maintaining and processing data with the attributes listed above used to require supercomputers and large data centers, and providing such infrastructure was neither practical nor cost-effective for many companies [4]. Such huge amounts of data could not be managed by relational database management systems. Analyzing and processing large amounts of data can provide valuable information to business owners that adds to their competitive advantage, and in the medical sciences it can help to discover and diagnose diseases [5]. The collection of digital information about governments, organizations, and individuals has provided many opportunities for knowledge-based decision-making. Considering the benefits of data interchange between different parties and the existence of laws requiring the publication of certain data, data interchanges are made between different individuals and parties. Also, by analyzing such data in social networks, the behavioral patterns of different users can be detected, and recommendations suited to their taste can be offered accordingly. As this issue has gained importance and the use of this technology has expanded, along with the many benefits of operating on big data, grounds for abuse by attackers have also emerged, creating serious security threats [6], [58].
Data security is an issue in any environment. In big data, because sensitive data is distributed among different computational resources, unauthorized access is easier than in centralized data structures. The expansion of distributed computing infrastructure, as well as the wide range of mobile devices, has raised concerns about the processing and sharing of users' personal and sensitive data. In order to maintain data confidentiality, various mechanisms such as encryption, access control and auditing have been considered in this framework [7]. These mechanisms, despite their proper performance, face many questions in the big data and distributed computing space [8]. For example, the data in a big data system is not confined to a single data store and is usually distributed between multiple data warehouses and applications. If the data is converted to encrypted form, many applications may not be able to work with it. In addition, data holders and data users are usually separate from each other, so encryption and access control methods cannot be used effectively. Here, anonymization methods can be useful [9]. Violation of people's privacy is one of the most important threats arising from the production and collection of data from various sources, and from their aggregation and analysis.
The significant increase in data volume, the rate of data production, and the high variety in data structure have caused traditional methods of data processing and management to lose their efficiency. Therefore, companies have moved to manage and process their data with a new technology known as big data. Data streams are one of the most important types of data; exploring them uncovers hidden patterns and provides valuable information to different sciences [7]. Besides these advantages, because data from different sources is aggregated and explored, protecting people's privacy and keeping corporate business secrets becomes particularly important. Various researches have been conducted to solve this problem [9]. Each of these methods has weaknesses that make it impossible or suboptimal to use them for the anonymization of a big data stream. To prevent the disclosure of personal information, personal IDs, such as national codes, insurance numbers, and other attributes that directly distinguish one person from others, are removed from the relevant tuples before publication [10], [11], [59], [60]. Sometimes, despite the deletion of these identifiers, attackers access people's personal information using publicly available databases. For example, Netflix is the largest online movie service with millions of customers. The company tries to offer the best films based on customer interest, using video ratings. Netflix sponsored a competition in which the team that offered the best recommendation engine would win a prize. During the competition, it was shown that a particular person could be identified by combining the published movie-rating data with public tables containing personal information [12], [13]. AOL also published a data set of phrases searched by customers, leading to a violation of customers' privacy. To solve this problem, much research has been conducted to preserve the anonymity of individuals with the fewest changes to the data set. In this context, methods such as k-anonymity have been presented. These methods change the data so that each record becomes similar to at least k-1 other records in the data set, thereby preventing a particular person from being identified in the data set [14].
In some applications, a large amount of data arrives at the system as a data stream that requires anonymization in real-time. For example, wireless sensor networks are used in smart homes as the elderly population grows and needs to be taken care of at home. These sensors are used to monitor the condition of patients and the elderly [15]. The sensors transfer patients' information to medical centers, and this information should be anonymized in real-time. If users' information is disclosed, it may be abused by individuals or advertising companies. It has been shown in [16] that anonymizing these types of data streams with existing algorithms is a hard problem. Therefore, it seems necessary to provide methods to anonymize big data streams [17], [18]. Various anonymity algorithms have been proposed by scientists and researchers, some of which are theoretically proven to provide acceptable privacy. However, as in many other areas of information security, such as encryption and decryption, intrusion and detection, or malware and anti-malware, the conflict between attackers and defenders is endless [19], [20].
The rest of the article is organized as follows. In the second part, the subject literature is presented. In the third part, solutions and research related to the privacy of data streams and big data are introduced and described; these solutions are studied in five sections, based on perturbation, tree structures, artificial data addition, fuzzy methods and clustering methods. In the fourth part, different parameters of the previous algorithms are compared, and in the fifth part, conclusions and future works are presented.
II. Subject Literature
In the data collection phase, the trusted data publisher receives data streams from different sources. The data publisher anonymizes this data online before publishing it for public use or data mining. In healthcare systems, where real-time decision-making is one of the essential requirements, the data anonymizer must act in such a way as not to have a negative impact on the real-time dissemination of data [18]. Typically, in privacy-preserving data publishing methods, each tuple t is considered as in Equation (1):
t = (Explicit Identifier, Quasi Identifier (QI), Sensitive Attributes, Non-sensitive Attributes)    (1)
The explicit identifier includes a set of attributes, such as an employee code or a national code, that can uniquely distinguish the owner of a tuple from the others. The quasi-identifier might identify tuple owners if it is linked to other data sets. Sensitive attributes represent attributes, such as the type of disease or the amount of income, that a person does not wish to disclose. Non-sensitive attributes include all attributes that do not fall into the previous three categories. Data may be processed as data streams or as pre-saved tables, on which anonymity is defined as follows: anonymity is a process that attempts to hide people's identifiers or the values of their sensitive attributes from others [20].
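As a concrete illustration of Equation (1), the following minimal Python sketch partitions a hypothetical patient record into the four attribute categories (the record and the category assignment are invented for illustration; real assignments depend on the dataset and the threat model):

```python
# Hypothetical partition of one tuple t into the four attribute categories.
record = {
    "national_code": "0012345678",  # explicit identifier: removed before publication
    "zip_code": "13053",            # quasi-identifier: generalized
    "age": 28,                      # quasi-identifier: generalized
    "nationality": "Russian",       # quasi-identifier: generalized
    "condition": "Heart Disease",   # sensitive attribute: kept but protected
    "room_number": "B12",           # non-sensitive attribute
}

explicit_identifiers = {"national_code"}
quasi_identifiers = {"zip_code", "age", "nationality"}
sensitive_attributes = {"condition"}

# The published tuple simply drops the explicit identifiers.
published = {k: v for k, v in record.items() if k not in explicit_identifiers}
print(published)
```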
The information that the attacker gains by combining public databases, his background knowledge, and a newly published anonymized database should not be greater than what he could gain without the new database. Of course, it is known [21] that such a definition cannot be satisfied entirely, and the attacker's background knowledge inevitably causes some information leakage. In the anonymization of data streams, linkage attacks are introduced in three categories. As in the Netflix case, in such attacks the current data stream can be linked to existing databases and lead to a violation of people's privacy.
K-anonymity
K-anonymity, formulated by Latanya Sweeney in 2002 [26], is proposed to guarantee that the protected target cannot be distinguished from k-1 other objects. k-anonymity [27] is a privacy model that requires each record to be indistinguishable from at least k-1 records within the published data, even if an attacker knows the values of the victim's quasi-identifiers. It provides a solution to the record linkage attack. In this method, the aim is to put the original identifiers out of reach, and the quasi-identifier data is grouped in such a way that a particular individual cannot be uniquely extracted by combining information [27]. In the process of data anonymization, explicit identifiers are removed and quasi-identifiers are generalized. In Table I [7], information about patients is shown from which explicit identifiers such as first name, last name, and national code have been deleted. The aim is to prevent a person's illness, as a sensitive attribute, from being uniquely exposed. For this purpose, in Table II, the data is categorized and anonymized in four similar clusters.
In Table III, raw data is shown with no deletions or changes. After removing the explicit identifiers, the result is seen in Table IV, and after applying a 2-anonymity algorithm, a table similar to Table V is obtained.
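Checking k-anonymity is straightforward to express in code. The sketch below groups rows by their quasi-identifier values and verifies that every equivalence class contains at least k records; the toy rows echo the generalized values of Table II:

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every equivalence class (rows sharing the same
    quasi-identifier values) contains at least k rows."""
    groups = Counter(tuple(row[a] for a in quasi_identifiers) for row in rows)
    return all(count >= k for count in groups.values())

# One equivalence class of Table II: four records, so 4-anonymous.
rows = [
    {"zip": "130**", "age": "<30", "nat": "*", "condition": "Heart Disease"},
    {"zip": "130**", "age": "<30", "nat": "*", "condition": "Heart Disease"},
    {"zip": "130**", "age": "<30", "nat": "*", "condition": "Viral Infection"},
    {"zip": "130**", "age": "<30", "nat": "*", "condition": "Viral Infection"},
]
print(is_k_anonymous(rows, ("zip", "age", "nat"), k=4))  # True
```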
Table I
Patients' Main Information (Zip Code, Age and Nationality are non-sensitive; Condition is sensitive)
No. | Zip Code | Age | Nationality | Condition
1 | 13053 | 28 | Russian | Heart Disease
2 | 13068 | 29 | American | Heart Disease
3 | 13068 | 21 | Japanese | Viral Infection
4 | 13053 | 23 | American | Viral Infection
5 | 14853 | 50 | Indian | Cancer
6 | 14853 | 55 | Russian | Heart Disease
7 | 14850 | 47 | American | Viral Infection
8 | 14850 | 49 | American | Viral Infection
9 | 13053 | 31 | American | Cancer
10 | 13053 | 37 | Indian | Cancer
11 | 13068 | 36 | Japanese | Cancer
Table II
4-Anonymity Information of Patients
No. | Zip Code | Age | Nationality | Condition
1 | 130** | <30 | * | Heart Disease
2 | 130** | <30 | * | Heart Disease
3 | 130** | <30 | * | Viral Infection
4 | 130** | <30 | * | Viral Infection
5 | 1485* | ≥40 | * | Cancer
6 | 1485* | ≥40 | * | Heart Disease
7 | 1485* | ≥40 | * | Viral Infection
8 | 1485* | ≥40 | * | Viral Infection
9 | 130** | 3* | * | Cancer
10 | 130** | 3* | * | Cancer
11 | 130** | 3* | * | Cancer
12 | 130** | 3* | * | Cancer
Table III
Sample Data for the Privacy Process (raw data)
Sl. No. | IP No. | Name | Age | Gender | Job | Diagnosis
1 | 140010 | Ann | 21 | F | Dancer | Hepatitis
2 | 140011 | Emil | 32 | M | Singer | Influenza
3 | 140012 | Susanne | 23 | F | Keyboardist | Malaria
4 | 140013 | David | 34 | M | Dancer | Malaria
5 | 140014 | Jacob | 38 | M | Dancer | Hepatitis
6 | 140015 | Carolina | 27 | F | Singer | Influenza
7 | 140016 | Diana | 29 | F | Keyboardist | Hepatitis
8 | 140017 | Nathaniel | 39 | M | Music Director | Influenza
9 | 140018 | John | 42 | M | Engineer | Malaria
10 | 140019 | Mathew | 46 | M | Doctor | Influenza
11 | 140020 | Paul | 47 | M | Lawyer | Hepatitis
12 | 140021 | Robert | 43 | M | Engineer | Malaria
Table IV
Sample Data after Deleting Explicit Identifiers
Sl. No. | Age | Gender | Job | Diagnosis
1 | 21 | F | Dancer | Hepatitis
2 | 32 | M | Singer | Influenza
3 | 23 | F | Keyboardist | Malaria
4 | 34 | M | Dancer | Malaria
5 | 38 | M | Dancer | Hepatitis
6 | 27 | F | Singer | Influenza
7 | 29 | F | Keyboardist | Hepatitis
8 | 39 | M | Music Director | Influenza
9 | 42 | M | Engineer | Malaria
10 | 46 | M | Doctor | Influenza
11 | 47 | M | Lawyer | Hepatitis
12 | 43 | M | Engineer | Malaria
Table V
2-Anonymity
Sl. No. | Age | Gender | Job | Diagnosis
1 | 20-25 | F | Artist | Hepatitis
2 | 20-25 | F | Artist | Malaria
3 | 25-30 | F | Artist | Influenza
4 | 25-30 | F | Artist | Hepatitis
5 | 30-35 | M | Artist | Influenza
6 | 30-35 | M | Artist | Malaria
7 | 35-40 | M | Artist | Hepatitis
8 | 35-40 | M | Artist | Influenza
9 | 40-45 | M | Professional | Malaria
10 | 40-45 | M | Professional | Malaria
11 | 45-50 | M | Professional | Influenza
12 | 45-50 | M | Professional | Hepatitis
The natural solution for big data clustering is to develop existing clustering algorithms, such as hierarchical clustering, k-means and fuzzy clustering, so that they can handle huge data volumes [28]-[30], [61].
Common clustering methods cannot be used directly for k-anonymity, because in these methods the requirement that each cluster contain at least k records is not considered. The k-anonymity problem can naturally be cast as a clustering problem in which the goal is to find a set of clusters (equivalence classes), each containing at least k records. To maximize data quality, records within each cluster should be more similar to each other than to records of other clusters.
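As a rough illustration of this clustering view, the sketch below greedily builds k-member clusters over numeric quasi-identifiers using squared Euclidean distance. It assumes at least k input points and is only a simplified stand-in for the optimized algorithms discussed later:

```python
def k_member_clusters(points, k):
    """Greedy k-member clustering: seed a cluster with the first remaining
    point and pull in its k-1 nearest neighbours, so every cluster ends up
    with at least k members."""
    remaining = list(points)
    clusters = []
    while len(remaining) >= k:
        seed = remaining.pop(0)
        # sort the rest by squared Euclidean distance to the seed
        remaining.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, seed)))
        clusters.append([seed] + [remaining.pop(0) for _ in range(k - 1)])
    for p in remaining:  # attach leftovers (< k points) to the nearest cluster
        nearest = min(clusters, key=lambda c: sum((a - b) ** 2 for a, b in zip(p, c[0])))
        nearest.append(p)
    return clusters

# ages from Table I grouped into clusters of at least k = 2 records
print(k_member_clusters([(28,), (29,), (21,), (23,), (50,), (55,)], k=2))
```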
L- Diversity
There are basically two indicators of privacy violation in published information: one is a lack of diversity in the sensitive attribute of the table, and the other is when the attacker has strong background knowledge about the person [63]. Positive disclosure and negative disclosure are two ways in which information leaks from the table. If the attacker can identify the person's sensitive value with high probability, positive disclosure occurs; if the attacker can correctly rule out certain values of the sensitive attribute, negative disclosure has occurred. A 3-diversity table can be seen in Table VI.
Table VI
3-Diversity
Sl. No. | Age | Gender | Job | Diagnosis
1 | <30 | F | Artist | Influenza
2 | <30 | F | Artist | Hepatitis
3 | <30 | F | Artist | Malaria
4 | <30 | F | Artist | Hepatitis
5 | ≥40 | M | Professional | Malaria
6 | ≥40 | M | Professional | Malaria
7 | ≥40 | M | Professional | Influenza
8 | ≥40 | M | Professional | Hepatitis
9 | 3* | M | Artist | Influenza
10 | 3* | M | Artist | Malaria
11 | 3* | M | Artist | Hepatitis
The L-Diversity method complements k-anonymity in anonymizing a network graph. It requires that within each k-anonymous group, the k identically generalized records take at least l different values of the sensitive attribute; the L-Diversity algorithm thus represents the sensitive attribute by l sensitive values. In order to run L-Diversity, the number of changes to the network dataset must be at least log n. The attacker needs at least l-1 pieces of background knowledge, so that he can rule out l-1 sensitive values and obtain the exact sensitive attribute [25], [26].
The complexity of L-Diversity depends on the number of sensitive attributes, so increasing the number of sensitive attributes also increases its complexity. Large amounts of data are needed to achieve diversity, and each sensitive attribute must be accompanied by other sensitive values. Increasing the value of l plays an important role in reducing access to sensitive attributes through background knowledge, because the attacker then needs more information to reach the sensitive attributes directly.
The results of a k-anonymity algorithm can reveal positive or negative information relative to the attacker's background knowledge. Positive disclosure occurs when the attacker directly obtains the sensitive attribute from the k-anonymity result. Negative disclosure occurs when the sensitive attribute is easily protected from the attacker's access. The information disclosed should not exceed the attacker's background knowledge [26].
The role of L-Diversity centers on sensitive attributes: the L-Diversity algorithm does nothing in the absence of sensitive attributes, unlike the k-anonymity algorithm, which works on non-sensitive (quasi-identifier) attributes. Sensitive attributes are those whose values must be hidden in order to prevent the attacker from reaching users. The main difference between the k-anonymity and L-Diversity algorithms lies in clustering [27]. Clustering is just one step in k-anonymity, in which records are sorted based on the similarity of their quasi-identifier attributes. The degree of anonymity is the main success criterion of the k-anonymity algorithm [29], [30].
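The distinct-l-diversity condition can be checked much like k-anonymity. The sketch below implements the simple "at least l distinct sensitive values per equivalence class" reading used above:

```python
from collections import defaultdict

def is_l_diverse(rows, quasi_identifiers, sensitive, l):
    """Distinct l-diversity: every equivalence class must contain at
    least l distinct values of the sensitive attribute."""
    classes = defaultdict(set)
    for row in rows:
        key = tuple(row[a] for a in quasi_identifiers)
        classes[key].add(row[sensitive])
    return all(len(values) >= l for values in classes.values())

# The first equivalence class of Table VI holds three distinct diagnoses.
rows = [
    {"age": "<30", "gender": "F", "job": "Artist", "diag": "Influenza"},
    {"age": "<30", "gender": "F", "job": "Artist", "diag": "Hepatitis"},
    {"age": "<30", "gender": "F", "job": "Artist", "diag": "Malaria"},
    {"age": "<30", "gender": "F", "job": "Artist", "diag": "Hepatitis"},
]
print(is_l_diverse(rows, ("age", "gender", "job"), "diag", l=3))  # True
```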
Types of Attacks
The types of attacks are divided into three categories according to figure 1.
Fig. 1. Types of Attacks
1- Record Linkage Attack
In a record linkage attack, a small number of records are distinguished based on quasi-identifier values; these records make up a group. If the victim's quasi-identifier maps to this group, the attacker can identify the victim with high probability according to his background knowledge. To deal with these types of attacks, k-anonymity was the first model offered [22], [23]. Other models presented to counter the record linkage attack are (x,y)-anonymity and multi-relational k-anonymity [66]. These models counter a record linkage attack by hiding the victim's record in a group with the same QI; however, if most of the records placed in a group with the same QI share the same value of the sensitive attribute, the value of the sensitive attribute (e.g., the type of disease) can be obtained without accurately identifying the victim's record. This case falls into the category of attribute linkage attacks [24].
2- Attribute Linkage Attacks
In attribute linkage attacks, the attacker may not be able to determine the victim's tuple exactly, but by mapping the victim to a group of tuples with the same QI and the same value of the sensitive attribute, he can obtain the value of the victim's sensitive attribute with high probability. The main idea for solving this problem is to weaken the relationship between quasi-identifier values and sensitive attribute values. To this end, the L-Diversity method is provided in [25]. In this method, in each group of QIs, the sensitive attribute must take at least l different values. In this model, if l is set equal to k, k-anonymity is also guaranteed.
Other models presented to counter attribute linkage attacks are the (x,y)-Privacy and (a,k)-anonymity models [26], which largely act like the previous methods. If the sensitive attributes are not properly distributed in the data set, the introduced models cannot counter the attribute linkage attack.
Suppose that in a data set, 95% of people have colds and 5% have AIDS. If a QI group contains 50% AIDS and 50% colds, the 2-diversity condition is satisfied; yet the attacker can learn that a particular person has AIDS with 50% confidence, whereas over the whole data set an attacker could only guess a particular person's illness with 5% confidence. The t-closeness method was presented to solve this problem in [26]. In this method, the distribution of sensitive values in each group of QIs must be close to their distribution in the original data.
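The numbers in this example can be reproduced with a short sketch. It computes the attacker's confidence within a QI group, and the variational (total variation) distance between the group's sensitive-value distribution and the table-wide one, which is the quantity t-closeness bounds for categorical attributes:

```python
from collections import Counter

def confidence(group, value):
    """Attacker's confidence: fraction of the QI group holding `value`."""
    return Counter(group)[value] / len(group)

def variational_distance(group, table):
    """Total variation distance between the sensitive-value distribution
    of a QI group and that of the whole table; t-closeness requires this
    to stay below a threshold t for every group."""
    p, q = Counter(group), Counter(table)
    n, m = len(group), len(table)
    return 0.5 * sum(abs(p[v] / n - q[v] / m) for v in set(p) | set(q))

table = ["Cold"] * 95 + ["AIDS"] * 5   # overall: 95% colds, 5% AIDS
group = ["Cold", "AIDS"]               # a 2-diverse QI group
print(confidence(group, "AIDS"))       # 0.5 -> 50% confidence
print(variational_distance(group, table))  # 0.45, far from the base rate
```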
3- Table Linkage Attack
In record linkage and attribute linkage attacks, it is assumed that the attacker is aware of the victim's presence in the published table. However, sometimes the very presence or absence of a person in the table can disclose sensitive information. For example, when a hospital publishes a table of AIDS patients, knowing whether a person is in the table is equivalent to exposing a sensitive attribute. To counter this attack, the δ-presence model was presented [25]. In this model, the probability of a person's presence in the published table must be bounded within the range δ = (δmin, δmax). This model implicitly counters record linkage and attribute linkage attacks as well. The summary table at the end of the paper compares the influential parameters of these methods and algorithms.
Anonymity Operators
Basically, datasets do not meet privacy requirements without changes being made before publication. For privacy, a sequence of anonymity operators, such as generalization, suppression, permutation, anatomization and perturbation, must be applied to the dataset. Figure 2 shows each of the anonymity operators, which are briefly described below [31]-[33].
1- Generalization and Suppression
Each generalization and suppression operator partially hides the details of the quasi-identifier attributes. Generalization is applied to quasi-identifier attributes in order to anonymize the required values.
Generalization means replacing an attribute value with a more general value from its domain [19]. In the generalization operator, the value of an attribute is extended to broader categories through a classification tree; for example, in the address attribute, the value Iran is replaced by Middle East, and in the next step, Middle East can be replaced by Asia. Generalization aims to reduce the precision of the data in order to confuse unauthorized people trying to access users' data. In the anonymization process, unnecessarily extreme generalization should be avoided, because it reduces the value of the data and can even make it useless. One problem with this method is the requirement to form a classification tree for each quasi-identifier, which in most cases is done manually by an expert. In addition, classification trees formed by different people may differ, bringing different results. Also, when a cell-level generalization operator is used on a dataset, the set may no longer be suitable for classification [24]. For example, if the value of the address in some records is converted from Iran to Middle East, the range of values of the address attribute changes, and the result of applying classification to the new data may vary: some classification algorithms consider the values Iran and Middle East to be distinct, while Middle East may simply be the generalized value of Iran. This problem is called the data discovery phenomenon [25].
K-anonymized records are at risk from two types of attacks: background knowledge and data homogeneity. Suppose that in a cluster of K = 4 records, all quasi-identifier attributes are identical after generalization, and the sensitive attribute contains only the two hobbies smoking and painting. If Alice knows that Bob is in this cluster based on his quasi-identifier attributes (data homogeneity) and Alice has seen Bob buying cigarettes (background knowledge), she can directly obtain all the information about Bob [12].
Usually, when real values are replaced with generalized values, the k-anonymity prerequisite (requiring that for each vertex at least k-1 identical vertices exist) is implemented on social networks. Depending on the domain, there are different ways to generalize its values. For example, the ZIP code 47907 can be generalized to 4790*. It should be noted that only the quasi-identifiers of a table are considered as the generalization target; sensitive attributes are not changed or deleted in the published table [1], [12], [34], [57]. For instance, in figure 3, in the job classification tree, the parent node Professional is more general than its child nodes Engineer and Lawyer, and this replacement can be made for generalization. The root node ANY-JOB is the most general node in the job classification tree [34].
Fig. 3. Business Generalization Tree
For numerical attributes, any exact value can be replaced by a numeric interval covering that value. An example of the generalization tree for numerical data (age-based) is displayed in figure 4.
Fig. 4. Age Generalization Tree
In the generalization method, a value is replaced with the value of its parent node in the classification tree. In the suppression method, each value is replaced with a specific value or token stating that the replaced value must not be disclosed. The suppression operator should be used carefully, as a large number of quasi-identifiers with missing values reduces the quality of the dataset [12].
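Both operators can be sketched as small helper functions. The masking depth and interval width below are illustrative assumptions, not values prescribed here:

```python
def generalize_zip(zip_code, keep=4):
    """Replace the trailing digits of a ZIP code with '*', e.g. 47907 -> 4790*."""
    return zip_code[:keep] + "*" * (len(zip_code) - keep)

def generalize_age(age, width=10):
    """Replace an exact age with the numeric interval covering it (cf. figure 4)."""
    low = (age // width) * width
    return f"{low}-{low + width}"

SUPPRESSED = "*"  # suppression token: the replaced value must not be disclosed

print(generalize_zip("47907"))  # 4790*
print(generalize_age(47))       # 40-50
```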
2- Record Suppression
Record suppression is another important technique often used for k-anonymity. Suppressing a record means that the entire record is removed from the table. It is important to note that although record suppression reduces the number of records in the table (and thus causes some data loss), this method can often increase the overall data quality [35]. Suppose, for example, that we want to make a table k-anonymous with respect to a quasi-identifier Q, and that the table contains n = k + 1 records, one of which is an outlier with respect to Q. Without suppressing that record, all n records would have to be generalized until the k-anonymity prerequisite of the table is met. But if the outlying record is removed from the table, no other record needs to be changed; therefore, a much less generalized k-anonymous table can be achieved by eliminating a limited number of outliers [36].
A comprehensive search for an optimal answer to the k-member clustering problem, like many clustering problems, takes exponential time. The clustering solution sorts records based on the similarity of their quasi-identifier attributes. For example, when there are two sensitive attributes, diversity should be applied to both of them, which increases the lost information. An innovative solution is to combine the two sensitive values into one [37], [38].
The next step is to group the combined rows. The idea is to group rows based on the composition of their sensitive values, and then select L rows with differently combined sensitive values from each group to establish the L-Diversity condition. In this step, clustering begins and the rows are sorted again based on the similarity of their quasi-identifier attribute values [39].
In the next step, in order to reduce the lost data, each row is placed in a cluster together with the rows most similar to it. For example, suppose the application requires anonymity of k = 10 and diversity of L = 5. First, the sensitive values are divided into groups so that the number of distinct sensitive values in each group equals L = 5. We then select 5 rows of the first group and put them in a cluster in such a way as to achieve the maximum similarity of quasi-identifiers within the cluster. To meet the k-anonymity condition, the cluster needs 5 more rows; to supply them, we select one row from each group and add them to the cluster. All clusters are created the same way. Finally, if the number of remaining records is less than k, each record is added to the cluster whose quasi-identifier attributes are most similar to its own [40]. Many clustering algorithms and applications require the concept of a cluster center. The center of a cluster is usually a point in the domain of the dataset that represents the cluster. When the domain is a vector space over real coordinates, the most natural choice for the cluster center is the average vector of all points in the cluster [15].
3- Permutation and anatomization
Contrary to generalization and suppression, the permutation method does not change the values of quasi-identifiers or sensitive attributes; instead, it tries to break the link between the two. More precisely, this method releases the quasi-identifier attributes in one table and the sensitive attributes in another table. Both tables share a group-ID attribute, and all records in the same group have the same group-ID value in both tables [9], [19].
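A minimal sketch of anatomization follows: it splits a table into a QI table and a sensitive table that share only a group ID. The grouping here is random and fixed-size for simplicity; a real anatomizer would form groups satisfying a diversity requirement:

```python
import random

def anatomize(rows, quasi_identifiers, sensitive, group_size):
    """Cut the record-level link between QIs and sensitive values by
    publishing them in two tables joined only by a group ID."""
    rows = rows[:]
    random.shuffle(rows)  # so groups do not follow the input order
    qi_table, s_table = [], []
    for gid, start in enumerate(range(0, len(rows), group_size)):
        for row in rows[start:start + group_size]:
            qi_table.append({a: row[a] for a in quasi_identifiers} | {"gid": gid})
            s_table.append({sensitive: row[sensitive], "gid": gid})
    return qi_table, s_table

qi_t, s_t = anatomize(
    [{"zip": "13053", "age": 28, "disease": "Heart Disease"},
     {"zip": "13068", "age": 29, "disease": "Viral Infection"}],
    ("zip", "age"), "disease", group_size=2)
print(qi_t, s_t, sep="\n")
```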
4- Perturbation
The main idea of the perturbation method is to add synthetic data to the original data, so that the statistical information obtained from the new data does not differ much from the statistical attributes of the original data. This method has two main problems compared to the methods introduced above. When data is changed in this way, it becomes meaningless to humans and practically useless in many data mining applications. Another problem is the high overhead of the synthetic data, which cannot be tolerated in big data applications that already deal with large amounts of data. In the following, some data perturbation methods are briefly described [41].
A- Additive Noise
Additive noise methods are often used for numerical data such as salaries. The main idea of this solution is to add a random value r, drawn from a specific distribution, to the sensitive attribute s, and finally to publish s + r [42], [43].
B- Data Swapping
The main idea of the data swapping method is to anonymize the data set by moving the values of sensitive attributes between different records. This swapping should be done in such a way that the statistical properties of the dataset remain unchanged [19].
C- Artificial Data Production
In the artificial data production method, a statistical model is extracted from the data and points are then sampled from this model. To publish data, the sampled points are published instead of the original data [19].
Anonymity Algorithms
Considering the attack models and anonymity operators, various data anonymization algorithms have been created; figure 5 introduces these algorithms, which are described below.
1- Anonymity of data streams based on Perturbation
Many privacy algorithms use data deformation. For this purpose, they try to reduce the granularity level of the data, which reduces the usefulness of data mining algorithms to some extent; therefore, there is always a trade-off between privacy and loss of information value. A set of algorithms for the anonymization of data streams was introduced in [25], [64]. In these algorithms, data is mixed with random noise drawn from a statistical distribution. The random noise should be selected to have the least effect on the statistical attributes of the original data. The two main categories of this strategy are reviewed below [65]-[68].
Additive Perturbation
In the additive perturbation method, a private dataset is considered as Equation (2):
D = {d1, d2, ..., dn}    (2)
and for each di ∈ D, random noise ri drawn from a well-known statistical distribution, such as the uniform or Gaussian distribution, is added to the data. Finally, the data set D' is provided to the data miners as Equation (3) [69].
D' = {d1 + r1, d2 + r2, ..., dn + rn}    (3)
In order to estimate the value of di, data miners apply an Expectation Maximization algorithm to di + ri. This randomization method applies to many data mining applications, such as classification and association rule mining.
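Equation (3) amounts to a one-liner. The sketch below uses Gaussian noise; the distribution and its parameters are illustrative choices:

```python
import random

def additive_perturbation(values, sigma=1.0):
    """Publish d_i + r_i with Gaussian noise r_i ~ N(0, sigma^2), cf. Equation (3)."""
    return [d + random.gauss(0.0, sigma) for d in values]

salaries = [52_000, 48_500, 61_200]
print(additive_perturbation(salaries, sigma=1_000))
```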
Multiplicative Perturbation
One of the alternatives proposed to additive perturbation is the multiplicative perturbation method [70]. In this method, two common solutions are derived from statistics. In the first, all components di of D are multiplied by a random number drawn from a Gaussian distribution with mean μ (usually one) and variance σ².
In the second, the data set D is first transformed with the natural logarithm function, so that the transformed components are zi = ln(di). Then, to each transformed component zi, random noise ri is added, drawn from a multivariate Gaussian distribution with mean zero and covariance Σ = cΣz, where 0 < c < 1 and Σz is the covariance of the transformed components zi. The data published for the data miners is as Equation (4).
D' = {exp(z1 + r1), exp(z2 + r2), ..., exp(zn + rn)}    (4)
These multiplicative transformations preserve the mean and variance of the original data, but cannot preserve the Euclidean distances and inner products of the original data. A suitable solution to this problem is provided in [71]-[73].
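The second multiplicative scheme can be sketched as follows. For simplicity, this univariate version uses the variance of the log-transformed values in place of the multivariate covariance cΣz described above:

```python
import math
import random

def multiplicative_perturbation(values, c=0.5):
    """Transform z_i = ln(d_i), add zero-mean Gaussian noise with variance
    c * var(z), and publish exp(z_i + r_i), cf. Equation (4)."""
    z = [math.log(d) for d in values]
    mean_z = sum(z) / len(z)
    var_z = sum((v - mean_z) ** 2 for v in z) / len(z)
    sigma = math.sqrt(c * var_z)
    return [math.exp(v + random.gauss(0.0, sigma)) for v in z]

print(multiplicative_perturbation([52_000, 48_500, 61_200], c=0.1))
```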
The advantage of perturbation-based privacy methods lies in the data collection process: the amount of noise added to each record is independent of subsequent observations. However, these methods have the following serious problems:
- High noise levels make it very difficult to analyze the anonymized data.
- These methods can only anonymize numerical data and cannot be used for other types of data.
2- Anonymity of Data Streams based on Tree Structure
Another category of anonymity algorithms for data streams consists of algorithms based on a tree structure. In this context, algorithms such as SWAF, SKY and KIDS [74] have been presented, which have almost the same structure. In the following, the SWAF algorithm is introduced.
SWAF Anonymity Algorithm
In figure 6, the overall structure of the SWAF algorithm is shown. An important part of this framework is a specialization tree composed of a set of generalization nodes. In the initial phase of the framework, the sliding window is treated as a static data set and the specialization tree is obtained by running a procedure.
When new tuples arrive at the system, the following steps are taken to update the sliding window:
A. The new tuple tnew is added to the window and an old tuple is removed from it. To generalize tnew, the specialization tree is searched to find the most specific generalization node g for that tuple. This node must be chosen so that none of its children is a generalization of tnew. The tuple is then added to the frequency set of node g [75]-[76].
B. The tree is changed so that this new node is added to it. If the number of tuples in a node reaches k, the tuples are published at the specified time.
C. When an old tuple is removed from the sliding window, it should also be removed from the frequency sets of the generalization nodes of which it was a member.
There are two solutions for nodes whose tuples have been published. In the first, such nodes are removed and the generalization tree is updated. The time complexity of this approach is O(|S|δ log δ) and its space complexity is O(δ). Despite the relatively low complexity, the removal of nodes makes the data loss rate very high, so this approach cannot be used in practice.
In the second solution, nodes whose tuples have been published are kept for reuse, which reduces the data loss rate; but the time and space complexity then becomes proportional to the data stream size and equals O(|S|² log |S|). This level of complexity is very high and unacceptable for anonymity algorithms on big data streams.
3- Anonymity of Data Streams Based on the Addition of Synthetic Data
In contrast to the methods introduced above, in this group of methods the quasi-identifier attributes of the data stream remain unchanged. Data privacy is maintained by adding synthetically generated data to the original data.
4- Delay-Free Anonymization Framework
In the delay-free (DF) anonymization method [46], [62], Qi denotes the tuple of quasi-identifier attributes, Qi = {qi1, qi2, ..., qii}, and si denotes the sensitive attribute of a tuple. The data stream is then regarded as a set of tuples (Qi, si).
The main purpose of this method is to build an L-Diverse data stream from the original data stream in real-time. It ensures that the probability of guessing the sensitive attribute of a particular person in the data stream is less than 1/l. To protect data privacy, the main idea of this method is the anatomy technique, in which sensitive attributes are separated from quasi-identifier attributes. Each si is then converted into an l-diverse set by adding several synthetic tuples.
Considering figure 7, suppose that a tuple t(QIt, SIt) has arrived at the system. The DF framework produces and publishes a record QIt representing the quasi-identifier attributes of t, as well as a set ST that includes SIt and synthetically generated values. ST and QIt are linked by a group-ID key. For a better understanding of this method, an example is given below [78]-[79].
Fig. 7. DF Executive Framework[65].
Suppose that the data stream (age, sex and diagnosis) is generated by a hospital and the anonymized stream produced by DF must satisfy the 2-diversity requirement. At some moment, the tuple t1 = ((24, male), Diag. A) arrives at the system. DF produces the tuple QIT = (1, (24, male)), in which 1 is a random number representing the group ID. The sensitive attribute Diag. A is converted into the ST set of two tuples, (Diag. A, 1, 1) and (Diag. B, 1, 1), thus meeting the 2-diversity requirement; the attacker cannot guess the value of the sensitive attribute with a probability of over 50%.
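The example can be mirrored in a short sketch of the DF release step. The disease domain and the (value, group ID, 1) record layout follow the example above; everything else is an illustrative assumption:

```python
import itertools
import random

DISEASES = ["Diag. A", "Diag. B", "Diag. C"]  # assumed sensitive-value domain
_gid = itertools.count(1)

def df_release(qi, sensitive, l=2):
    """Release one tuple immediately: publish the QIs under a fresh group ID
    and an l-diverse sensitive set padded with counterfeit values."""
    gid = next(_gid)
    qi_record = (gid, qi)
    fakes = random.sample([d for d in DISEASES if d != sensitive], l - 1)
    st_records = [(v, gid, 1) for v in [sensitive] + fakes]  # counterfeits included
    return qi_record, st_records

print(df_release((24, "male"), "Diag. A", l=2))
```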
The advantages of the delay-free anonymization framework are summarized as follows:
A. Immediate release: DF facilitates tuple-by-tuple release with an l-diversity guarantee. DF is characterized by no accumulation delay and a low computation cost, since it releases the data stream immediately and requires only simple operations. Immediate release is a big advantage for applications that use real-time data streams [80].
B. High-level data utility: to anonymize data streams, DF artificially generates l-diverse SI values instead of generalizing QIs. Thus, DF does not incur the typical information loss on QIs. In addition, the number of counterfeits in the sensitive table is minimized by late validation [81]-[83].
Despite the good efficiency of this method in terms of delay-free data publication, it has two major drawbacks.
A. Adding synthetic data reduces the usefulness of the data once it is used, and the results of data mining on such data are inappropriate.
B. In environments that deal with big data streams, the volume of the original data is inherently high, so the excessive overhead of synthetic data cannot be tolerated [84].
5- Anonymity of Data Streams Based on Fuzzy Algorithms
A fuzzy-logic-based method for protecting the privacy of data streams is presented in [46]. In this method, the values of sensitive attributes in the data stream are converted into fuzzy values. This conversion protects the privacy of sensitive data in the stream. The fuzzy values of the sensitive attributes are added as columns to the structure of the records of that data stream. For example, for a data stream of patients' information in a hospital and the disease of each patient, its fuzzy structure in the time window T can be as in Table VII. In this table, the columns related to the gender and age of people, which are non-sensitive, are displayed as normal, and the columns related to diseases are displayed in fuzzy form. To recover each person's illness, the fuzzy membership matrix is needed.
Table VII
Fuzzy Anonymity Data Sample [82] (Pid, Gender and Age are non-sensitive; the remaining columns are fuzzy values of sensitive data)
Pid | Gender | Age | Diseases Name | Name of Patient | Diagnosis | Overall Fuzzy Value
120 | M | 23 | 0.6 | 0.4 | 0.6 | 0.5
121 | F | 30 | 0.3 | 0.5 | 0.4 | 0.4
122 | M | 25 | 0.8 | 0.2 | 0.5 | 0.5
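A minimal sketch of the fuzzification step is shown below. It assumes a simple position-based membership mapping over a private ordering of the disease domain, standing in for whatever membership functions the method in [46] actually uses:

```python
def fuzzify(value, domain):
    """Map a sensitive categorical value to a membership degree in (0, 1)
    using its position in a fixed, privately held ordering of the domain."""
    return round((domain.index(value) + 1) / (len(domain) + 1), 2)

diseases = ["Influenza", "Hepatitis", "Malaria"]  # the private ordering
print(fuzzify("Hepatitis", diseases))  # 0.5 -- meaningless without the ordering
```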
6- Anonymity of Data Streams Based on Clustering
In the clustering-based anonymization approach, tuples are placed in clusters so that each cluster contains at least k tuples. These tuples are then published using the generalization of their cluster. In the following, the algorithms in this field are discussed.
Anonymity Algorithms
In the following, the five anonymity algorithms shown in figure 8 are investigated.
1- CASTLE Anonymity Algorithm
CASTLE is the first data stream anonymity algorithm with both k-anonymity and L-Diversity properties [39], [45], [51]. The main mechanism of this algorithm for data anonymity is a continuous clustering process.
In this algorithm, two sets of clusters are kept:
- k-anonymized clusters
- non-k-anonymized clusters
The k-anonymized clusters are reused to publish data, and the clusters in the non-k-anonymized set, depending on the number of tuples arriving at the system, are merged or split to create new k-anonymized clusters used for publishing tuples. By using the set of k-anonymized clusters, CASTLE significantly reduces the data loss rate while maintaining data anonymity, though the algorithm has the following fundamental disadvantages [53].
Disadvantages of CASTLE Anonymity Algorithm
- This algorithm does not limit the size of the k-anonymized cluster set, so this set grows linearly with the data stream: for every tuple of the stream, it must be checked whether a suitable cluster exists. Therefore, the overall complexity of this algorithm is O(|S|²), which is very high for data streams.
- In the CASTLE algorithm, the size of each cluster is not limited either; consequently, a cluster can grow in proportion to the size of the data stream |S|. Since CASTLE selects the cluster with the lowest increase in data loss rate, the largeness of clusters increases the computation and keeps the time and space complexity at order O(|S|²).
- In this algorithm, the L-Diversity property is not strong enough. CASTLE only checks that there are l different values of the sensitive attribute in a cluster; but, as studied in [45], besides the l different sensitive values per cluster, the number of tuples with the same sensitive value in each cluster should also be limited relative to the cluster size |C|.
As mentioned above, CASTLE forms a few very large clusters in which all tuples of the same data stream are placed, so these clusters intermittently need to be split into smaller clusters, which is a time-consuming operation. To prevent the over-enlargement of clusters, a simple solution called B-CASTLE is provided in [45]. In this algorithm, a threshold α is considered for the cluster size: tuples are added to a cluster only if its size is less than α. With this threshold, the size of clusters in B-CASTLE is much more appropriate than in CASTLE; however, because of the unlimited size of the k-anonymized cluster set, the complexity of this algorithm remains O(|S|²), like CASTLE.
2- FAANST Anonymity Algorithm
Unlike the CASTLE algorithm, which publishes data continuously, the FAANST algorithm [46] publishes data at certain intervals, as long as the delay remains acceptable to the system. For this purpose, the algorithm keeps the data in a buffer before publication. For publication, each tuple is first checked against the existing clusters, so that it can be published using a cluster's generalization if a suitable one exists. The remaining tuples that do not fit any existing cluster are divided into clusters using clustering algorithms such as k-means. Clusters containing more than k tuples are published immediately, and if a cluster has a small loss rate, it is added to the existing cluster collection. The remaining clusters are kept for the next publication rounds. Because this method uses a clustering algorithm and cluster sizes are O(k), it works faster than the CASTLE method.
Disadvantages of FAANST Anonymity Algorithm
- In this algorithm, as in CASTLE, there is no limit on the set of clusters, so its time and space complexity is O(|S|²), which is not appropriate for data streams.
- This method does not support categorical data; only numerical data has been investigated in this study.
3- FADS Anonymity Algorithm
FADS [46], [51] is a cluster-based algorithm designed to anonymize data streams. This algorithm reads at most δ input tuples and stores them in a buffer. Then, a cluster is formed for every k tuples, and the formed clusters are kept in one set for later reuse. The main difference from the FAANST and CASTLE algorithms is the introduction of a time limit (TKC): this time limit is allocated to each cluster at its formation, and when it expires, the cluster is removed from the set of existing clusters.
This solution prevents the rapid growth of the cluster set and reduces the complexity of the algorithm compared to previous methods. The time complexity of this algorithm is O(|S|), i.e., linear in the size of the original data stream. The space complexity of the FADS algorithm is of order O(C), where C is a constant. Although the algorithm has good efficiency for the anonymization of data streams, it also has disadvantages.
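The TKC bookkeeping that keeps the cluster set from growing with the stream can be sketched as follows; the class layout and wall-clock expiry are illustrative, and the exact expiry rule in FADS may differ:

```python
import time

class ClusterSet:
    """Reusable k-anonymized clusters, each with a time-to-live (TKC);
    expired clusters are dropped so the set cannot grow with the stream."""

    def __init__(self, tkc_seconds):
        self.tkc = tkc_seconds
        self.clusters = []  # list of (creation_time, generalization)

    def add(self, generalization):
        self.clusters.append((time.time(), generalization))

    def usable(self):
        """Prune expired clusters, then return the reusable generalizations."""
        now = time.time()
        self.clusters = [(t, g) for t, g in self.clusters if now - t < self.tkc]
        return [g for _, g in self.clusters]
```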
Disadvantages of FADS Anonymity Algorithm
- This algorithm does not check the time remaining for tuples in each round of execution, which can lead to data being released after the permitted time; this increases the data loss rate and takes the system out of real-time mode.
- This algorithm is designed and implemented in a centralized mode; therefore, it is not appropriate for the anonymization of big data streams.
4- TPTDS Algorithm
The algorithms studied so far are not applicable, due to their lack of efficiency, to big data and computing environments such as cloud computing or the Internet. In the Two-Phase Top-Down Specialization (TPTDS) method, a top-down specialization algorithm based on MapReduce on a cloud platform is used to anonymize the data [47]-[49]. For optimal use of the parallel processing power of cloud computing, the specialization process is divided into two separate phases. In the first phase, the original dataset is divided into smaller data sets, which are anonymized in parallel to produce intermediate results. In the second phase, the intermediate results are integrated and processed to reach k-anonymity. MapReduce jobs are used in both phases: a group of map and reduce operators is designed within the Hadoop framework to handle the data anonymization. With this approach, the TPTDS algorithm has a considerable advantage over other work in this field in terms of scalability and efficiency.
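The two-phase idea can be illustrated with ordinary multiprocessing in place of MapReduce. The sketch below generalizes one numeric quasi-identifier per partition in parallel (phase 1) and merges the intermediate results (phase 2); it is a toy stand-in, not the TPTDS specialization procedure itself:

```python
from multiprocessing import Pool

def partition_interval(ages):
    """Phase 1 (toy stand-in): generalize one partition of a numeric QI to
    the interval covering it; real TPTDS runs top-down specialization jobs."""
    return (min(ages), max(ages))

def two_phase_anonymize(ages, partitions=4):
    chunks = [ages[i::partitions] for i in range(partitions)]
    with Pool(partitions) as pool:
        intermediate = pool.map(partition_interval, chunks)  # phase 1: parallel
    # phase 2: merge intermediate results into one consistent generalization
    return (min(lo for lo, _ in intermediate), max(hi for _, hi in intermediate))

if __name__ == "__main__":
    print(two_phase_anonymize([21, 32, 23, 34, 38, 27, 29, 39, 42, 46, 47, 43]))
```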
Disadvantages of TPTDS Anonymity Algorithm
- The execution cost of the algorithm is high.
- In the initial phases, the response rate and anonymity management of the algorithm are weak.
- The algorithm is slow.
5- FAST Algorithm
FAST [46], [51] is a cluster-based algorithm like FADS. It reads up to δ input tuples and selects the first tuple together with k-1 others to form a cluster. It generalizes the formed cluster and compares its data loss with that of the set of k-anonymized clusters the algorithm has kept up to this point. If the cluster's own data loss is lower, the first tuple is published by generalization along with the other k-1 tuples; otherwise, the first tuple is published with the best generalization observed so far, and the algorithm is re-run for the remaining k-1 tuples. Also, in order to keep the system real-time, an expiration time parameter is computed in the algorithm; if this time runs out for a tuple, the tuple is published immediately using the suppression method.
Disadvantages of FAST Anonymity Algorithm
- Considerable time is wasted in the algorithm's phases because of the use of the KNN algorithm for clustering input tuples and the re-execution of the algorithm for the returned k-1 tuples.
- Data loss increases for expiring tuples because of the combined use of the generalization and suppression methods.
According to Table VIII, algorithms such as SWAF and DF are not inherently appropriate or applicable in the big data domain; they act mainly on numerical data and have a high data loss rate. In contrast, algorithms such as FAST and TPTDS can be implemented in the big data environment and have good delay and data loss rates [54]-[56].
Table VIII
Comparison Table of Anonymity Algorithms
Category | Algorithm/Parameter | Data Loss Rate | Complexity | Types of Data | Generates Additional Data | Appropriate for Big Data | Delay Rate
Perturbation-Based | Additive Perturbation | Very High | O(|S|) | Numerical | Yes | Rather | Low
Perturbation-Based | Multiplicative Perturbation | Very High | O(|S|) | Numerical | Yes | Rather | Low
Based on Tree Structure | SWAF | Very High | O(|S|² log |S|) | Numerical and Categorical | No | Inappropriate | High
Based on Artificial Data and Noise | DF | Very High | -- | Numerical and Categorical | Yes | Inappropriate | Very Low
Based on Clustering | CASTLE | Medium | O(|S|²) | Numerical | No | Inappropriate | High
Based on Clustering | FAANST | Medium | O(|S|²) | Numerical and Categorical | No | Inappropriate | High
Based on Clustering | FADS | Medium | O(|S|) | Numerical and Categorical | No | Inappropriate | High
Based on Clustering | TPTDS | Low | O(|S|) | Numerical and Categorical | No | Appropriate | Not Important
Based on Clustering | FAST | Low | O(|S|) | Numerical and Categorical | No | Appropriate | Low
Table VIII (continued)
Category | Algorithm | Data Loss Rate | Complexity | Types of Data | Generates Additional Data | Appropriate for Big Data | Delay Rate
Based on Artificial Data and Noise | Victor, et al. [7] | Medium | -- | Numerical and Categorical | No | Rather | Very Low
Graph-Based Social Networks | Macwan, et al. [28] k-anonymity | Low | O((V+E) log k) | Numerical and Categorical | No | Rather | High
Graph-Based Social Networks | Macwan, et al. [28] k-candidate | Low | O((V+E) log k) | Numerical and Categorical | No | Rather | High
Graph-Based Social Networks | Macwan, et al. [28] k-degree | Low | O((V+E) log k) | Numerical and Categorical | No | Rather | High
Graph-Based Social Networks | Kiabod, et al. [50] | Low | O((V+E) log k) | Numerical and Categorical | No | Rather | High
Graph-Based Social Networks | Zheng, et al. [12] | Low | O((V+E) log k) | Numerical and Categorical | No | Rather | Very Low
Based on Clustering | Tekli, et al. [39] | Medium | -- | Numerical and Categorical | No | Appropriate | Very Low
Based on Clustering | Otgonbayar, et al. [51] | Medium | -- | Numerical | No | Appropriate | High
Based on Clustering | Kaur, et al. [52] | High | -- | Numerical and Categorical | No | Appropriate | High
Based on Clustering | Silva, et al. [44] | Medium | -- | Numerical and Categorical | No | Appropriate | Very Low
Based on Clustering | Wang, et al. [32] | Low | -- | Numerical and Categorical | No | Appropriate | High
Based on Clustering | Mahta, et al. [11] | Low | -- | Numerical and Categorical | No | Appropriate | High