Uncertain Data Privacy Protection Based on K-Anonymity via Anatomy

In the traditional database domain, k-anonymity is a hot topic in privacy-preserving data publishing. In this paper, we study how to apply k-anonymity to uncertain data sets. We use an influence matrix of background knowledge to describe the degree of influence on the sensitive attribute produced by the QI attributes and by the sensitive attribute itself, use BK(L,K)-clustering to form equivalence classes with diversity, and propose a novel UDAK-anonymity model via anatomy for relational uncertain data. In another paper, we will extend these ideas to address the privacy information leakage problem using UDAK-anonymity algorithms.


Introduction
With the rise of data mining technology and the emergence of data streams, uncertain data, and related technologies, individual and enterprise data may be leaked at any moment, so data security has become a main topic of information security. With the development of sensor networks, Web services, and RFID in recent years, uncertain data has become ubiquitous in areas such as the economy, the military, logistics, finance, and telecommunications. Uncertain data management and privacy protection have become an important and active research direction [1].
K-anonymity [2], a model put forward by Samarati P and Sweeney L in 1998 to avoid privacy leaks, requires that the published data contain a certain number of indistinguishable individuals, so that an attacker cannot single out the individual to whom private information belongs, which prevents the leak of individual privacy. K-anonymity has attracted wide attention in academia, and many scholars have studied and developed the technique at different levels. However, it is a privacy protection model for deterministic data. Research on uncertain data publishing based on k-anonymity is currently limited, and a new model is needed to represent k-anonymity privacy protection of uncertain data.
Charu C. Aggarwal [3] presents an uncertain version of the k-anonymity model, which has the additional feature of introducing greater uncertainty for the adversary over an equivalent deterministic model. He tests the effectiveness of the privacy transformation on the problems of query estimation and classification, and shows that the technique retains greater accuracy than other k-anonymity models. Wu Jiawei et al. explore several new modeling methods [4]. A model space consisting of the $K_{attr}$, $K_{tuple}$, $K_{upperlower}$, and $K_{tree}$ models is built: the $K_{attr}$ model uses attribute-ors to describe the uncertainty in the quasi-identifier (QI) attribute values of the k-anonymity privacy protection model, while the $K_{tuple}$ model takes QI values as relations and uses tuple-ors to describe the relations. The completeness and closure of these models are discussed.
This paper explores a new k-anonymity privacy protection model for relational uncertain data via anatomy.

…
An is a generalization sequence or a functional generalization sequence [6]. Fig. 1 provides an example of generalization hierarchies.
Influence matrix based on background knowledge. Background knowledge describes the influence on a variety of SI produced by QI [7,8]. Background knowledge can be acquired from domain experts, or by analyzing the basic data directly. We use a relation and sensitive degree matrix $M_S$ to describe the degree of influence on SI produced by QI and by SI itself, with the following notation:
$t_{ij}$: the degree of influence on the j-th SI produced by the i-th QI.
$b_i$: the weight of the i-th SI value.
The influence matrix $M_S$ has m rows and n+1 columns, where m is the number of SI and n is the number of QI attributes; the matrix is as follows:
Advanced Engineering Forum Vols.6-7
The weights $t_{ij}$ and $b_i$ are specified by experts or taken from experience; for example, we can divide the weight of QI in Table 1 …
Constructs for Uncertainty. There are two different constructs for u-tuples (uncertain tuples) [11,12]:
attribute-ors: An attribute-or in a u-tuple specifies a set of alternative values for an attribute. For example, t1 contains an attribute-or in its first field and represents one of two possible tuples: (Bachelor, insomnia) or (Master, insomnia).

Information Technology for Manufacturing Systems III
tuple-ors: A tuple-or in a u-tuple specifies a set of possible tuples. For example, the uncertainty in the previous example can also be represented by:
Generalize QIs that include uncertain data with the attribute-ors construct, and divide the uncertain SI into two or more SI fields. For example, Table 6 is the original data table including uncertain data with the attribute-ors construct, and Table 7 (the deterministic uncertain data table) is the deterministic data table obtained from Table 6 by generalization and partition:
{Bachelor, Master} → University in t1;
{obesity, flu} → t2[disease1] = obesity, t2[disease2] = flu in t2;
{Master, Ph.D} → University, {41, 48} → max{41, 48} = 48, and {short breath, obesity} → t4[disease1] = short breath, t4[disease2] = obesity in t4.
In Table 7, t1[QIID] = 22 means that the second field (Education) holds the generalized value of uncertain data according to the generalization hierarchies (Fig. 1) and that it covers two uncertain values (two child nodes); t2[QIID] = SI2 means that the uncertain SI attribute was divided into two SI fields, t2 …
Similarly, the UDAK-anonymity model can deal with the tuple-ors construct of uncertainty; owing to limited space, we do not discuss it in this paper.
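The generalization and partition step illustrated with Tables 6 and 7 can be sketched as follows. This is only a minimal illustration under stated assumptions: the `PARENT` map is a hypothetical one-level hierarchy (only the mappings to University come from the text), the helper names `generalize` and `preprocess` are ours, the field position of Age (3) is assumed, and QIID is kept as a list of labels rather than a single value.

```python
# Hypothetical one-level generalization hierarchy (child -> parent).
# Only these three mappings to "University" are taken from the text.
PARENT = {"Bachelor": "University", "Master": "University", "Ph.D": "University"}


def generalize(alternatives):
    """Numeric alternatives -> their maximum; categorical -> shared parent node."""
    if all(isinstance(v, (int, float)) for v in alternatives):
        return max(alternatives)
    parents = {PARENT[v] for v in alternatives}
    if len(parents) != 1:
        raise ValueError("alternatives do not share a parent in this sketch")
    return parents.pop()


def preprocess(u_tuple, qi_positions, si_name):
    """Turn one u-tuple (a dict; attribute-ors are lists) into a deterministic row."""
    row, labels = {}, []
    for name, value in u_tuple.items():
        if name == si_name:
            continue
        if isinstance(value, list):
            # Generalize the uncertain QI and record "<field position><number of alternatives>".
            row[name] = generalize(value)
            labels.append(f"{qi_positions[name]}{len(value)}")
        else:
            row[name] = value
    si = u_tuple[si_name]
    si_values = si if isinstance(si, list) else [si]
    # Partition the uncertain SI into disease1, disease2, ...
    for idx, v in enumerate(si_values, start=1):
        row[f"{si_name}{idx}"] = v
    if len(si_values) > 1:
        labels.append(f"SI{len(si_values)}")
    row["QIID"] = labels  # assumption: a list of labels, one per uncertain field
    return row


# t4 as described in the text; Education is the second field, Age position is assumed.
t4 = {"Education": ["Master", "Ph.D"], "Age": [41, 48],
      "disease": ["short breath", "obesity"]}
row_t4 = preprocess(t4, {"Education": 2, "Age": 3}, "disease")
```

Keeping the number of alternatives in the label is what lets the deterministic table retain a trace of the original uncertainty after the values themselves have been generalized.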


Conclusion
This paper proposed a specific modeling method for k-anonymity privacy protection of uncertain data via anatomy, and presented a new k-anonymity privacy protection model, UDAK-anonymity. The UDAK-anonymity model not only keeps the characteristics of uncertain data, but also provides more useful information for the user and improves the utility of uncertain data. Owing to limited space, we will extend our ideas to address the privacy information leakage problem using UDAK-anonymity algorithms in another paper.
vector in influence matrix $M_S$, L is the number of different sensitive attribute values, and L makes the sensitive attribute diverse; otherwise, further generalization or suppression is applied. We say T satisfies BK(L,K)-clustering.
BK(L,K)-anonymity with anatomy
Definition 5. BK(L,K)-anonymity with anatomy ((L,K)-anonymity with anatomy based on influence matrix of background knowledge).
Definition 1. k-anonymity. Let RT($A_1$, ..., $A_n$) be a table and $QI_{RT}$ be the quasi-identifier associated with it. RT is said to satisfy k-anonymity if and only if each sequence of values in RT[$QI_{RT}$] appears with at least k occurrences in RT[$QI_{RT}$].
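Definition 1 can be checked mechanically by counting how often each combination of QI values occurs. The sketch below is a minimal illustration; the function name `is_k_anonymous` and the sample rows are hypothetical and are not the contents of Table 1.

```python
from collections import Counter


def is_k_anonymous(rows, qi, k):
    """Return True if every sequence of QI values occurs at least k times."""
    counts = Counter(tuple(row[a] for a in qi) for row in rows)
    return all(c >= k for c in counts.values())


# Hypothetical generalized records (illustrative only, not the paper's Table 1).
QI = ["Education", "Age", "Sex", "ZIP"]
rows = [
    {"Education": "University", "Age": "[20,30)", "Sex": "F", "ZIP": "1300*", "Disease": "flu"},
    {"Education": "University", "Age": "[20,30)", "Sex": "F", "ZIP": "1300*", "Disease": "obesity"},
    {"Education": "Secondary", "Age": "[40,50)", "Sex": "M", "ZIP": "1301*", "Disease": "flu"},
    {"Education": "Secondary", "Age": "[40,50)", "Sex": "M", "ZIP": "1301*", "Disease": "insomnia"},
]

two_anonymous = is_k_anonymous(rows, QI, 2)
three_anonymous = is_k_anonymous(rows, QI, 3)
```

With these rows each QI combination appears exactly twice, so the table is 2-anonymous but not 3-anonymous.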

Table 1
Example of k-anonymity, where k=2 and QI={Education, Age, Sex, ZIP}
Fig. 1 Generalization hierarchies of {Education, Age, Sex}
The weight of QI in Table 1 is divided into 5 grades, 1, 0.8, 0.4, 0.1, and 0, and the weight of S in Table 1 is divided into 5 grades, 0.10, 0.30, 0.50, 0.80, and 0.90. Flu is a common ailment, so its disease weight can be 0.11; because flu tends to break out locally, its ZIP weight is 0.8, its Sex weight is 0.2, etc. The disease weight of obesity can be 0.12: the base weight of both flu and obesity is 0.1, and the offsets 0.01 and 0.02 distinguish the two ailments. The disease weight of short breath is 0.31, and the major diseases lung cancer, mammary cancer, and AIDS use 0.91, 0.92, and 0.93; different diseases must have different disease weights. The relation and sensitive degree matrix based on Table 1 is then as follows:
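As a minimal sketch of how the weights quoted above might be encoded (this is not the matrix of Table 1 itself), the disease weights can be stored per sensitive value and a matrix row formed as n influence degrees followed by the weight b. The names `grade_of` and `influence_row`, the QI order (Education, Age, Sex, ZIP), the position of b as the last column, and the zero placeholders for Education and Age are all assumptions.

```python
# Grades quoted in the text for the sensitive-attribute weight.
SI_GRADES = [0.10, 0.30, 0.50, 0.80, 0.90]

# Disease weights b quoted in the text.
B = {"flu": 0.11, "obesity": 0.12, "short breath": 0.31,
     "lung cancer": 0.91, "mammary cancer": 0.92, "AIDS": 0.93}


def grade_of(weight):
    """Largest grade not exceeding the weight (small tolerance for float error)."""
    return max(g for g in SI_GRADES if g <= weight + 1e-9)


def influence_row(t, b):
    """One row of M_S: n influence degrees followed by the weight b (assumed layout)."""
    return list(t) + [b]


# Assumed QI order: (Education, Age, Sex, ZIP). Sex = 0.2 and ZIP = 0.8 for flu
# come from the text; the Education and Age entries are placeholders.
flu_row = influence_row([0.0, 0.0, 0.2, 0.8], B["flu"])
```

The grade plus a small offset keeps every disease weight distinct while still placing related ailments in the same grade.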

Table 2
The original data table

Anatomy. Anatomy was proposed by Xiaokui Xiao et al. It means that QI and SI are published in different tables instead of in one single table with generalized values: the QI table includes a unique identifier, the equivalence class (QI-group) ID, and the SI table includes the equivalence class ID, the SI values of each QI-group, and their counts. Anatomy overcomes the drawbacks of generalization. Extensive experiments confirm that anatomy permits researchers to derive, from the published tables, highly accurate aggregate information about the unknown microdata, with an average error below 10% [9]. For example, Table 3 is a 3-diversity table obtained from Table 2 by anatomy. In [10], the authors proved that the resulting published tables NSS(QI) and SS(SI) satisfy the p-sensitive k-anonymity property; that is to say, anatomy satisfies the p-sensitive k-anonymity property.

Table 3
The 3-diversity data table by anatomy

Table 4
Uncertain data table with attribute-ors construct

Table 5
Uncertain data with tuple-ors construct
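The two constructs named in the captions of Tables 4 and 5 can be related with a short sketch: a u-tuple with attribute-ors expands into the set of possible tuples that a tuple-or would list explicitly. The representation (a Python list for an attribute-or) and the function name `expand_attribute_ors` are assumptions made for illustration.

```python
from itertools import product


def expand_attribute_ors(u_tuple):
    """Enumerate the possible tuples of a u-tuple.

    A field given as a list is treated as an attribute-or (a set of
    alternative values); any other field is a certain value.
    """
    choices = [f if isinstance(f, list) else [f] for f in u_tuple]
    return [tuple(c) for c in product(*choices)]


# t1 from the text: an attribute-or in the first field.
t1 = (["Bachelor", "Master"], "insomnia")
possible = expand_attribute_ors(t1)

# The same uncertainty written as a tuple-or: an explicit list of possible tuples.
t1_tuple_or = [("Bachelor", "insomnia"), ("Master", "insomnia")]
```

Here `possible` equals `t1_tuple_or`, which is the sense in which the same uncertainty can be expressed with either construct.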

anonymity privacy protection model for uncertain data via anatomy
First, we preprocess the uncertain data table so that it becomes a deterministic data table; that is, the uncertain data is generalized, for example {Bachelor, Master} → University. We then model the deterministic data table by k-clustering and anatomy. When we create a deterministic data table from an uncertain data table, each uncertain QI (attribute-or) is labeled with a QIID or TupleID attribute in order to keep the uncertainty of the QI of the original data; QIID or TupleID is an appended attribute column whose value represents the location in the deterministic data table. At the same time, we divide the uncertain SI into two or more SI fields in order to keep the uncertainty of the SI of the original data. That is to say, the uncertain data is stored in a relational database, and we can then use the traditional k-anonymity model to represent the privacy protection of uncertain data.
UDAK-anonymity model. The UDAK-anonymity model (uncertain data anatomy k-anonymity model) is built for the attribute-ors construct. The modeling process needs three steps: preprocessing, BK(L,K)-clustering [13], and anatomy.
Preprocessing
1. Create deterministic table by generalization and partition
Definition 2. Deterministic uncertain data table. T($A_1$, ..., $A_n$) is an uncertain data table; if T can be changed into a deterministic data table T' by generalization and partition, we say T' is a deterministic uncertain data table.

Table 6
The original data table including uncertain data with attribute-ors construct

Table 7
The deterministic data table by generalization and partition according to Table 6

BK(L,K)-clustering ((L,K)-clustering based on influence matrix of background knowledge). T($A_1$, ..., $A_n$) is a table, if T satisfies K-clustering and satisfies the following conditions:
(1) If $\forall\, b_i < c$ in cluster $e_m$, all tuples in $e_m$ should be anatomized directly; otherwise, condition (2) must be satisfied. Here, the threshold $c > 0$, and $b_i$ is the S column vector in influence matrix $M_S$, 1 , 1,...,
(2)
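The decision logic of BK(L,K)-clustering for a single cluster might be sketched as below. Several points are assumptions rather than the paper's definition: K-clustering is read as "each cluster holds at least K tuples", condition (2) is read as "the cluster contains at least L distinct sensitive values", the threshold c = 0.5 is hypothetical, and the function name and return labels are ours. The weights in `B` are the disease weights quoted in the text.

```python
# Disease weights b quoted in the text.
B = {"flu": 0.11, "obesity": 0.12, "short breath": 0.31,
     "lung cancer": 0.91, "mammary cancer": 0.92, "AIDS": 0.93}


def check_bk_lk_cluster(sensitive_values, b, c, L, K):
    """Classify one cluster, given the sensitive value of each of its tuples."""
    if len(sensitive_values) < K:
        return "regroup"  # assumed reading of K-clustering: at least K tuples
    if all(b[s] < c for s in sensitive_values):
        return "anatomize"  # condition (1): every weight is below the threshold c
    if len(set(sensitive_values)) >= L:
        return "anatomize"  # assumed reading of condition (2): L distinct values
    return "generalize_or_suppress"


low_risk = check_bk_lk_cluster(["flu", "obesity", "flu"], B, c=0.5, L=3, K=3)
not_diverse = check_bk_lk_cluster(["flu", "lung cancer", "flu"], B, c=0.5, L=3, K=3)
diverse = check_bk_lk_cluster(["flu", "lung cancer", "AIDS"], B, c=0.5, L=3, K=3)
too_small = check_bk_lk_cluster(["flu"], B, c=0.5, L=3, K=3)
```

Under these assumptions, a cluster containing only low-weight ailments is released directly, while a cluster containing a high-weight disease must also be diverse before it is anatomized.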

Table 8
The anatomized tables according to Table 7

T($A_1$, ..., $A_n$) is a table. If T satisfies BK(L,K)-clustering, we divide T into a QI table (QIT) and an SI table (ST). Specifically, the QIT includes all exact QI values, together with group membership in a new column Group-ID; however, QIT does not store any SI values. ST retains the SI statistics of each QI-group, with Group-ID and count.
Definition 6. UDAK-anonymity. T($A_1$, ..., $A_n$) is an uncertain data table and T' is a deterministic uncertain data table derived from T. If T' satisfies BK(L,K)-anonymity with anatomy, we say T' satisfies UDAK-anonymity. For instance, Table 8, which was anatomized from Table 7, satisfies UDAK-anonymity.
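The split into QIT and ST described above can be sketched in a few lines. The function name `anatomize`, the column name "Count", and the sample groups are hypothetical; the groups are assumed to have already been formed by the clustering step, and the rows are not those of Table 7 or Table 8.

```python
from collections import Counter


def anatomize(groups, qi, si):
    """Split grouped rows into a QI table (QIT) and a sensitive table (ST)."""
    qit, st = [], []
    for gid, rows in enumerate(groups, start=1):
        # QIT: exact QI values plus the group membership, no SI values.
        for row in rows:
            qit.append({**{a: row[a] for a in qi}, "Group-ID": gid})
        # ST: per-group statistics of the sensitive attribute.
        for value, count in Counter(row[si] for row in rows).items():
            st.append({"Group-ID": gid, si: value, "Count": count})
    return qit, st


# Hypothetical clustered rows (illustrative only).
groups = [
    [{"Education": "University", "Age": 48, "Disease": "flu"},
     {"Education": "University", "Age": 35, "Disease": "flu"},
     {"Education": "Secondary", "Age": 29, "Disease": "obesity"}],
    [{"Education": "Secondary", "Age": 52, "Disease": "insomnia"}],
]
qit, st = anatomize(groups, ["Education", "Age"], "Disease")
```

Because QIT keeps exact QI values and ST only aggregates per group, an observer who joins the two tables on Group-ID can link a record to its group's sensitive values but not to one specific value.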