1 Introduction

There is a general consensus that we are now in the era of big data, with tremendous amounts of information being collected by various organizations and entities. Often, these data providers may wish to contribute their data to studies involving tasks such as statistical analysis, classification, and prediction. Because cloud service providers (CSPs) offer data providers (owners) great flexibility with respect to computation and storage capabilities, CSPs are currently the most popular avenue through which data providers can share their data. However, straightforwardly uploading data that may contain sensitive information, such as medical or financial records, addresses and telephone numbers, or preferences of various kinds that individuals may not want exposed, carries an unacceptable risk of leaking individuals’ private information. Data providers usually protect their data’s privacy through encryption, but the resulting encrypted data are known to be of poor utility. As an alternative, differential privacy (DP) has been considered a useful technique not only for protecting the privacy of data but also for retaining the utility of the data. As shown in Fig. 1, there are two main frameworks for achieving DP: an interactive framework for data release and a non-interactive framework for data publication. In this paper, we focus on the latter, i.e., differentially private publication.

With current techniques, there is no efficient method that allows data to be sanitized only once while preserving the data’s utility for all possible algorithms and applications. When data providers’ data need to be shared for uses involving different algorithms, applications, or privacy budgets, different kinds of noise have to be added for privacy protection, and all of the resulting noisy datasets must be shared or published. Consequently, the communication overhead becomes enormous if the number of different algorithms, applications, or privacy budgets is large. Another challenging issue is that when data providers publish their data, some public entity must store all of these differently perturbed datasets, which inevitably requires considerable storage space.

To overcome these challenges, we propose the notion of outsourcing differentially private data publication in this paper. With the advent of cloud computing, an increasing number of storage and computing tasks are moving from local resources to CSPs. In our scheme, the data providers outsource the generation of the sanitized dataset to a CSP. To protect the privacy of the data, cryptographic techniques are used to encrypt the data before outsourcing. However, data sanitization requires the ability to easily modify the data to be sanitized, whereas encrypted data generally cannot provide such an ability. Using fully homomorphic encryption to support ciphertext manipulation is inefficient and requires enormous amounts of storage space and communication bandwidth. Unlike other schemes, our differentially private data publication requires only addition operations, which can be realized with additive homomorphic encryption and is therefore much more efficient than fully homomorphic encryption. Although this approach enables us to add noise to encrypted data, keeping the data encrypted would still lead to poor utility or heavy storage and computation costs in many respects. Therefore, our scheme allows the data evaluator to decrypt the encrypted noisy data, thereby improving both the utility of the data and the storage cost. In this way, the data providers are not required to be online when their data are requested, which is one of the critical requirements for a practical application system.

Fig. 1. Frameworks for achieving differential privacy

1.1 Related Work

Differential privacy has been accepted as the main privacy paradigm in recent years because it rests on purely mathematical foundations and provides a means of quantitative assessment. A large body of work on DP has accumulated owing to its support for privacy-preserving learning. Before the first work on DP by Dwork in 2006 [9], some works used cryptographic methods to achieve privacy-preserving data utilization [32, 33]; the Laplace mechanism for adding noise to achieve \(\epsilon \)-DP was proposed in that seminal paper. Subsequently, McSherry designed the exponential mechanism [25] and identified the sequential and parallel composition properties of DP [24].

Generally, DP can be achieved via two main frameworks. In the first framework, the data evaluator’s queries are answered under a predetermined privacy budget \(\epsilon \), as shown by the red box in Fig. 1. In this framework, we can apply the Laplace [11], Privelet [35], Linear Query [21], and Batch Query [38] techniques, among others, to obtain responses to these queries that satisfy \(\epsilon \)-DP. However, this framework demands interaction between the data provider and the data evaluator. For this reason, in this paper, we focus on the second framework, depicted by the blue dashed box in Fig. 1. The second framework not only allows a data provider to publish his data immediately after processing but also requires no interaction. The main focus of research related to this framework is on how to design efficient and effective noise-adding algorithms that ensure DP while boosting data utility.

Typical publication methods include histogram publication [15, 17, 36, 37], partition publication [5, 7, 26, 28], contingency table publication [1, 20], and sanitized dataset publication [8, 10, 12, 14]. A histogram is an intuitive representation of the distribution of a set of data and can serve as a basis for other statistical or linear queries. However, histogram publication suffers from redundant noise and inconsistency, meaning that different types of noise must be added for different uses. Partition publication can reduce the amount of noise that must be added. Its foundation is the careful design of an index structure to support the partitioning of the data. Using this index, the data provider can assign a privacy budget to each partition for noise addition before publication. However, determining how to assign the privacy budget is not a trivial task, and the partition index itself may leak sensitive information; this potential risk is the core problem that remains to be solved for this method of publication. Data can often be represented in the form of a contingency table. In practice, instead of publishing the contingency table itself for analysis, data are often published based on the statistical values of combinations of certain variables, as represented by marginal tables. Directly adding noise to a contingency table introduces too much noise, whereas perturbing the marginal tables may cause inconsistency. Qardaji proposed a noise-adding method in which the contingency table is divided into small pieces, called views [29]. This method can reduce the amount of noise introduced, but how to choose the division parameters and how to preserve consistency between the views and the marginal tables remain challenging questions. The purpose of sanitized dataset publication is to protect data privacy by processing the original dataset before release. Directly publishing a sanitized dataset that satisfies DP while allowing the data evaluator to make any necessary inquiries is quite challenging: this method demands considerable computation and is thus inefficient and difficult to realize. Kasiviswanathan and Blum proved that sanitized dataset publication is possible [3, 19]; however, it requires an enormous number of records.

Although the DP model provides frameworks for data evaluators to analyze databases belonging to a single party, the initial frameworks did not consider the multi-party setting. Multi-party DP was first proposed by Manas in 2010 [27], based on aggregating multi-party datasets to train a classifier. Subsequently, many works involving multi-party DP publication have been reported, including multi-task learning [13], multi-party deep learning [30], classifier training on private and public datasets [18], and high-dimensional data publication [31]. These works, however, did not consider outsourcing computation to relieve the computational burden on data providers.

Outsourced computing allows resource-constrained data providers to securely delegate expensive computations to untrusted cloud servers with abundant computational resources. Chaum first proposed the notion of wallets, secure hardware installed on a client’s computer to perform expensive computations, in 1992 [4]. To protect data providers’ privacy, Chevallier presented the first algorithm for the secure delegation of elliptic-curve pairings [6]. In addition, solutions for performing meaningful computations over encrypted data using fully homomorphic encryption have emerged, although they are known to be impractical [2]. Meanwhile, cloud computing schemes based on attribute-based encryption have appeared, which not only support ciphertext operations but also provide fine-grained access control. Several works concentrating on privacy preservation in cloud computing have also been proposed recently [16, 22, 23]. Nevertheless, in this paper, we choose additive homomorphic encryption as the basic technique for our outsourced computation, as it offers a good balance between efficiency and security.

1.2 Contributions

In this paper, we propose a novel outsourced DP scheme for cloud computing. Our contributions can be summarized as follows.

  • We design an efficient outsourced DP approach using additive homomorphic encryption instead of fully homomorphic encryption. In this way, data can be efficiently outsourced to a CSP for secure storage while still supporting the noise addition required for DP.

  • In our scheme, the data provider is not required to be involved in subsequent noise computations and related processing.

  • The security of the data against the CSP can be guaranteed under our proposed security model.

1.3 Organization

The rest of this paper is organized as follows. Some preliminary considerations are discussed in Sect. 2. In Sect. 3, the architecture of our scheme and our threat model are introduced. Then, we present the new scheme in Sect. 4. The implementation details and the evaluation of the experimental results are presented in Sect. 5. Finally, we conclude our work in Sect. 6. The security analysis and some related concepts used in this paper are provided in the appendix.

Table 1. Symbol cross reference table

2 Preliminaries

2.1 Differential Privacy

DP was introduced by Dwork et al. as a technique for individual privacy protection in data publication. It provides a strong privacy guarantee, ensuring that the presence or absence of an individual will not significantly affect the final output of any function (Table 1).

Definition 1

(Differential privacy)

A randomized function \(\mathcal {A}\) with a well-defined probability density \(\mathcal {P}\) satisfies \(\epsilon \)-DP if, for any two neighboring datasets \(D_1\) and \(D_2\) that differ by only one record and for any \(\mathcal {O}\in range(\mathcal {A})\),

$$\begin{aligned} \mathcal {P} (\mathcal {A}(D_1) =\mathcal {O}) \le e^\epsilon \cdot \mathcal {P} (\mathcal {A}(D_2) =\mathcal {O}) \end{aligned}$$
(1)

DP can usually be achieved via one of two standard mechanisms: the Laplace mechanism and the exponential mechanism. Both of them are based on the concept of the sensitivity of a function f. For ease of description, we consider only numeric values in this paper.

Definition 2

(Global sensitivity)

Let f be a function that maps a database to a fixed-size vector of real numbers. For all neighboring databases \(D_1\) and \(D_2\), the global sensitivity of f is defined as

$$\begin{aligned} \varDelta (f) = \max _{D_1,D_2}\Vert f(D_1)-f(D_2)\Vert _1 \end{aligned}$$
(2)

where \(\Vert \cdot \Vert _1\) denotes the \(L_1\) norm.
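For example, for a counting query \(f(D) = |\{x \in D : x \text { satisfies } \phi \}|\), adding or removing a single record changes the count by at most one, so \(\varDelta (f) = 1\); for a sum query over values clipped to the interval [0, B], \(\varDelta (f) = B\).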

To publish data that satisfy \(\epsilon \)-DP when a query function f is applied, the principal approach is to perturb the data by adding random noise based on \(\varDelta f\) and the privacy budget \(\epsilon \). For example, for the Laplace mechanism, let Lap(\(\lambda \)) denote the Laplace probability distribution with mean zero and scale \(\lambda \). The Laplace mechanism achieves DP by adding Laplace noise to an original dataset M.

Definition 3

(Laplace mechanism)

Let m be a record in database M (\(m\in M\)), and let \(\eta \) be a random variable such that \(\eta \sim Lap(\varDelta f/\epsilon )\). The Laplace mechanism is defined as follows:

$$\begin{aligned} m' = m + \eta , \quad \text {for each } m \in M. \end{aligned}$$
(3)
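To illustrate Definition 3, the following minimal Python sketch (our illustration, not code from the paper’s implementation) perturbs every entry of a numeric dataset with noise drawn from \(Lap(\varDelta f/\epsilon )\):

```python
import numpy as np

def laplace_mechanism(records, sensitivity, epsilon, rng=None):
    """Return a sanitized copy of `records`, adding Lap(sensitivity/epsilon)
    noise independently to every numeric entry."""
    rng = rng or np.random.default_rng()
    scale = sensitivity / epsilon          # lambda = Delta f / epsilon
    noise = rng.laplace(loc=0.0, scale=scale, size=np.shape(records))
    return np.asarray(records, dtype=float) + noise

# Example: sanitize a toy dataset under epsilon = 0.1 for a query with Delta f = 1.
M = [12.0, 7.0, 3.0, 9.0]
M_noisy = laplace_mechanism(M, sensitivity=1.0, epsilon=0.1)
```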

3 Architecture

In this section, we formalize our system model, and identify the threat model and security requirements.

3.1 System Model

In our system model, four types of entities are involved in our basic scheme, namely, the trusted authority (TA), the data providers, the cloud service provider (CSP), and the data evaluator. The TA issues credentials for both the data providers and the data evaluators. The data providers possess data and would like to share those data for purposes such as classification or data analysis. The CSP provides the cloud storage service for the data providers. The data evaluator obtains the sanitized data from the CSP and performs the corresponding data analysis. Each data evaluator may acquire a different part of the dataset for a different use; however, whenever the same data are requested, they carry the same noise.

3.2 Threat Model and Security Requirements

In this work, both the data evaluator and the CSP are assumed to be honest-but-curious. The data evaluator needs to protect his trained model against the CSP, and he will follow the specified protocol so that a correct model is built. The relationship between the CSP and the data evaluator is non-cooperative, meaning that they will not collude with each other (for example, the CSP could be Google Inc. and the data evaluator Apple Inc., a common situation in practice); otherwise, even if both are honest, the data evaluator could give the CSP his secret key, allowing the CSP to recover the original dataset. The privacy of the data providers’ data needs to be protected against both the CSP and the data evaluator.

For the data provider, security means that the privacy of his data should be protected against the other entities even if they are curious about the underlying original data. Of course, the CSP and the data evaluator are not allowed to collude with each other in our security model.

4 Our Proposed Outsourced Differential Privacy Schemes

In this section, we present several basic outsourced differential privacy (ODP) schemes. Throughout the description, we assume that a public-key encryption scheme with additive homomorphic properties, i.e., a tuple (Setup, KeyGen, Enc, Dec, Add), is applied in our system.
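For concreteness, the sketch below (a pedagogical toy with insecurely small primes chosen only for readability; any IND-CPA-secure additively homomorphic scheme can be substituted) instantiates this tuple with textbook Paillier encryption, where the product of two ciphertexts decrypts to the sum of the plaintexts:

```python
import math
import random

def keygen(p=1009, q=1013):
    """KeyGen for textbook Paillier; p and q are insecurely small primes,
    fixed here only so the example is easy to follow."""
    n = p * q
    lam = math.lcm(p - 1, q - 1)              # lambda = lcm(p-1, q-1), Python >= 3.9
    g = n + 1                                 # the standard choice g = n + 1
    # mu = (L(g^lambda mod n^2))^{-1} mod n, with L(x) = (x - 1) // n
    mu = pow((pow(g, lam, n * n) - 1) // n, -1, n)
    return (n, g), (lam, mu)

def enc(pk, m):
    """Enc: c = g^m * r^n mod n^2, for a random r coprime to n."""
    n, g = pk
    r = random.randrange(2, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(2, n)
    return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)

def dec(pk, sk, c):
    """Dec: m = L(c^lambda mod n^2) * mu mod n."""
    n, _ = pk
    lam, mu = sk
    return (((pow(c, lam, n * n) - 1) // n) * mu) % n

def add(pk, c1, c2):
    """Add: the product of two ciphertexts encrypts the sum of the plaintexts."""
    n, _ = pk
    return (c1 * c2) % (n * n)

pk, sk = keygen()
assert dec(pk, sk, add(pk, enc(pk, 41), enc(pk, 1))) == 42   # 41 + 1 = 42
```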

Fig. 2. Single-data-provider scheme

4.1 A Straightforward Scheme

In a straightforward scheme, the typical process of differentially private data publication can be summarized as follows:

  • Setup. The data evaluator and data provider together establish the privacy budget \(\epsilon \).

  • Generation. The data provider calculates \(\varDelta f\) and then uses \(\varDelta f\) and \(\epsilon \) to generate a noisy dataset.

  • Uploading. The data provider uploads the noisy dataset to the CSP.

  • Analysis. The data evaluator obtains the noisy dataset from the CSP and analyzes this dataset.

Note that for different data evaluators and/or different privacy budgets, the above steps must be executed repeatedly, which means more computation and bandwidth will be consumed, as the sketch below illustrates.
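A rough illustration of this repetition (our sketch, assuming the Laplace mechanism and NumPy; not the paper’s code):

```python
import numpy as np

rng = np.random.default_rng()
M = np.array([12.0, 7.0, 3.0, 9.0])          # the provider's dataset
budgets = [0.1, 0.5, 1.0]                    # k = 3 evaluators / privacy budgets

# One full sanitization and upload per (evaluator, budget) pair.
uploads = [M + rng.laplace(scale=1.0 / eps, size=M.shape) for eps in budgets]
# Communication grows linearly: len(uploads) noisy copies of the entire dataset.
```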

To overcome the disadvantages of the above scheme, we show in the following sections how to provide efficient noise addition with the aid of cloud servers.

4.2 Our Scheme for Single Data Provider

Initially, we assume that there is a single data provider in the system (Fig. 2). Then, the single-data-provider scheme, as shown in Algorithm 1, is composed of the following steps:

  • Setup. In this step, the TA generates the public parameters and key pairs involved in the system. Let (pk, sk) be the key pair for the data evaluator, where pk is the public key and sk is the secret key. Note that the data provider should pre-calculate the possible set of function sensitivities, denoted by \(\varDelta \mathbf F =(\varDelta f_1,\varDelta f_2, \cdots , \varDelta f_m)\), where the \(f_i\) \((i = 1,\cdots ,m)\) are the functions that the data evaluator might use in his evaluations.

  • Data uploading. The data provider first receives the data evaluator’s public key pk from the TA, encrypts his dataset \(M=(m_1,m_2, \cdots , m_n)\) using the Enc(pk, M) algorithm, and uploads the resulting ciphertexts \(C=(\Vert m_1\Vert ,\Vert m_2\Vert , \cdots , \Vert m_n\Vert )\) to the cloud server. Then, the data provider (and perhaps the data evaluator) determines the privacy budget \(\epsilon \). Furthermore, the parameter vector \(\mathbf b \) for noise generation is also sent to the cloud server (e.g., for the Laplace mechanism, since the Laplace noise generation depends on \(\frac{\varDelta \mathbf F }{\epsilon }\), \(\mathbf b =(b_1,b_2,\cdots ,b_m)=(\frac{\varDelta f_1}{\epsilon },\frac{\varDelta f_2}{\epsilon },\cdots ,\frac{\varDelta f_m}{\epsilon })\)).

  • Noise addition. After receiving the data from the data provider, the cloud server generates a noise component \(\eta \) (for example, in the Laplace mechanism, \(b_i = \frac{\varDelta f_i}{\epsilon }\) is used as the parameter of the Laplace distribution from which the noise is randomly drawn) and encrypts the noise using \(Enc(pk,\eta ) = \Vert \eta \Vert \). Then, the cloud server uses the \(Add(\Vert M\Vert ,\Vert \eta \Vert )\) algorithm to add the noise to the data provider’s data (for example, in the case of Paillier encryption, the server multiplies the data provider’s encrypted data by the encrypted noise) and sends the resulting noisy data to the data evaluator.

  • Data analysis. The data evaluator first decrypts the received ciphertexts using \(Dec(sk,\Vert M+\eta \Vert )\) to obtain all of the noisy data. Based on these data, the data evaluator can successfully perform classification or apply other algorithms to the data.

Algorithm 1
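A compact end-to-end sketch of the flow above is given below. It assumes the open-source python-paillier package (`phe`) as a stand-in for the additively homomorphic scheme; the package and its API are our assumption for illustration, and the toy Paillier sketch above could be substituted.

```python
# pip install phe  -- python-paillier, assumed here as the additive HE scheme
import numpy as np
from phe import paillier

rng = np.random.default_rng()

# Setup (TA): key pair for the data evaluator.
pk, sk = paillier.generate_paillier_keypair(n_length=1024)

# Data uploading (provider): encrypt M under the evaluator's pk; send C and b.
M = [12.0, 7.0, 3.0, 9.0]
C = [pk.encrypt(m) for m in M]
epsilon, delta_f = 0.1, 1.0
b = delta_f / epsilon                        # Laplace scale parameter

# Noise addition (CSP): encrypt fresh Laplace noise and add it under encryption.
noise = rng.laplace(scale=b, size=len(C))
C_noisy = [c + pk.encrypt(float(e)) for c, e in zip(C, noise)]

# Data analysis (evaluator): decrypt to obtain the sanitized records.
M_noisy = [sk.decrypt(c) for c in C_noisy]
```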

4.3 A Multi-Data-Provider Scheme

Next, we present the multi-data-provider scheme (similar to Fig. 2, but with more data providers). For ease of description, we assume that each party holds a parallel dataset, meaning that there is no overlap between any two parties’ databases. Each party \(P_i\) holds a portion of D, denoted by \(D_i\), such that \(\sum {D_i}=D\). The scheme, as shown in Algorithm 2, consists of four steps.

  • Setup. This step is similar to that for a single data provider. The difference is that there are k parties, denoted by \(P_1,P_2,\cdots ,P_k\), and each user \(P_i\) should pre-calculate a set of possible function sensitivities \(\varDelta \mathbf F _i\).

  • Data uploading. Each data provider \(P_i\) encrypts the dataset \(M_i=(m_{i1},m_{i2}, \cdots , m_{in})\) using the \(Enc(pk,M_i)\) algorithm. After the encryption, \(P_i\) uploads the ciphertexts \(C_i=(\Vert m_{i1}\Vert ,\Vert m_{i2}\Vert , \cdots , \Vert m_{in}\Vert )\) to the cloud server. Then, the data providers determine the privacy budget \(\epsilon \). The parameter vectors \(\mathbf b _i\) for noise generation are also sent to the cloud server.

  • Noise addition. Using all the noise parameter vectors \(\mathbf b _i\) \((i=1,2,\cdots ,k)\), the cloud server generates the noise \(\eta _i\) for each \(P_i\) and encrypts the noise using \(Enc(pk,\eta _i) = \Vert \eta _i\Vert \). Then, the cloud server uses the \(Add(\Vert M_i\Vert ,\Vert \eta _i\Vert )\) algorithm to add the noise to the data providers’ data and sends the noisy data to the data evaluator.

  • Data analysis. This step is the same as that for a single data provider.

Algorithm 2
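The multi-data-provider case differs only in that the k providers encrypt under the same evaluator key and the CSP perturbs each partition independently; a hedged sketch (again assuming the `phe` package, with illustrative names) follows:

```python
import numpy as np
from phe import paillier

rng = np.random.default_rng()
pk, sk = paillier.generate_paillier_keypair(n_length=1024)

# k parallel partitions D_1, ..., D_k, one per provider (no overlap).
partitions = {"P1": [3.0, 8.0], "P2": [5.0, 1.0], "P3": [9.0, 2.0]}
epsilon = 0.1
b = {"P1": 1.0 / epsilon, "P2": 1.0 / epsilon, "P3": 2.0 / epsilon}  # per-provider scales

# Each provider encrypts its own partition under the evaluator's pk.
C = {p: [pk.encrypt(m) for m in data] for p, data in partitions.items()}

# The CSP adds independently drawn, freshly encrypted noise per partition.
C_noisy = {p: [c + pk.encrypt(float(rng.laplace(scale=b[p]))) for c in cs]
           for p, cs in C.items()}

# The evaluator decrypts all partitions and assembles the sanitized dataset D'.
D_noisy = [sk.decrypt(c) for cs in C_noisy.values() for c in cs]
```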

4.4 Discussion: How to Add Noise Wisely and Efficiently

In the above schemes, we have assumed that noise generation is a simple task performed by the CSP. In fact, the noise can also be generated by other entities. In the following, we discuss additional methods of noise generation for the single-data-provider scheme. Note that similar methods can also be applied in the multi-data-provider scheme.

Noise Addition by the Data Provider. Noise can be generated by the data provider; see Fig. 3. In this method, the CSP never sees the parameter vector \(\mathbf b \) or the generated noise \(\eta \). The details of the procedure are shown in Algorithm 3, and a brief sketch follows it.

Fig. 3. Noise addition by the data provider

Algorithm 3
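In this variant the provider can simply draw \(\eta \) itself and upload the encryption of the already-noisy records, so neither \(\mathbf b \) nor \(\eta \) ever reaches the CSP. A minimal sketch under the same `phe` assumption:

```python
import numpy as np
from phe import paillier

rng = np.random.default_rng()
pk, sk = paillier.generate_paillier_keypair(n_length=1024)

M, epsilon, delta_f = [12.0, 7.0, 3.0], 0.1, 1.0

# The provider samples the noise locally and encrypts the noisy records directly,
# so the CSP only stores and forwards ciphertexts.
C_noisy = [pk.encrypt(m + float(rng.laplace(scale=delta_f / epsilon))) for m in M]
```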
Fig. 4. Noise addition by a noise server

Noise Addition by a Noise Server. Noise addition by the data provider can protect some sensitive parameters from the CSP. However, that scheme does not allow the data provider to go offline. Therefore, we propose a scheme with noise addition by a noise-generating server; see Fig. 4. We assume that there is another server, called the noise-generating service provider (NSP), in our system; i.e., the system now contains five types of entities: the trusted authority (TA), the data provider, the cloud service provider (CSP), the data evaluator, and the NSP. The main idea of this design is to separate noise generation from the rest of the system. The details of the procedure are shown in Algorithm 4, followed by a short sketch.

Algorithm 4
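The flow can be sketched as three messages: the provider sends \(\Vert M\Vert \) to the CSP, the NSP sends freshly encrypted noise \(\Vert \eta \Vert \), and the CSP performs only the homomorphic Add. A hedged sketch (same `phe` assumption):

```python
import numpy as np
from phe import paillier

rng = np.random.default_rng()
pk, sk = paillier.generate_paillier_keypair(n_length=1024)

# Data provider: uploads encrypted data once, then may go offline.
M = [12.0, 7.0, 3.0]
C = [pk.encrypt(m) for m in M]

# NSP: generates and encrypts the noise (it sees b, but never the data).
b = 1.0 / 0.1                                   # delta_f / epsilon
E_noise = [pk.encrypt(float(e)) for e in rng.laplace(scale=b, size=len(C))]

# CSP: only homomorphic additions (it sees neither b, eta, nor the plaintexts).
C_noisy = [c + e for c, e in zip(C, E_noise)]

M_noisy = [sk.decrypt(c) for c in C_noisy]      # data evaluator
```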

4.5 Discussion: How to Treat High-Dimensional Data

The publication of high-dimensional data can support a wide spectrum of evaluation tasks. However, the problem of how to ensure the privacy of high-dimensional data is still challenging. Generally speaking, we can handle high-dimensional data as described below.

View-Based Dimension Reduction. High-dimensional data can be presented in the form of a contingency table [29]. For most types of data evaluations, a k-way marginal contingency table (generally, \(k\le 3\)) can be used. However, directly adding noise to the complete contingency table (all of the data) to achieve data sanitization will lead to excessive noise. Therefore, determining how to add noise to obtain a high-utility k-way marginal table is the key to high-dimensional data publication. We use the following basic procedure to reduce the dimensionality of high-dimensional data before publication.

  • Build a contingency table. In this step, the data provider builds a contingency table based on his high-dimensional dataset.

  • Generate views. The data provider runs an algorithm to select the best parameters with which to generate views. (For example, suppose that we have an l-column, d-dimensional table, where \(d=8\) and \(l=100\); then, the data provider runs the algorithm to generate \(d^2+l\) sub-contingency tables, which are referred to as views.) The data provider can treat these views in the same way as his original data, following the scheme proposed above (setup, data uploading, noise generation, noise addition). A Bloom filter should also be used to assign the attributes to the encrypted views.

  • Reconstruct. The noisy k-way marginal table is reconstructed based on the encrypted views, and the Bloom filter is used to check whether the views can be successfully used to reconstruct the k-way marginal table (see the sketch after this list).
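A minimal sketch of view generation and Bloom-filter tagging is given below. It is our illustration only: the parameter-selection algorithm and exact view layout of [29] are out of scope, and the encryption and noise-addition steps of the scheme are omitted; all names are hypothetical.

```python
import hashlib
from collections import Counter
from itertools import combinations

class BloomFilter:
    """A tiny Bloom filter for tagging which attributes appear in a view."""
    def __init__(self, size=256, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0
    def _positions(self, item):
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:4], "big") % self.size
    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos
    def might_contain(self, item):
        return all(self.bits >> pos & 1 for pos in self._positions(item))

# Toy high-dimensional records: each row is a dict of attribute -> value.
rows = [{"age": "30s", "zip": "100", "dx": "flu"},
        {"age": "30s", "zip": "101", "dx": "flu"},
        {"age": "40s", "zip": "100", "dx": "cold"}]

# Generate 2-way views (sub-contingency tables) and tag their attributes.
views = {}
for attrs in combinations(["age", "zip", "dx"], 2):
    table = Counter(tuple(r[a] for a in attrs) for r in rows)
    bf = BloomFilter()
    for a in attrs:
        bf.add(a)
    views[attrs] = (table, bf)

# Reconstruction side: check via the Bloom filter whether a view covers
# the attributes of a requested k-way marginal before using it.
usable = [v for v, (_, bf) in views.items() if bf.might_contain("age")]
```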

4.6 Discussion: How to Make the Scheme More Practical

In the schemes presented above, each data provider uses the data evaluator’s public key to encrypt his own data, which means that the data provider cannot delete his local copy of the dataset to save storage space: encrypting under the data evaluator’s public key costs the data provider the ability to decrypt the dataset stored by the CSP. Inspired by Wang’s method [34], we propose an ODP scheme with a proxy re-encryption technique; the details of the procedure are shown in Algorithm 5.

Algorithm 5

5 Evaluation

In this section, we evaluate the performance of our scheme in terms of functionality, computational overhead, and communication overhead. All experiments were conducted on a PC with a 1.90 GHz AMD A4-3300M APU with Radeon(TM) HD Graphics and 6 GB of RAM. The different noise-addition methods differ only in the execution time of noise generation; for simplicity, we chose the Laplace mechanism in this evaluation to demonstrate our scheme’s feasibility.

Fig. 5. Functionality

5.1 Functionality

We used datasets acquired from the UCI machine learning repository to evaluate our scheme’s functionality. To evaluate the classifier performance, we reserved \(\frac{1}{10}\) of each dataset to serve as a test dataset, and we chose \(\epsilon =0.1\) for applying the Laplace mechanism to our datasets. Figure 5 shows the accuracies of KNN and Naive Bayes classifiers trained on the Letter Recognition, EEG Eye State, CPU, and Glass datasets. From the figure, we can see that training a classifier on the sanitized dataset instead of the original dataset has little influence on the classifier performance.

5.2 Computational Overhead

We use the Paillier scheme as our homomorphic encryption algorithm. To perform homomorphic addition on ciphertexts, we treat each attribute value as a plaintext input and calculate its corresponding ciphertext. Thus, the total number of encryption operations is counts = records × attributes.

Fig. 6. Computational overhead

Fig. 7. Communication overhead

As seen in Fig. 6, it takes approximately 30 s to perform 1000 encryption/decryption operations but only 48 ms to perform 1000 addition operations on ciphertexts. In our scheme, data providers can pre-upload their data to the CSP, and the encryption can be executed during the data providers’ idle time. Because the cost of ciphertext addition is low, our scheme is practical and feasible.

5.3 Communication Overhead

The main communication costs are incurred in two phases: data uploading and data analysis.

In the data uploading phase (see Fig. 7), the size of the message sent in our scheme is \(ndc+\frac{k}{1000}\) kb, whereas the message size in the straightforward scheme is \(kndp\) kb, where n, d, c, k, and p denote the size of the dataset, the number of attributes, the size of each ciphertext record, the number of types of noise to be added, and the size of each plaintext record, respectively. As more kinds of noise are added, the communication cost grows linearly in the straightforward scheme. Meanwhile, in our scheme, the size of the dataset increases to 88 kb after encryption when the original dataset size is 43 kb, indicating that encryption leads to roughly a \(2\times \) increase. Once there is more than one type of noise to be added, our scheme ensures a lower communication cost.

Fig. 8. Storage costs

In the data analysis phase, the total size of the message sent by the CSP is kndc, compared with kndp in the straightforward scheme. Generally speaking, therefore, the message sent to the data evaluator in our scheme is approximately twice as large as that in the straightforward scheme. However, as shown in Fig. 8, the storage cost of our scheme is lower for both the data providers and the CSP.

  • Storage cost for a data provider (see Fig. 8(a)). In our scheme, because the encrypted original dataset is uploaded, the data provider needs to store only the secret key for Paillier encryption (approximately 32 kb), which can be used to regain his data. By contrast, in the straightforward scheme, because of the lack of encryption (storage in plaintext at the CSP poses privacy concerns), the data provider cannot offload the storage of the original dataset.

  • Storage cost for the CSP (see Fig. 8(b)). In our scheme, the CSP stores only one copy of the original dataset in encrypted form and regenerates a noisy dataset every time a data evaluator requests one. In the straightforward scheme, by contrast, the CSP cannot hold the unencrypted original dataset with which to generate noisy datasets; it must therefore store every noisy dataset, leading to a storage cost that grows linearly with the number of types of noise to be added.

6 Conclusions

In this paper, we addressed the inefficiency caused by adding different types of noise to a dataset for differentially private publication. Specifically, we proposed an efficient and secure outsourced differential privacy scheme suitable for all DP algorithms based on noise addition, including the Laplace mechanism, in which the data providers incur reduced communication costs and are not required to be online when access to their data is requested. We also showed how to use an independent noise server to handle the noise generation. Differential privacy for high-dimensional data was also discussed. Finally, the experiments demonstrated the efficiency of our new scheme compared with the basic solutions. In future work, we will consider different kinds of noise-adding methods that can be executed over encrypted data. In addition, we are also interested in researching special encryption algorithms that permit smart and efficient noise addition.