
EAD: elasticity aware deduplication manager for datacenters with multi-tier storage systems


Abstract

The popularity of Big Data applications places pressure on storage systems to scale efficiently to meet demand. At the same time, new developments such as solid-state drives have changed the traditional storage hierarchy. Cloud storage systems are transitioning towards a hybrid architecture consisting of large amounts of memory, solid-state disks (SSDs), and traditional magnetic hard disks (HDs). This paper presents elasticity aware deduplication (EAD), a data deduplication framework designed for multi-tier cloud storage architectures consisting of SSDs and HDs. EAD dynamically adjusts its deduplication parameters at runtime in order to improve performance. Experimental results indicate that EAD detects more than 98% of all duplicate data while consuming less than 5% of the expected memory space. Additionally, EAD saves approximately 74% of the overall I/O access cost compared to the traditional design.




Acknowledgements

This research was supported in part by National Science Foundation grants CNS-1527346, CNS-1618398, and CNS-1452751, and AFOSR grant FA9550-14-1-0160.

Author information

Correspondence to Zhengyu Yang.

Appendix: Proof and error analysis of expectation of duplication ratio (EDR)

Real-world workloads in enterprise environments have different I/O behaviors, but their patterns can be regressed to long-term predictable streams. Such a stream can be regarded as a superset of \(n_s\) homogeneous T-size segments, and can therefore be modeled by an average segment (Seg), such that any result that holds on this average segment also holds for the entire stream, i.e., \(EDR_{ Cur }(Stream) \approx DR_{ Cur }(Stream)\) and

$$\begin{aligned} \lim _{ {n_s} \rightarrow \infty }{ DR_{ Cur }(Stream) =DR(Seg) }. \end{aligned}$$
(1)
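
As a sanity check of Eq. 1, the following minimal Python sketch (our illustration, not from the paper) treats \(DR_{Cur}(Stream)\) as the average per-segment duplication ratio over \(n_s\) homogeneous segments; the chunk-value model and all names here are assumptions made purely for illustration:

    import random

    def dr(chunks):
        # duplication ratio: DR = 1 - (#distinct chunks) / (#total chunks)
        return 1 - len(set(chunks)) / len(chunks)

    def make_segment(T=64, pool=8):
        # homogeneous segments: chunk fingerprints drawn i.i.d. from one pool
        return [random.randrange(pool) for _ in range(T)]

    n_s = 10_000
    segments = [make_segment() for _ in range(n_s)]
    dr_stream = sum(dr(s) for s in segments) / n_s   # DR_Cur(Stream)
    print(dr_stream, dr(make_segment()))             # both concentrate around DR(Seg)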

Therefore, we can reduce the problem from “level 3” to “level 2”. In this section, we demonstrate how close EDR(Seg) is to \(DR(Seg)\) during runtime. We first make the following definitions:

Table 3 Four cases with different sampling degrees and dupSet sizes

                          Equal dupSet size    Non-equal dupSet size
    Fully sampled         Case 1               Case 2
    Partially sampled     Case 3               Case 4

Definition 1

dupSet We first define a duplicated set (“dupSet”) as a set of same-value chunks in a segment. For example, in Fig. 18, Case 2, there are four dupSets, i.e.,

$$\begin{aligned} dupSet_1=\{1,1\},\; dupSet_2=\{2,2,2,2,2,2\},\nonumber \\ dupSet_3=\{3,3,3\},\; dupSet_4=\{4,4,4,4,4\}. \end{aligned}$$
(2)
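
A small Python sketch of Definition 1 (an illustrative assumption on our part, grouping chunks by value rather than by real fingerprints):

    from collections import Counter

    def dup_sets(segment):
        # group same-value chunks of one segment into dupSets (Definition 1)
        return {v: [v] * c for v, c in Counter(segment).items()}

    # the segment of Fig. 18, Case 2
    seg = [1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4]
    print(dup_sets(seg))
    # {1: [1, 1], 2: [2, 2, 2, 2, 2, 2], 3: [3, 3, 3], 4: [4, 4, 4, 4, 4]}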

Definition 2

Fully and partially sampled segments If, for each dupSet \(d_i\) among the T chunks of a segment, at least one of its chunks is sampled in \({\mathbb {B}}\), then the segment is considered fully sampled; otherwise it is partially sampled. Note that the worst case of a partially sampled segment is that all \(\kappa\) samples fall into a single dupSet, leaving the other dupSets unsampled, which can happen even when \(\kappa >0\). Based on this definition, we can divide the problem into the four cases shown in Table 3; Fig. 18 gives an example of each. Later we prove that Cases 1 and 3 are exact, while Cases 2 and 4 carry known estimation errors. Here we also give the probability of each case. Given the segment size T, the size of each dupSet \(d_i\), the number of samples \(\kappa\) \((\ge n)\), and the total number of dupSets n, the probability that a segment is fully sampled is:

$$\begin{aligned} P_{ FullSmp }=\frac{ \prod _{ i=1 }^{ n }{ ({ C }_{ 1 }^{ d_{ i } }) } }{ { C }_{ \kappa }^{ T } } =\frac{ ({ C }_{ 1 }^{ d })^{ n } }{ { C }_{ \kappa }^{ T } } =\frac{ d^{ n } }{ { C }_{ \kappa }^{ T } } , \end{aligned}$$
(3)

and the probability that a segment is partially sampled is:

$$\begin{aligned} P_{ PartSmp }=1-\frac{ d^{ n } }{ { C }_{ \kappa }^{ T } } . \end{aligned}$$
(4)
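
The closed form of Eq. 3 can be checked by simulation. The sketch below is ours and assumes the \(\kappa = n\) case, in which the numerator \(d^n\) counts the ways to pick exactly one chunk per dupSet; it estimates \(P_{FullSmp}\) by Monte Carlo:

    import random
    from math import comb

    def p_fully_sampled(seg, kappa, trials=200_000):
        # draw kappa chunks without replacement; count trials in which
        # every dupSet received at least one sample
        values = set(seg)
        hits = sum(set(random.sample(seg, kappa)) == values
                   for _ in range(trials))
        return hits / trials

    d, n = 4, 4                                    # n equal dupSets of size d
    seg = [i for i in range(n) for _ in range(d)]  # T = n * d = 16 chunks
    T, kappa = len(seg), n
    print(p_fully_sampled(seg, kappa), d**n / comb(T, kappa))  # ~0.141 vs 0.1407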

Definition 3

Estimation error To evaluate the estimation accuracy, we define the estimation error \(\delta\) as the distance between EDR and DR:

$$\begin{aligned} \delta =\left| EDR(Seg)-DR(Seg)\right| , \end{aligned}$$
(5)

where \(\delta \in [0, +\infty )\); the smaller \(\delta\) is, the more accurate the EDR. When \(\delta =0\), the estimation is fully accurate. Based on these assumptions and definitions, we now calculate \(\delta\) for each case:

[Case 1] Fully sampled, equal dupSet size

We first calculate the actual \(DR_{Cur}\) of a segment:

$$\begin{aligned} DR(Seg) = 1 - \frac{n}{T}=1-\frac{1}{d}. \end{aligned}$$
(6)

We then investigate the EDR. Since the “fully sampled” case ensures at least one sample per dupSet, we can write \(\kappa =n+\varepsilon\), where \(\varepsilon \in [0,T-n]\) is the number of redundant samples, i.e., samples that duplicate the one existing sample per dupSet. We further let each dupSet \(d_i\) be assigned \(\kappa _i=1+\varepsilon _i\) samples, where \(\varepsilon _i\) is the number of redundant samples of \(d_i\) and \(\sum _{ i\in {\mathbb {B}}}^{ }{\varepsilon _i}=\varepsilon\). Fig. 18, Case 1 illustrates an example, where \(\kappa _3=1+\varepsilon _3=1+1=2\). Therefore, EDR(Seg) can be calculated as:

$$\begin{aligned} EDR(Seg) =1-\frac{ 1 }{ \kappa } \sum _{ i\in { { { \mathbb {B} } } } }^{ }{ \left ( \frac{h_{Smp_i}}{h_{Seg_i}}\right)} =1-\frac{ 1 }{ \kappa } \left(\sum _{ i\in {\mathbb {B}} }^{ }{ \frac{ 1 }{ d_{ i } } } +\sum _{ i\in {\mathbb {B}} }^{ }{ \frac{ \varepsilon _{ i } }{ d_{ i } } } \right). \end{aligned}$$
(7)

Since in Case 1 all dupSets have the same size (\(d_i=d\)), Eq. 7 can be simplified as:

$$\begin{aligned} EDR(Seg)=1-\frac{ 1 }{ \kappa } \left(\frac{ 1 }{ d } \cdot n+\frac{ 1 }{ d } \cdot \varepsilon \right) =1-\frac{ 1 }{ \kappa d } (n+\varepsilon ) =1-\frac{ \kappa }{ \kappa d } =1-\frac{ 1 }{ d } . \end{aligned}$$
(8)

Thus \(\delta = 0\), which proves that our EDR reflects the duplication ratio of the accumulated workload with 100% accuracy in Case 1.
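
A quick numeric confirmation of Eq. 8 (our sketch; the sample-allocation vectors are arbitrary): with equal dupSet sizes, the EDR of Eq. 7 equals \(1-1/d\) no matter how the redundant samples are distributed:

    def edr(dup_sizes, samples):
        # Eq. 7: EDR = 1 - (1/kappa) * sum(kappa_i / d_i), with kappa_i >= 1
        kappa = sum(samples)
        return 1 - sum(k / d for k, d in zip(samples, dup_sizes)) / kappa

    sizes = [4, 4, 4, 4]                          # d = 4, so DR = 1 - 1/4 = 0.75
    for alloc in ([1, 1, 1, 1], [3, 1, 1, 1], [2, 2, 1, 4]):
        print(edr(sizes, alloc))                  # 0.75 every time: delta = 0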

[Case 2] Fully sampled, non-equal dupSet size

Fig. 18 Examples of different sampling scenarios

It is straightforward to obtain the real \(DR(Seg)\) of one segment as:

$$\begin{aligned} DR(Seg)=1-\frac{ n }{ T } =1-\frac{ 1 }{ \bar{ d } } . \end{aligned}$$
(9)

However, to calculate EDR, we need to divide the “fully sampled” case into two sub-cases: (2.1) Exact fully sampled: each dupSet has exactly one sample in \({\mathbb {B}}\), so \(\kappa = n\); and (2.2) Redundantly fully sampled: at least one dupSet has more than one sample in \({\mathbb {B}}\), i.e., \(\kappa _i \ge 1\) for all \(i \in [1,n]\) and \(\kappa > n\), where \(\kappa _i\) is the number of samples of \(d_i\).

[Case 2.1] Exact fully sampled

Since \(\kappa = n\) and \(\kappa _i=1\), we have:

$$\begin{aligned} EDR(Seg)=1-\frac{ 1 }{ \kappa } \sum _{ i\in {\mathbb {B} }}^{ }{ \left(\frac{ h_{Smp_i} }{ h_{Seg_i}} \right) } =1-\frac{ 1 }{ n } \left(\sum _{ i=1 }^{ n }{ \frac{ 1 }{ d_{ i } } } \right). \end{aligned}$$
(10)

Based on Eqs. 9 and 10, the estimation error is:

$$\begin{aligned} \delta & =\left| EDR(Seg)-DR(Seg)\right| =\left| \frac{ 1 }{ \bar{ d } } -\frac{ 1 }{ n } \left(\sum _{ i=1 }^{ n }{ \frac{ 1 }{ d_{ i } } } \right)\right| \nonumber \\ & =\left| \frac{ 1 }{ A(D) } -\frac{ 1 }{ H(D) } \right| =\frac{ 1 }{ H(D) }-\frac{ 1 }{ A(D) }. \end{aligned}$$
(11)

In Eq. 11, we use A(D) and H(D) to denote the arithmetic mean and harmonic mean of the set \(D=\{d_i|d_i \in Seg\}\), respectively. Since it is always true that \(0 < H(D) \le A(D)\) (the harmonic mean never exceeds the arithmetic mean), we can drop the absolute value sign. Aiming for better performance, we further investigate under what conditions \(\delta\) is minimized, and conduct several experiments tuning dupSet sizes under the “exact fully sampled, non-equal dupSet size” case.
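
Eq. 11 can be checked numerically. The following sketch (ours; the dupSet sizes are arbitrary) computes \(\delta\) as the gap between the reciprocals of the harmonic and arithmetic means:

    from statistics import mean, harmonic_mean

    def delta_exact_full(dup_sizes):
        # Case 2.1 estimation error (Eq. 11): 1/H(D) - 1/A(D) >= 0
        return 1 / harmonic_mean(dup_sizes) - 1 / mean(dup_sizes)

    print(delta_exact_full([2, 6, 3, 5]))   # non-equal sizes: delta > 0
    print(delta_exact_full([4, 4, 4, 4]))   # equal sizes (Case 1): delta == 0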

Figure 19 shows seven representative \(\delta\) curves, each with a different size of one dupSet (dupSet A). There are three dupSets A, B, C in a segment of size T. For each curve, we iterate over the size of dupSet B; dupSet C’s size is then \((T-|A|-|B|)\). We observe that (1) when the remaining two dupSets (B and C) have exactly the same size, \(\delta\) is the lowest on that curve; and (2) when all three dupSets have the same size, \(\delta\) is the global minimum across all curves. That is to say, the more equal the dupSet sizes are, the lower the estimation error.

Fig. 19 \(\delta\) of a three-dupSet segment with different dupSet sizes

Fig. 20 \(\delta\) of a three-dupSet segment with different redundant samples

[Case 2.2] Redundantly fully sampled

All dupSets are sampled and some of them have more than one sample, i.e., \(\kappa =n+\varepsilon\). We have:

$$\begin{aligned} EDR(Seg)=1-\frac{ 1 }{ \kappa } \sum _{ i\in { { \mathbb {B} } } }^{ }{ \left(\frac{ h_{ Smp_{ i } } }{ h_{ Seg_{ i } } } \right) } =1-\frac{ 1 }{ n+\varepsilon } \left(\sum _{ i\in {\mathbb {B}} }^{ }{ \frac{ 1+\varepsilon _{ i } }{ d_{ i } } } \right). \end{aligned}$$
(12)

The estimation error is:

$$\begin{aligned} \delta =\left| EDR(Seg)-DR(Seg)\right| =\left| \frac{ 1 }{ \bar{ d } } -\frac{ 1 }{ n+\varepsilon } \left(\sum _{ i\in {\mathbb {B}} }^{ }{ \frac{ 1+\varepsilon _{ i } }{ d_{ i } } } \right)\right| . \end{aligned}$$
(13)

Here we introduce a superset \(D_{Sup}\) of D, i.e., \(D_{Sup}=\{d_i|x_j \in {\mathbb {B}}, {x_j}\in d_i\}\), which contains one copy of \(d_i\) for every sample drawn from it. For example, in Fig. 18, Case 2, \(D=\{1,2,3,4\}\) and \(D_{Sup}=\{1,2,3,3,4\}\). We can therefore use the harmonic mean of \(D_{Sup}\) to represent the second part of Eq. 13. We further use \(\Delta (D_{Sup}) = \sum _{ i \in {\mathbb {B}} }^{ }{ (\varepsilon _i \cdot d_i })\) to represent the total size of the dupSets that the redundant samples are associated with, where \(\varepsilon _i\) is the number of redundant samples from \(d_i\); in Fig. 18, Case 2, \(\Delta (D_{Sup}) = 1 \times 3 =3\). Eq. 13 then becomes:

$$\begin{aligned} \delta =\left| \frac{ 1 }{ A(D) } -\frac{ 1 }{ H(D_{Sup}) } \right| \end{aligned}$$
$$\begin{aligned} & = \left| \left[\frac{ 1 }{ A(D_{Sup}) } -\frac{ 1 }{ H(D_{Sup}) }\right] +\left[\frac{ 1 }{ A(D) }-\frac{ 1 }{ A(D_{Sup}) } \right]\right| \nonumber \\ & =\left| \left[\frac{ 1 }{ A(D_{Sup}) } -\frac{ 1 }{ H(D_{Sup}) }\right] +\left[\frac{ n }{ T }-\frac{ n+\varepsilon }{ T+\Delta (D_{Sup})}\right] \right| . \end{aligned}$$
(14)

As shown in Eq. 14, the estimation error in the non-equal case comes from two parts: (1) the difference between the arithmetic and harmonic means (as in Eq. 11); and (2) the numbers of picked samples, which are not necessarily proportional to the corresponding dupSets’ sizes (i.e., the dupSets’ weights in the segment). To further investigate \(\delta\), Fig. 20 shows the relationship between \(\delta\) and the number and distribution of redundant samples in a three-non-equal-dupSet segment example. In this experiment, \(|dupSetA|=10\%T\), \(|dupSetB|=30\%T\), and \(|dupSetC|=60\%T\). We increase the number of samples from each dupSet in different orders. For example, the curve labeled “10, 30, 60” means that each dupSet starts with one sample; we then keep adding one more sample to the \(10\%T\) dupSet until it is fully indexed, and repeat for the \(30\%T\) and \(60\%T\) dupSets until the entire segment is fully indexed. We observe that picking many samples from the \(10\%T\) dupSet at the beginning yields the worst error: the more samples taken from a dupSet, the more weight that dupSet carries in the final estimation. The estimate is more accurate when the numbers of picked samples are proportional to the dupSet sizes, or when samples come from a dupSet that occupies a relatively large fraction of the segment. We also conducted experiments that pick samples randomly or following other distributions, and consistently reached the same conclusion; we therefore show only part of the results to demonstrate the bounds of \(\delta\).
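
The observations about Fig. 20 can be reproduced numerically. In this sketch (ours; the sizes follow the 10/30/60 example with \(T=100\)), EDR is computed from Eq. 12 for different sample allocations:

    def edr(dup_sizes, samples):
        # Eq. 12: EDR = 1 - (1/kappa) * sum(kappa_i / d_i), kappa_i >= 1
        kappa = sum(samples)
        return 1 - sum(k / d for k, d in zip(samples, dup_sizes)) / kappa

    def dr(dup_sizes):
        # true duplication ratio: DR = 1 - n / T
        return 1 - len(dup_sizes) / sum(dup_sizes)

    sizes = [10, 30, 60]                            # |A|, |B|, |C|, T = 100
    print(abs(edr(sizes, [1, 1, 1]) - dr(sizes)))   # exact fully sampled: 0.02
    print(abs(edr(sizes, [8, 1, 1]) - dr(sizes)))   # skewed to the small dupSet: 0.055
    print(abs(edr(sizes, [1, 3, 6]) - dr(sizes)))   # proportional samples: 0.0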

[Case 3] Partially sampled, equal dupSet size

Let \(T_S\) be the sampled part of a segment, \(n'\) be the number of sampled dupSets, and \(\varepsilon '\) be the total number of redundant samples from those sampled dupSets. We have \(n'+\varepsilon '=\kappa\). The EDR of the segment is then:

$$\begin{aligned} EDR(Seg)=1-\frac{ 1 }{ \kappa } \sum _{ i\in { { { \mathbb {B} } } } }^{ }{ \left(\frac{ h_{ Smp_{ i } } }{ h_{ Seg_{ i } } } \right) } \end{aligned}$$
$$\begin{aligned} =1-\frac{ 1 }{ n'+\varepsilon ' } \left(\sum _{ i\in { \mathbb {B}(T_{ S }) } }^{ }{ \frac{ 1 }{ d_{ i } } } +\sum _{ i\in { \mathbb {B}(T_{ S }) } }^{ }{ \frac{ \varepsilon '_{ i } }{ d_{ i } } } \right) \end{aligned}$$
(15)
$$\begin{aligned} =1-\frac{ 1 }{ n'+\varepsilon ' } \left(\frac{ n' }{ d } +\frac{ \varepsilon ' }{ d } \right) =1-\frac{ 1 }{ d } = DR(Seg). \end{aligned}$$
(16)

Therefore, as in Case 1, EDR reflects DR(Seg) with 100% accuracy in Case 3:

$$\begin{aligned} \delta =\left| EDR(Seg)-DR(Seg)\right| =0. \end{aligned}$$
(17)

[Case 4] Partially sampled, non-equal dupSet size

Similar to Case 2, the \(\delta\) can be calculated as:

$$\begin{aligned} \delta =\left| EDR(Seg)-DR(Seg) \right| =\left| \frac{ 1 }{ \bar{ d } } -\frac{ 1 }{ n'+\varepsilon ' } \left(\sum _{ i\in {\mathbb {B}} }^{ }{ \frac{ 1+\varepsilon '_{ i } }{ d_{ i } } } \right) \right| . \end{aligned}$$
(18)

Eq. 18 can be further divided into two sub-cases:

[Case 4.1] Exact partially sampled

To differentiate from the fully sampled case, we use the notation \({\mathbb {B}}(T_S)\) to represent an estimation base whose samples only cover the sampled part \(T_S\) of segment T (\(T_S \subset T\)). Similar to Case 2.1, let \({ D }^{ T_{ S } }=\{d_i|i \in {\mathbb {B}}(T_S)\}\) be the set of sampled dupSets. Since each sampled dupSet has exactly one sample, \(\varepsilon '=0\), and:

$$\begin{aligned} \delta =\left| \frac{ 1 }{ A(D) } -\frac{ 1 }{ H({ D }^{ T_{ S } }) } \right| \end{aligned}$$
$$\begin{aligned} =\left| \left[\frac{ 1 }{ A({ D }^{ T_{ S } }) } -\frac{ 1 }{ H({ D }^{ T_{ S } }) } \right]+\left[\frac{ 1 }{ A(D) } -\frac{ 1 }{ A({ D }^{ T_{ S } }) } \right] \right| . \end{aligned}$$
(19)

[Case 4.2] Redundantly partially sampled

Similar to Case 2.2, let \({ { D } }_{ Sup }^{ T_{ S } }=\{d_i|x_j \in {\mathbb {B}}(T_S), {x_j}\in d_i\}\) be the superset of \({ D }^{ T_{ S } }\). For the sampled dupSets there exist redundant samples, i.e., \(\varepsilon '\ne 0\); as a result we have:

$$\begin{aligned} \delta =\left| \frac{ 1 }{ A(D) } -\frac{ 1 }{ H({ { D } }_{ Sup }^{ T_{ S } }) } \right| \end{aligned}$$
$$\begin{aligned} =\left| \left[\frac{ 1 }{ A({ { D } }_{ Sup }^{ T_{ S } }) } -\frac{ 1 }{ H({ { D } }_{ Sup }^{ T_{ S } }) } \right]+\left[\frac{ 1 }{ A(D) } -\frac{ 1 }{ A({ { D } }_{ Sup }^{ T_{ S } }) } \right] \right| . \end{aligned}$$
(20)

Eqs. 19 and 20 show that the estimation error in the partially sampled case comes from two sources: the difference between the arithmetic and harmonic means, and the use of a partial set of dupSets to estimate the entire segment. The former has already been explained in Case 2; the latter simply depends on the coverage of \({\mathbb {B}}(T_S)\) over the entire segment.
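
Putting the four cases together, the following Python sketch (our illustration of the estimator behind Eqs. 7, 12 and 15, with assumed names) computes EDR directly from a random sample base \({\mathbb {B}}\) by comparing each sampled value’s frequency among the samples (\(h_{Smp_i}\)) to its frequency in the segment (\(h_{Seg_i}\)):

    import random
    from collections import Counter

    def edr_from_samples(segment, kappa):
        # EDR(Seg) = 1 - (1/kappa) * sum over sampled dupSets of h_Smp_i / h_Seg_i
        samples = random.sample(segment, kappa)
        h_smp, h_seg = Counter(samples), Counter(segment)
        return 1 - sum(h_smp[v] / h_seg[v] for v in h_smp) / kappa

    seg = [1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4]   # Fig. 18, Case 2
    true_dr = 1 - len(set(seg)) / len(seg)                   # 0.75
    print(true_dr, edr_from_samples(seg, kappa=4))           # estimate varies per Cases 1-4

As the four cases predict, the estimate is exact whenever the dupSet sizes are equal or the samples land proportionally to dupSet sizes, and otherwise deviates by the arithmetic/harmonic-mean gap analyzed above.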


Cite this article

Yang, Z., Wang, Y., Bhimani, J. et al. EAD: elasticity aware deduplication manager for datacenters with multi-tier storage systems. Cluster Comput 21, 1561–1579 (2018). https://doi.org/10.1007/s10586-018-2141-z
