
I/O Deduplication: Utilizing content similarity to improve I/O performance

Published: 28 September 2010

Abstract

Duplication of data in storage systems is becoming increasingly common. We introduce I/O Deduplication, a storage optimization that uses content similarity to improve I/O performance by eliminating I/O operations and reducing the mechanical delays during I/O operations. I/O Deduplication consists of three main techniques: content-based caching, dynamic replica retrieval, and selective duplication. Each technique is motivated by our observations of I/O workload traces obtained from actively used production storage systems, all of which revealed surprisingly high levels of content similarity in both stored and accessed data. Evaluation of a prototype implementation using these workloads showed an overall improvement in disk I/O performance of 28% to 47%. A further breakdown showed that each of the three techniques contributed significantly to the overall improvement.
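
The abstract names content-based caching without describing its mechanics. As a rough illustration only, the sketch below shows one way a block cache might be indexed by content rather than by location, so that two sectors holding identical data share a single cache entry. Everything here is an assumption made for illustration and is not taken from the article: the class name `ContentAddressedCache`, the use of SHA-256 as the content digest, the LRU eviction policy, the write-around behavior, and the dictionary standing in for a disk.

```python
import hashlib
from collections import OrderedDict


class ContentAddressedCache:
    """Toy block cache keyed by a digest of block content, not by sector.

    Hypothetical sketch: digest choice, LRU eviction, and write-around
    behavior are illustrative assumptions, not the article's design.
    """

    def __init__(self, capacity):
        self.capacity = capacity        # max number of distinct blocks cached
        self.sector_to_digest = {}      # sector number -> content digest
        self.blocks = OrderedDict()     # content digest -> bytes, in LRU order
        self.disk_reads = 0             # reads that reached the backing store

    @staticmethod
    def _digest(data):
        return hashlib.sha256(data).hexdigest()

    def write(self, sector, data, disk):
        # Write-around: update the backing store and the sector mapping,
        # but do not place the block in the cache.
        disk[sector] = data
        self.sector_to_digest[sector] = self._digest(data)

    def read(self, sector, disk):
        digest = self.sector_to_digest.get(sector)
        if digest is not None and digest in self.blocks:
            # Hit: some sector with identical content was cached earlier.
            self.blocks.move_to_end(digest)
            return self.blocks[digest]
        # Miss: fetch from the backing store and cache by content digest.
        self.disk_reads += 1
        data = disk[sector]
        digest = self._digest(data)
        self.sector_to_digest[sector] = digest
        self.blocks[digest] = data
        self.blocks.move_to_end(digest)
        while len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict least recently used
        return data


disk = {}
cache = ContentAddressedCache(capacity=2)
block = b"A" * 4096
cache.write(10, block, disk)
cache.write(99, block, disk)   # same content stored at a different sector
cache.read(10, disk)           # miss: goes to the backing store
cache.read(99, disk)           # hit: served from the entry cached for sector 10
print(cache.disk_reads, len(cache.blocks))
```

In this toy setup the second read is served from the cache even though sector 99 was never read before, which is the kind of eliminated I/O operation the abstract refers to. A sector-indexed cache of the same size would have gone to disk twice and held two copies of the same data.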



            • Published in

              ACM Transactions on Storage, Volume 6, Issue 3
              September 2010
              165 pages
              ISSN: 1553-3077
              EISSN: 1553-3093
              DOI: 10.1145/1837915

              Copyright © 2010 ACM

              Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

              Publisher

              Association for Computing Machinery

              New York, NY, United States

              Publication History

              • Published: 28 September 2010
              • Accepted: 1 June 2010
              • Revised: 1 May 2010
              • Received: 1 April 2010


              Qualifiers

              • research-article
              • Research
              • Refereed
