ABSTRACT
Distributed storage systems often triplicate data to reduce the risk of permanent data loss, tolerating any two simultaneous disk failures at the cost of 2/3 of the raw capacity. To reduce this cost, some systems employ erasure coding instead. This optimization, however, is usually applied only to cold data, because erasure coding can hinder the performance of warm data.
We propose RAIDP---a new point in the distributed storage design space between replication and erasure coding. RAIDP maintains only two replicas, rather than three or more. It increases durability by utilizing small disk "add-ons" for storing intra-disk erasure codes that are local to the server but fail independently from the disk. By carefully laying out the data, the add-ons allow RAIDP to recover from simultaneous disk failures (add-ons can be stacked to withstand an arbitrary number of failures). RAIDP retains many of the benefits of replication, trading off some performance and availability for substantially reduced storage requirements, networking overheads, and their related costs. We implement RAIDP in HDFS, which triplicates by default. We show that baseline RAIDP achieves performance close to that of HDFS with only two replicas, and that an update-oriented variant performs within 21% of the default triplicating HDFS, while halving the storage and networking overheads and providing similar durability.
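To illustrate the idea behind the intra-disk codes, the following is a minimal sketch (not RAIDP's actual on-disk layout or code construction) assuming the simplest possible add-on content: a single XOR parity computed over the blocks stored on one disk. If that disk fails and the remote replica of one of its blocks is simultaneously lost, the block can still be rebuilt from the add-on parity plus the surviving replicas of the disk's other blocks:

```python
from functools import reduce

def xor(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length buffers."""
    return bytes(x ^ y for x, y in zip(a, b))

# Blocks stored on one disk; the add-on keeps their running XOR parity.
blocks = [bytes([i] * 4) for i in range(1, 5)]
addon_parity = reduce(xor, blocks)

# Suppose the disk fails AND the remote replica of blocks[2] is also lost.
# The other blocks' replicas survive on other servers; combined with the
# add-on parity (which fails independently of the disk), the missing
# block is reconstructed by XORing the parity with the survivors.
surviving = blocks[:2] + blocks[3:]
recovered = reduce(xor, surviving, addon_parity)
assert recovered == blocks[2]
```

The "careful layout" mentioned above matters because this recovery needs the other blocks of the failed disk to be readable from their replicas elsewhere; RAIDP's placement ensures those replicas land on distinct disks.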
Index Terms
- RAIDP: replication with intra-disk parity