Abstract
Distributed protocols such as 2PC and Paxos lie at the core of many systems in the cloud, but standard implementations do not scale. New scalable distributed protocols are developed through careful analysis and rewrites, but this process is ad hoc and error-prone. This paper presents an approach for scaling any distributed protocol by applying rule-driven rewrites, borrowing from query optimization. Distributed protocol rewrites entail a new burden: reasoning about spatiotemporal correctness. We leverage order-insensitivity and data dependency analysis to systematically identify correct coordination-free scaling opportunities. We apply this analysis to create preconditions and mechanisms for coordination-free decoupling and partitioning, two fundamental vertical and horizontal scaling techniques. Manual rule-driven applications of decoupling and partitioning improve the throughput of 2PC by 5× and Paxos by 3×, and match state-of-the-art throughput in recent work. These results point the way toward automated optimizers for distributed protocols based on correct-by-construction rewrite rules.
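As a rough illustration of the partitioning idea, the sketch below models a simplified 2PC-style vote collector and checks that routing votes to independent partitions by transaction id gives the same commit decisions as a single collector, regardless of message arrival order. This is not the paper's mechanism or code: `NUM_PARTICIPANTS`, `committed`, `partition`, and `committed_partitioned` are hypothetical names, and the model assumes each decision depends only on votes for the same transaction id.

```python
import random
from collections import defaultdict

NUM_PARTICIPANTS = 3  # hypothetical number of 2PC participants


def committed(votes):
    """Return the txn ids for which every participant voted yes.

    Votes are (txn_id, participant, vote) tuples. The result is a set built
    from sets, so it does not depend on the order in which votes arrive.
    """
    yes = defaultdict(set)
    for txn_id, participant, vote in votes:
        if vote == "yes":
            yes[txn_id].add(participant)
    return {t for t, ps in yes.items() if len(ps) == NUM_PARTICIPANTS}


def partition(votes, k):
    """Route each vote to one of k partitions, keyed on txn id."""
    parts = [[] for _ in range(k)]
    for v in votes:
        parts[v[0] % k].append(v)
    return parts


def committed_partitioned(votes, k):
    """Run the collector independently per partition and union the results."""
    result = set()
    for part in partition(votes, k):
        result |= committed(part)
    return result


# Six transactions, all participants vote yes, except one "no" on txn 1.
votes = [(t, p, "yes") for t in range(6) for p in range(NUM_PARTICIPANTS)]
votes[4] = (1, 1, "no")

random.shuffle(votes)  # arrival order should not matter
assert committed(votes) == committed_partitioned(votes, k=3)
```

Partitioning is safe in this toy model because all votes for one transaction land in the same partition, so no partition needs to coordinate with another to reach its decision.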