ABSTRACT
Recent advances in formal verification techniques have enabled the implementation of distributed systems with machine-checked proofs. While the results are encouraging, the importance of distributed systems warrants a large-scale evaluation of the results and verification practices.
This paper thoroughly analyzes three state-of-the-art, formally verified implementations of distributed systems: IronFleet, Verdi, and Chapar. Through code review and testing, we found a total of 16 bugs, many of which produce serious consequences, including crashing servers, returning incorrect results to clients, and invalidating verification guarantees. These bugs were caused by violations of a wide range of assumptions on which the verified components relied. Our results revealed that these assumptions referred to a small fraction of the trusted computing base, mostly at the interface of verified and unverified components. Based on our observations, we have built a testing toolkit called PK, which focuses on testing these parts and is able to automate the detection of 13 (out of 16) bugs.
An Empirical Study on the Correctness of Formally Verified Distributed Systems