ABSTRACT
Distributed systems are difficult to debug and understand. A key reason for this is distributed state, which is not easily accessible and must be pieced together from the states of the individual nodes in the system.
We propose Dinv, an automatic approach to help developers of distributed systems uncover the runtime distributed state properties of their systems. Dinv uses static and dynamic program analyses to infer relations between variables at different nodes. For example, in a leader election algorithm, Dinv can relate the variable leader at different nodes to derive the invariant ∀ nodes i, j: leader_i = leader_j. Such inferred invariants can increase the developer's confidence in the correctness of their system. The developer can also use Dinv to convert an inferred invariant into a runtime assertion on distributed state.
We applied Dinv to several popular distributed systems, such as etcd Raft, Hashicorp Serf, and Taipei-Torrent, which have between 1.7K and 144K LOC and are widely used. Dinv derived useful invariants for these systems, including invariants that capture the correctness of distributed routing strategies, leadership, and key hash distribution. We also used Dinv to assert correctness of the inferred etcd Raft invariants at runtime, using these asserts to detect injected silent bugs.
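To make the leader-agreement invariant concrete, here is a minimal sketch in Go of what checking ∀ nodes i, j: leader_i = leader_j over a set of per-node snapshots might look like. It is an illustration only, not Dinv's API: the `NodeState` type and the `LeadersAgree` function are hypothetical names, and the sketch assumes the snapshots were already collected at a consistent point in the execution.

```go
package main

import "fmt"

// NodeState is a hypothetical snapshot of one node's local state.
// It is an assumption made for illustration, not a Dinv type.
type NodeState struct {
	ID     string // identifier of the node that took the snapshot
	Leader string // the node this node currently believes is the leader
}

// LeadersAgree reports whether leader_i == leader_j holds for every pair
// of nodes. Because equality is transitive, comparing each node against
// the first one is enough. An empty or single-node set trivially agrees.
func LeadersAgree(states []NodeState) bool {
	for i := 1; i < len(states); i++ {
		if states[i].Leader != states[0].Leader {
			return false
		}
	}
	return true
}

func main() {
	consistent := []NodeState{
		{ID: "n1", Leader: "n2"},
		{ID: "n2", Leader: "n2"},
		{ID: "n3", Leader: "n2"},
	}
	split := []NodeState{
		{ID: "n1", Leader: "n2"},
		{ID: "n2", Leader: "n2"},
		{ID: "n3", Leader: "n3"},
	}
	fmt.Println(LeadersAgree(consistent)) // true
	fmt.Println(LeadersAgree(split))      // false
}
```

A runtime assertion built on such a check would flag the `split` case, where one node disagrees about the leader. The difficult part in a real system, which this sketch leaves out, is gathering node states that together form a meaningful distributed state.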