Abstract
Network failures are frequent and disruptive, and can significantly reduce the throughput even in highly connected and regular networks such as datacenters. While many modern networks support some kind of local fast failover to quickly reroute flows encountering link failures to new paths, employing such mechanisms is known to be non-trivial, as conditional failover rules can only depend on local failure information.
Although important insights have been gained in recent years on how to design failover schemes that provide high resiliency, existing approaches share a shortcoming: the resulting failover routes may be unnecessarily long, i.e., they have a large stretch compared to the original route length. This is a serious drawback, as long routes entail higher latencies and introduce additional load, which may cause the rerouted flows to interfere with existing flows and harm throughput.
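To make the notion of stretch concrete, the following is a minimal sketch and not the algorithm of this paper. It assumes a 2D grid with unit-length links and uses a plain breadth-first search, so it measures the shortest detour that remains after a set of link failures. A local failover scheme cannot do better than this globally computed route. The function names (`shortest_hops`, `stretch`) and the choice to report stretch both as a hop difference and as a ratio are illustrative assumptions.

```python
from collections import deque


def grid_neighbors(node, width, height):
    """Yield the grid neighbors of node = (x, y) in a width x height grid."""
    x, y = node
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if 0 <= nx < width and 0 <= ny < height:
            yield (nx, ny)


def shortest_hops(src, dst, width, height, failed_links=frozenset()):
    """BFS hop count from src to dst that avoids failed links.

    failed_links is a set of frozenset({u, v}) pairs (undirected links).
    Returns None if dst is unreachable.
    """
    dist = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return dist[node]
        for nxt in grid_neighbors(node, width, height):
            if nxt in dist or frozenset((node, nxt)) in failed_links:
                continue
            dist[nxt] = dist[node] + 1
            queue.append(nxt)
    return None


def stretch(src, dst, width, height, failed_links):
    """Return (additive, multiplicative) stretch of the best remaining route.

    Assumes src != dst. Returns None if the failures disconnect src from dst.
    """
    if src == dst:
        raise ValueError("stretch is undefined for src == dst")
    original = shortest_hops(src, dst, width, height)
    rerouted = shortest_hops(src, dst, width, height, failed_links)
    if rerouted is None:
        return None
    return rerouted - original, rerouted / original


# Hypothetical example: 3x3 grid, route along the bottom row, first link fails.
failures = {frozenset({(0, 0), (1, 0)})}
print(stretch((0, 0), (2, 0), 3, 3, failures))
```

In this example the original route has 2 hops and the shortest detour has 4, so even the best possible reroute doubles the path length. A failover rule that only sees local failures may take a longer route still, which is the gap that stretch guarantees are meant to bound.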
This paper presents the first deterministic local fast failover algorithms with provable guarantees on both resiliency and failover route length, even in the presence of many concurrent failures. We present stretch-optimal failover algorithms for different network topologies, including multi-dimensional grids, hypercubes and Clos networks, which are frequently deployed in HPC clusters and datacenters. We show that the computed failover routes are optimal in the sense that no failover algorithm can provide shorter paths for a given number of link failures.
Index Terms
- Local Fast Failover Routing With Low Stretch