doi:10.1016/j.peva.2005.03.002
Copyright © 2005 Elsevier B.V. All rights reserved.
End-to-end latency of a fault-tolerant CORBA infrastructure
W. Zhao, L.E. Moser and P.M. Melliar-Smith
Department of Electrical and Computer Engineering, University of California, Santa Barbara, CA 93106, USA
Received 18 February 2004; revised 3 March 2005. Available online 24 May 2005.
Abstract
This paper presents an evaluation of the end-to-end latency of a fault-tolerant CORBA infrastructure that we have implemented. The fault-tolerant infrastructure replicates the server applications using active, passive and semi-active replication, and maintains strong replica consistency of the server replicas. By analyses and by measurements of the running fault-tolerant infrastructure, we characterize the end-to-end latency under fault-free conditions. The main determining factor of the run-time performance of the fault-tolerant infrastructure is the Totem group communication protocol, which contributes to the end-to-end latency primarily in two ways: the delay in sending messages and the processing cost of the rotating token.
To reduce the delay in sending messages for passive and semi-active replication, the position of the primary server replica on the Totem ring, the token rotation time, the processing time at the client, and the processing time at the server must be considered. For active replication, the presence of duplicate messages adversely affects the performance. However, if an effective sending-side duplicate suppression mechanism is implemented, active replication is more advantageous than both passive and semi-active replication because of the automatic selection of the most favorable position of the server replica that sends the first non-duplicate reply.
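The dependence of the send delay on ring position can be illustrated with a minimal sketch. It is a toy model under stated assumptions, not the analysis used in this paper: the token moves around the ring at a fixed cost per hop, a node may send only while it holds the token, message transmission and token-processing costs are ignored, and the names `send_delay` and `first_reply_delay` are hypothetical.

```python
def send_delay(hops_from_token, hop_time, ring_size, ready_time):
    """Time a node waits for the token once it is ready to send.

    Assumed toy model: at time 0 the token is `hops_from_token` hops away,
    each hop costs `hop_time`, and the token returns to the node once per
    full rotation (ring_size * hop_time).
    """
    rotation = ring_size * hop_time
    first_visit = hops_from_token * hop_time
    if ready_time <= first_visit:
        return first_visit - ready_time
    # Number of extra full rotations needed before the token arrives
    # at or after ready_time (ceiling division).
    missed = ready_time - first_visit
    extra_rotations = -(-missed // rotation)
    return first_visit + extra_rotations * rotation - ready_time


def first_reply_delay(replica_hops, hop_time, ring_size, ready_time):
    """Send delay of whichever replica gets the token first.

    Models active replication with ideal sending-side duplicate
    suppression: only the earliest reply is sent.
    """
    return min(
        send_delay(h, hop_time, ring_size, ready_time) for h in replica_hops
    )


if __name__ == "__main__":
    # Four-node ring, 10 us per hop, client on node0 releases the token at
    # time 0, server finishes processing 15 us later (all values assumed).
    for hops in (1, 2, 3):
        print(f"primary on node{hops}: send delay {send_delay(hops, 10, 4, 15)} us")
    print("active replication:", first_reply_delay([1, 2, 3], 10, 4, 15), "us")
```

In this toy setting, a primary one hop after the client misses the token and waits almost a full rotation, whereas with active replication the replica that receives the token soonest after finishing its processing sends the reply.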
Keywords: Performance evaluation; End-to-end latency; Fault tolerance; Distributed computing; Client–server computing; Network protocols
Fig. 1. The client and server applications with the supporting software layers and the round-trip path of a synchronous remote invocation. The various terms that contribute to the end-to-end latency are also illustrated.
Fig. 2. The measured pdfs for the end-to-end latency as seen by the client. From top to bottom, the plots are pdfs with the primary server replica running on (a) node1, (b) node2 and (c) node3.
Fig. 3. The measured pdfs for the application-processing latency for both the client and the primary server replica. The plots from top to bottom are for semi-active replication with (a) node1 running the primary server replica, (b) node2 running the primary server replica, and (c) node3 running the primary server replica.
Fig. 4. The measured pdfs for the send delay at the client and the primary server replica. The plots from top to bottom are for semi-active replication with (a) node1 running the primary server replica, (b) node2 running the primary server replica, and (c) node3 running the primary server replica.
Fig. 5. The measured pdfs for the complete token rotation time as seen by node0 to node3 (from top to bottom) for semi-active replication with (a) node1 running the primary server replica, (b) node2 running the primary server replica, and (c) node3 running the primary server replica.
Fig. 6. Probability density functions of the end-to-end latency for active replication, (a) without and (b) with effective sending-side duplicate suppression. The client has zero “think” time in these two measurements.
Fig. 7. (a) For active replication, the measured pdfs for the client “think” time. From top to bottom, the mean “think” time increases from about 100 to 734 μs. (b) The corresponding pdfs for the end-to-end latency, with the sending-side duplicate detection mechanism disabled.
Fig. 8. (a) For active replication, the pdfs for the latencies with different server computation loads. For each run, the server computation load is fixed at one of the values shown on the right-hand vertical axis. (b) To the left, the measured server processing time at the peak probability densities for different computation loads for active replication. The plot for semi-active replication is similar. To the right, the measured processing time for the same set of computation loads on an unloaded node. (c) For active replication, the peak end-to-end latency as a function of the server processing time. (d) For semi-active replication, the peak end-to-end latency as a function of the server processing time.
Fig. 9. A comparison of the end-to-end latency under the following three scenarios (from left to right): (i) replicated server using TCP, (ii) unreplicated server running with the fault-tolerant infrastructure on a two-node Totem ring, and (iii) three-way actively replicated server running with the fault-tolerant infrastructure on a four-node Totem ring.
Fig. 10. For active replication, (a) the run-time overhead of the end-to-end latency and computation overhead at the server, for different server computation loads, and (b) the effective sending node position for the corresponding measurements given in (a) at different server computation loads.
This research has been supported by DARPA/ONR Contract N66001-00-1-8931 and MURI/AFOSR Contract F49620-00-1-0330. An earlier version of this paper won the best paper award at the International Symposium on Performance Evaluation of Computer and Telecommunication Systems [24].

Corresponding author. Present address: Department of Electrical and Computer Engineering, Cleveland State University, Cleveland, OH 44115, USA.