TY - GEN
T1 - Load balancing in the presence of random node failure and recovery
AU - Dhakal, Sagar
AU - Hayat, Majeed M.
AU - Pezoa, Jorge E.
AU - Abdallah, Chaouki T.
AU - Birdwell, J. Doug
AU - Chiasson, John
PY - 2006
Y1 - 2006
N2 - In many distributed computing systems that are prone to either induced or spontaneous node failures, the number of available computing resources is dynamically changing in a random fashion. A load-balancing (LB) policy for such systems should therefore be robust, in terms of workload re-allocation and effectiveness in task completion, with respect to the random absence and re-emergence of nodes as well as random delays in the transfer of workloads among nodes. In this paper, two LB policies for such computing environments are presented: The first policy takes an initial LB action to preemptively counteract the consequences of random failure and recovery of nodes. The second policy compensates for the occurrence of node failure dynamically by transferring loads only at the actual failure instants. A probabilistic model, based on the concept of regenerative processes, is presented to assess the overall performance of the system under these policies. Optimal performance of both policies is evaluated using analytical, experimental and simulation-based results. The interplay between node-failure/recovery rates and the mean load-transfer delay is highlighted.
AB - In many distributed computing systems that are prone to either induced or spontaneous node failures, the number of available computing resources is dynamically changing in a random fashion. A load-balancing (LB) policy for such systems should therefore be robust, in terms of workload re-allocation and effectiveness in task completion, with respect to the random absence and re-emergence of nodes as well as random delays in the transfer of workloads among nodes. In this paper, two LB policies for such computing environments are presented: The first policy takes an initial LB action to preemptively counteract the consequences of random failure and recovery of nodes. The second policy compensates for the occurrence of node failure dynamically by transferring loads only at the actual failure instants. A probabilistic model, based on the concept of regenerative processes, is presented to assess the overall performance of the system under these policies. Optimal performance of both policies is evaluated using analytical, experimental and simulation-based results. The interplay between node-failure/recovery rates and the mean load-transfer delay is highlighted.
UR - http://www.scopus.com/inward/record.url?scp=33847152657&partnerID=8YFLogxK
U2 - 10.1109/IPDPS.2006.1639293
DO - 10.1109/IPDPS.2006.1639293
M3 - Conference contribution
AN - SCOPUS:33847152657
SN - 1424400546
SN - 9781424400546
T3 - 20th International Parallel and Distributed Processing Symposium, IPDPS 2006
BT - 20th International Parallel and Distributed Processing Symposium, IPDPS 2006
PB - IEEE Computer Society
T2 - 20th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2006
Y2 - 25 April 2006 through 29 April 2006
ER -