Studying the failure characteristics of
large-scale distributed systems: We have shown that many of the assumptions
made in the evaluation of traditional robustness techniques do NOT
hold in the current Internet, and the effects of these wrong assumptions
are significant.
Modeling the failures: We have proposed
models capturing several important properties of failures experienced by
large-scale distributed systems on the Internet.
Developing techniques to mitigate the adverse
effects of the failures: We have proposed and shown the effectiveness
of several novel techniques to mask the failures.
Publications:
Beyond Availability: Towards a Deeper Understanding of Machine Failure Characteristics in Large Distributed Systems [ps][pdf]
Praveen Yalagandula, Suman Nath, Haifeng Yu, Phillip B. Gibbons, Srinivasan Seshan
First Workshop on Real, Large Distributed Systems (WORLDS '04).