Automatic Software Upgrades for Distributed Systems (PhD thesis)
Upgrading the software of long-lived, highly available distributed systems is difficult. It is not possible to upgrade all the nodes in a system at once, since some nodes may be unavailable and halting the system for an upgrade is unacceptable. Instead, upgrades may happen gradually, and there may be long periods of time when different nodes are running different software versions and need to communicate using incompatible protocols. We present a methodology and infrastructure that address these challenges and make it possible to upgrade distributed systems automatically while limiting service disruption. Our methodology defines how to enable nodes to interoperate across versions, how to preserve the state of a system across upgrades, and how to schedule an upgrade so as to limit service disruption. The approach is modular: defining an upgrade requires understanding only the new software and the version it replaces. The upgrade infrastructure is a generic platform for distributing and installing software while enabling nodes to interoperate across versions. The infrastructure requires no access to the system source code and is transparent: node software is unaware that different versions even exist. We have implemented a prototype of the infrastructure called Upstart that intercepts socket communication using a dynamically-linked C++ library. Experiments show that Upstart has low overhead and works well for both local-area and Internet systems.
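To make the idea of transparent cross-version interoperation concrete, the following is a minimal sketch, not Upstart's actual mechanism. It assumes a hypothetical key-value node whose version 2 renamed the version-1 operations and fields; the names `NodeV2`, `InterceptingLayer`, and the message formats are all invented for illustration. The layer sits between the network and the node and translates old-version requests and replies, so the node code handles only its own version. Upstart does this at the socket level from a C++ library; the sketch works in-process in Python purely for brevity.

```python
from typing import Dict, Tuple


class NodeV2:
    """Hypothetical node software at version 2; it knows nothing about version 1."""

    def __init__(self) -> None:
        self.store: Dict[str, str] = {}

    def handle(self, op: str, payload: dict) -> dict:
        if op == "put":
            self.store[payload["key"]] = payload["value"]
            return {"ok": True}
        if op == "get":
            return {"ok": True, "value": self.store.get(payload["key"])}
        return {"ok": False, "error": "unknown op " + op}


def translate_v1_request(op: str, payload: dict) -> Tuple[str, dict]:
    """Assumed difference: v1 used 'write'/'read' with fields 'k' and 'v'."""
    rename = {"write": "put", "read": "get"}
    new_payload = {"key": payload["k"]}
    if "v" in payload:
        new_payload["value"] = payload["v"]
    return rename.get(op, op), new_payload


def translate_v1_reply(reply: dict) -> dict:
    """Assumed difference: v1 callers expect a 'status' string and a 'v' field."""
    out = {"status": "OK" if reply["ok"] else "ERR"}
    if "value" in reply:
        out["v"] = reply["value"]
    return out


class InterceptingLayer:
    """Stands in for the interception layer: it inspects the sender's version
    and translates, so the wrapped node only ever handles version-2 calls."""

    CURRENT_VERSION = 2

    def __init__(self, node: NodeV2) -> None:
        self.node = node

    def deliver(self, version: int, op: str, payload: dict) -> dict:
        if version == self.CURRENT_VERSION:
            return self.node.handle(op, payload)
        if version == 1:
            new_op, new_payload = translate_v1_request(op, payload)
            return translate_v1_reply(self.node.handle(new_op, new_payload))
        raise ValueError("no translation defined for version %d" % version)


if __name__ == "__main__":
    layer = InterceptingLayer(NodeV2())
    print(layer.deliver(1, "write", {"k": "a", "v": "1"}))  # old-version client
    print(layer.deliver(2, "get", {"key": "a"}))            # current-version client
```

Note that the translation functions need to know only two versions, the current one and the one it replaces, which mirrors the modularity claim above.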