Last Friday, 13th March 2009, Azure suffered some pretty catastrophic events. The full details of what happened can be found here on the Windows Azure blog. The key point was that an operating system upgrade caused a bunch of servers to fail.
What’s interesting in that article is this point:
“the Fabric Controller automatically initiated steps to recover affected applications by moving them to different servers. The Fabric Controller is designed to be very cautious about taking broad recovery steps, so it began recovery a few applications at a time”
Fantastic! Why do I think that?
Well, the Fabric Controller is designed to handle individual server failover. When a server hosting your instance fails, it needs to spin up a new server/instance to replace it, all without you, the user, doing anything. What would happen if, for some reason, the Fabric Controller thought that EVERY server was unavailable? Would it simultaneously attempt to spin up a replacement server/instance for every single existing server as quickly as possible? What would that do to the controller? What if all the new instances it was spinning up were failing too? Would it flat-line the CPU?
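To make that concrete, here's a rough sketch in Python of what recovering "a few applications at a time" might look like. This is purely my own illustration and not how the Fabric Controller is actually implemented: `recover_in_batches`, the `restart` callback, `batch_size` and `max_failure_ratio` are all made-up names. The idea is to restart failed instances in small batches and to stop early if most of a batch fails to come back, on the assumption that the problem is systemic and more restarts would only pile on load.

```python
from typing import Callable, Iterable, List


def recover_in_batches(
    failed: Iterable[str],
    restart: Callable[[str], bool],
    batch_size: int = 3,
    max_failure_ratio: float = 0.5,
) -> List[str]:
    """Restart failed instances a few at a time; return the ones recovered.

    Hypothetical sketch: `restart` stands in for whatever actually moves an
    application to a new server, and returns True on success.
    """
    pending = list(failed)
    recovered: List[str] = []

    for start in range(0, len(pending), batch_size):
        batch = pending[start:start + batch_size]

        # Attempt every restart in this batch and record the outcome.
        results = [(name, restart(name)) for name in batch]
        recovered.extend(name for name, ok in results if ok)

        # If most of the batch failed, assume something bigger is wrong
        # and stop instead of hammering the rest of the fleet.
        failures = sum(1 for _, ok in results if not ok)
        if failures / len(batch) > max_failure_ratio:
            break

    return recovered
```

With 100 failed applications and a `restart` that always fails, this makes only three restart attempts before giving up, rather than a hundred at once.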
I think you get the gist. Anyway, this is why we have a CTP. Better to discover these issues now rather than at RTM.