



Last Friday, 13th March 2009, Azure suffered some pretty catastrophic events. The full details of what happened can be found here on the Windows Azure blog. The key point is that an operating system upgrade caused a bunch of servers to fail.
What’s interesting in that article is this point:
“the Fabric Controller automatically initiated steps to recover affected applications by moving them to different servers. The Fabric Controller is designed to be very cautious about taking broad recovery steps, so it began recovery a few applications at a time”
Fantastic! Why do I think that?
Well, the fabric controller is designed to handle individual server failover. When a server hosting your instance fails, it needs to spin up a new server/instance to replace it, all without you, the user, doing anything. What would happen if, for some reason, the fabric controller thought that EVERY server was unavailable? Would it simultaneously attempt to spin up a replacement server/instance for every single existing server as quickly as possible? What would that do to the controller? What if all the new instances it was spinning up were failing too? Would it flat-line the CPU?
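To make the "a few applications at a time" idea concrete, here is a rough Python sketch of cautious, batched recovery. To be clear, this is not how the Fabric Controller is actually implemented (I have no idea what its internals look like); `recover_in_batches`, `restart`, `batch_size` and `max_failure_ratio` are all names I made up for illustration. The point is simply that restarting in small batches, and stopping when the replacements themselves keep failing, bounds the damage a mass failure can do.

```python
from typing import Callable, Iterable, List


def recover_in_batches(
    failed: Iterable[str],
    restart: Callable[[str], bool],
    batch_size: int = 2,
    max_failure_ratio: float = 0.5,
) -> List[str]:
    """Restart failed instances a few at a time (hypothetical sketch).

    `restart` is an assumed callback that returns True when the
    replacement instance came up healthy. Recovery halts as soon as
    more than `max_failure_ratio` of a batch fails to restart.
    """
    pending = list(failed)
    recovered: List[str] = []
    while pending:
        # Take only the next small batch instead of everything at once.
        batch, pending = pending[:batch_size], pending[batch_size:]
        results = [(name, restart(name)) for name in batch]
        recovered.extend(name for name, ok in results if ok)
        failures = sum(1 for _, ok in results if not ok)
        if failures / len(batch) > max_failure_ratio:
            # The replacements are failing too: stop rather than
            # hammer the controller with restarts that will not work.
            break
    return recovered
```

With a `restart` that always fails and five failed instances, this sketch gives up after the first batch of two rather than attempting all five, which is the kind of caution the quote above describes.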
I think you get the gist. Anyway, this is why we have CTP. Better to discover these issues now rather than at RTM.
Also, Steve Marx posted some information about the event and the communication failures on his part during the downtime. Well done, Steve, for being so honest.










11:52 pm - March 18th, 2009
[...] And That’s Why Azure Is Still CTP… – Steve Nagy [...]