The Azure Fabric Controller (FC) is the service that monitors, maintains and provisions the machines that host the applications we (the developers) create and store in the Microsoft cloud.
Previously I’ve helped define the word ‘fabric’ and, specifically, discussed the details of the Azure Fabric and the Development Fabric. The Azure Fabric Controller is responsible for managing all the nodes and edges in the Azure Fabric, which essentially means servers (both provisioned and not), load balancers (usually hardware balancers), power-on automation devices, switches, routers, etc.
The Fabric Controller manages different devices in different ways. For example, hardware load balancers are supported through a driver model. Each balancer could be a different hardware type, from a different vendor, etc. Azure abstracts the communication by exposing each balancer through a custom driver for that specific model. However, the way it manages powered-on servers is slightly different.
There is a special service that runs on all powered-on servers/instances and the Fabric Controller communicates with the server via this service. The service tracks two things: the ‘current state’ of the server and the ‘goal state’. A goal might be to run a worker instance, or it might also be to remain idle as part of the free inventory. The current state might be something like ‘initialising’ or ‘idle’. The Fabric Controller and the local service can then manage how the system gets to the goal state from the current state.
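As a rough mental model, the current-state/goal-state mechanism above can be sketched as a small reconciliation loop. Everything here is hypothetical and invented for illustration; the class and state names are not real Azure APIs.

```python
# Hypothetical sketch of the current-state/goal-state model described above.
# State names and transitions are illustrative assumptions, not Azure's own.

# Assumed ordered transitions a node walks through to reach a goal.
TRANSITIONS = {
    ("idle", "running"): "initialising",      # leave the free inventory
    ("initialising", "running"): "running",   # finish provisioning the instance
}

class NodeAgent:
    """Stand-in for the special service that runs on each powered-on server."""

    def __init__(self):
        self.current_state = "idle"   # e.g. part of the free inventory
        self.goal_state = "idle"      # a goal might also be 'remain idle'

    def set_goal(self, goal):
        # The Fabric Controller pushes a new goal state down to the node.
        self.goal_state = goal

    def reconcile(self):
        # Step towards the goal state one transition at a time.
        while self.current_state != self.goal_state:
            self.current_state = TRANSITIONS[(self.current_state, self.goal_state)]

agent = NodeAgent()
agent.set_goal("running")   # e.g. "run a worker instance"
agent.reconcile()
print(agent.current_state)  # running
```

The useful property of this design is that the controller only ever states *what* it wants; the node-local service owns *how* to get there from wherever it currently is.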
When an error occurs, the service detects the fault and changes the current state accordingly (to something like ‘faulted’). Once again the Fabric Controller and the service can manage what’s required to get back to the goal state. This might mean a reboot, or perhaps reprovisioning the whole server. The Fabric Controller can also take alternative action, like provisioning another resource to host your instance.
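The escalating recovery path described here (reboot, then reprovision, then move the instance elsewhere) can be sketched as a toy function. The action names and the `recovers_after` field are purely illustrative assumptions; this is not how Azure's internals are actually written.

```python
# Illustrative sketch of escalating recovery from a 'faulted' current state.
# All names here are hypothetical, not real Azure APIs.

# Assumed order of recovery actions, cheapest first.
RECOVERY_ACTIONS = ["reboot", "reprovision", "relocate_instance"]

def recover(node):
    """Try each recovery action in turn until the node reports healthy."""
    for action in RECOVERY_ACTIONS:
        node["history"].append(action)
        if node["recovers_after"] == action:
            node["state"] = "idle"    # back in inventory; goal state reachable again
            return action
    # Repeated failures: pull the server from the fabric entirely.
    node["state"] = "inoperable"
    return None

node = {"state": "faulted", "recovers_after": "reprovision", "history": []}
print(recover(node))    # reprovision
print(node["history"])  # ['reboot', 'reprovision']
```

Note how the failure history accumulates on the node record: that is exactly the kind of data that lets repeated fault patterns be spotted and a server be marked ‘inoperable’, as described next.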
This mechanism is quite useful in that repeated patterns of failure or hardware faults can be easily identified and a server can be marked as ‘inoperable’.
One of the key roles of the Fabric Controller is to provision resources based on the needs of the applications written by the developer. To manage this it has a declarative service model that defines exactly what is needed by the application. This model covers things like what roles the application performs and how those roles communicate, what operating system requirements there are (does it need IIS for example), how much CPU is needed, bandwidth required, etc. It can even specify what guest operating system to use, and if a dedicated box is required or if virtualisation is enough.
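To make the shape of such a declarative model concrete, here is a hand-written rendering of the kinds of facts it might capture. Every field name below is invented for illustration; the real model is expressed in Azure's own service definition format, not in Python.

```python
# Hypothetical rendering of a declarative service model.
# Field names are illustrative assumptions only, not Azure's schema.
service_model = {
    "service": "my-photo-app",
    "roles": [
        {
            "name": "web-frontend",
            "kind": "web",                  # needs IIS to serve HTTP
            "requires": {"iis": True, "cpu_cores": 1, "bandwidth_mbps": 100},
            "talks_to": ["image-resizer"],  # how the roles communicate
        },
        {
            "name": "image-resizer",
            "kind": "worker",
            "requires": {"iis": False, "cpu_cores": 2, "bandwidth_mbps": 10},
            "talks_to": [],
        },
    ],
    "guest_os": "windows-server-2008-x64",  # can even pin the guest OS
    "dedicated_hardware": False,            # virtualisation is enough here
}
```

The point is that nothing here is imperative: the developer states requirements, and the Fabric Controller decides how to satisfy them.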
There is also some redundancy tolerance at the provisioning level, referred to as ‘fault domains’ and ‘update domains’. For example, you can specify that a particular application be distributed over 3 fault domains, meaning that your application will be located in different parts of the fabric such that server or switch failures will only bring down 1 instance. The Fabric Controller can model a certain amount of risk to sections of the fabric based on areas of single point of failure, and it uses these statistics when deploying your application into the fabric. This also applies to ‘update domains’ which essentially ensure that system updates that take services offline will affect your application 1 piece at a time, meaning you can ensure continuous availability.
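A toy placement function makes the fault-domain guarantee easy to see: spread the instances so no two share a domain, and a single server or switch failure costs at most one instance. This is a simplification under invented names, not the Fabric Controller's actual algorithm (which, as noted above, also models risk statistics across the fabric).

```python
# Toy sketch of the 'fault domain' guarantee. Names are illustrative only.

def place(instances, fault_domains):
    """Assign each instance its own fault domain (assumes enough domains)."""
    if len(fault_domains) < len(instances):
        raise ValueError("not enough fault domains for the requested spread")
    return dict(zip(instances, fault_domains))

def survivors(placement, failed_domain):
    """Instances still up after an entire fault domain goes dark."""
    return [i for i, d in placement.items() if d != failed_domain]

p = place(["web_0", "web_1", "web_2"], ["fd-1", "fd-2", "fd-3"])
print(survivors(p, "fd-2"))  # ['web_0', 'web_2'] -- one failure, one instance lost
```

Update domains work the same way in reverse: instead of asking "what survives a failure?", the question is "what can we deliberately take offline at once?".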
When it comes time to provision a resource for your instance (that is, 1 instance of 1 role), the Fabric Controller will examine those specified requirements and look through its inventory (the fabric) for a resource that matches. It then changes the goal state of the node and the provisioning process begins.
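Tying the two ideas together, the match-then-flip-goal-state step might look something like the sketch below. Again, the inventory records and field names are assumptions made up for illustration.

```python
# Hypothetical sketch of matching a role's requirements against the free
# inventory, then flipping the chosen node's goal state to begin provisioning.
inventory = [
    {"id": "node-17", "state": "idle", "goal": "idle", "ram_gb": 1.0, "disk_gb": 250},
    {"id": "node-42", "state": "idle", "goal": "idle", "ram_gb": 2.0, "disk_gb": 500},
]

def provision(requirements, inventory):
    for node in inventory:
        if (node["state"] == "idle"
                and node["ram_gb"] >= requirements["ram_gb"]
                and node["disk_gb"] >= requirements["disk_gb"]):
            node["goal"] = "running"   # provisioning starts from this state change
            return node["id"]
    return None                        # no matching resource in the fabric

print(provision({"ram_gb": 1.7, "disk_gb": 250}, inventory))  # node-42
```

Notice that the controller never provisions anything directly here; it only changes a goal state, and the node-local service does the rest.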
At this point you might be saying: “Hey, I can’t configure any of this stuff right now, you liar!”. And you’d be right: as developers we don’t yet have this fine-grained control over how the Fabric Controller manages our applications. For the CTP, what we have been given instead is a set of templates, known to you and me as the Web and Worker Roles. On the Azure side, however, these templates are translated into predefined specifications around fault and update domains, software requirements, and machine-level resources. For example, these will always be Windows Server 2008 Enterprise x64, with 1.7GB of RAM and 250GB of disk space. In the future we will see more specific control become available, particularly for organisations who pursue an SLA route with Microsoft. We should see some of this later in the year (2009).
The Fabric Controller itself is highly redundant, with 5 to 7 replicas being available at any given time. The state of all the nodes in the fabric is replicated across all of these replicas to ensure that no matter which Fabric Controller is managing your particular node, its state tracking is 100% up to date. In the event that all Fabric Controller nodes go down, all existing services will still continue to run. However the provisioning and fault tolerance aspects will obviously be offline.
What I find really interesting is that the Fabric Controller replicas are all managed by a miniature version of Azure as well. This means that there is a service definition for the “Azure Fabric Controller” application which is deployed as a set number of instances, and has support for all the same kinds of fault and update domains. A new fabric controller can be provisioned automatically should there be failures, etc.
That’s about all I wanted to cover with the Fabric Controller. There’s a lot more to learn about how Azure manages its infrastructure, especially around the deployment of host images and virtualised guest images. A future post perhaps.