vSphere 6.7 ICM – Topic 9.1 – Explain the vSphere HA architecture

Downtime always costs a company money. VMware helps reduce downtime at each layer:

  1. Component level (NIC teaming, storage multipathing)
  2. Server level (vMotion and DRS)
  3. Storage level (Storage DRS)

Similarly, vSphere HA provides a base level of protection for your virtual machines by restarting them in the event of a host failure. vSphere HA is configured on a cluster of multiple ESXi hosts to provide quick recovery from an outage. It is a cost-effective high-availability solution for applications running in virtual machines. vSphere HA protects against:

  • Host failure
  • Datastore accessibility issues
  • Network isolation of a host and its virtual machines
  • Application failure

Hosts in the cluster are monitored, and in the event of a failure, the virtual machines on the failed host are restarted on alternate hosts within the cluster.

When you create a vSphere HA cluster, a single host is automatically elected as the master host. The master host communicates with vCenter Server and monitors the state of all protected virtual machines and of the subordinate hosts. Different types of host failure are possible, and the master host must detect each one and deal with it appropriately. In particular, the master host must distinguish between a failed host and one that is in a network partition or has become network isolated. The master host uses network and datastore heartbeating to determine the type of failure.

Master and Subordinate Hosts

When you add a host to a vSphere HA cluster, an agent called the Fault Domain Manager (FDM) is uploaded to the host and configured to communicate with the other agents in the cluster. Each host in the cluster functions as either a master host or a subordinate host. After the FDM agents have started, the cluster hosts are said to be in a fault domain. Hosts cannot participate in a fault domain if they are in maintenance mode, in standby mode, or disconnected from vCenter Server.

As discussed above, when vSphere HA is enabled for a cluster, all active hosts participate in an election to choose the cluster’s master host. The host that mounts the greatest number of datastores has an advantage in the election. If more than one host sees the same number of datastores, the election process determines the master host by using the host managed object ID (MOID) assigned by vCenter Server.

If the master host fails, is shut down, is put in standby mode, or is removed from the cluster, a new election is held.
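The election rule described above can be sketched in a few lines of Python. This is only an illustration, not FDM's actual implementation: the `Host` class, its field names, and the `elect_master` function are hypothetical, and the tie-break shown (the lexically highest MOID wins) is an assumption about how the MOID comparison is applied.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Host:
    """Hypothetical model of a cluster host for illustration only."""
    moid: str               # managed object ID assigned by vCenter Server
    datastores: int         # number of datastores the host mounts
    eligible: bool = True   # False if in maintenance mode, standby mode, or disconnected


def elect_master(hosts: Iterable[Host]) -> Optional[Host]:
    """Pick the host with the most datastores; break ties by MOID.

    Assumption: ties are resolved by comparing MOIDs as strings and
    taking the highest one.
    """
    candidates = [h for h in hosts if h.eligible]
    if not candidates:
        return None
    return max(candidates, key=lambda h: (h.datastores, h.moid))


cluster = [
    Host("host-99", datastores=4),
    Host("host-100", datastores=4),
    Host("host-7", datastores=5, eligible=False),  # in maintenance mode
]
master = elect_master(cluster)
```

Here `host-7` is skipped because it cannot participate in the fault domain, and the remaining two hosts tie on datastore count, so the string comparison of their MOIDs decides the result. Calling `elect_master` again on the remaining hosts models the new election that follows the loss of a master.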

The master host in a cluster has several responsibilities:

  • Monitoring the state of subordinate hosts. If a subordinate host fails or becomes unreachable, the master host identifies which virtual machines must be restarted.

  • Monitoring the power state of all protected virtual machines. If one virtual machine fails, the master host ensures that it is restarted. Using a local placement engine, the master host also determines where the restart takes place.

  • Managing the lists of cluster hosts and protected virtual machines.

  • Acting as the vCenter Server management interface to the cluster and reporting the cluster health state.

The subordinate hosts primarily contribute to the cluster by running virtual machines locally, monitoring their runtime states, and reporting state updates to the master host. A master host can also run and monitor virtual machines.

The master host is responsible for orchestrating restarts of protected virtual machines. A virtual machine is protected by a master host after vCenter Server observes that the virtual machine’s power state has changed from powered off to powered on in response to a user action. The master host persists the list of protected virtual machines in the cluster’s datastores. A newly elected master host uses this information to determine which virtual machines to protect.

Network Heartbeats

The master host periodically sends heartbeats to the subordinate hosts so that they know the master is alive. Subordinate hosts communicate with the master over the management network. If a subordinate host does not respond within the predefined timeout period, the master host declares it agent unreachable. When a subordinate host is not responding, the master host attempts to determine the cause of its inability to respond.
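The timeout check can be pictured as simple bookkeeping on the master: remember when each host was last heard from, and flag any host that has been silent for longer than the timeout. The sketch below is illustrative only. `HeartbeatMonitor` and its methods are hypothetical names, and the 15-second timeout in the example is an arbitrary value chosen for the illustration, not a documented vSphere HA setting.

```python
class HeartbeatMonitor:
    """Hypothetical tracker for the last heartbeat seen from each host."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        self.last_seen: dict[str, float] = {}

    def record(self, host: str, now: float) -> None:
        """Note that a heartbeat response arrived from `host` at time `now`."""
        self.last_seen[host] = now

    def unreachable(self, now: float) -> list[str]:
        """Return hosts whose last heartbeat is older than the timeout."""
        return sorted(
            host
            for host, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


monitor = HeartbeatMonitor(timeout_s=15.0)  # timeout value is an assumption
monitor.record("esx-a", now=0.0)
monitor.record("esx-b", now=10.0)
silent_hosts = monitor.unreachable(now=20.0)
```

A host that lands in `silent_hosts` is only a candidate for "agent unreachable"; the master still has to work out why it stopped responding before deciding what to do.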

Datastore Heartbeats

Datastore heartbeats are used to distinguish a failed host from an isolated or partitioned host. vSphere HA tries to restart virtual machines only in one of these situations:

  • A host has failed (no network heartbeats, no ping, no datastore heartbeats).
  • A host becomes isolated and the cluster’s configured host isolation response is Power off or Shut down.
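The decision logic above can be written out as a small classifier. This is a simplified sketch under stated assumptions rather than the real FDM logic: the `HostState` enum, both function names, the `reports_isolated` flag (standing in for however a host signals that it has declared itself isolated), and the isolation response labels `"power_off"`, `"shut_down"`, and `"disabled"` are all hypothetical.

```python
from enum import Enum


class HostState(Enum):
    LIVE = "live"
    FAILED = "failed"
    ISOLATED = "isolated"
    PARTITIONED = "partitioned"


def classify_host(
    network_heartbeat: bool,
    responds_to_ping: bool,
    datastore_heartbeat: bool,
    reports_isolated: bool = False,
) -> HostState:
    """Classify a host from the signals the master can observe."""
    if network_heartbeat:
        return HostState.LIVE
    # No network heartbeats, no ping, no datastore heartbeats: the host is down.
    if not responds_to_ping and not datastore_heartbeat:
        return HostState.FAILED
    # The host still shows signs of life, so it is running but cut off.
    if reports_isolated:
        return HostState.ISOLATED
    return HostState.PARTITIONED


def should_restart_vms(state: HostState, isolation_response: str) -> bool:
    """Restart only for a failed host, or an isolated host with an active response."""
    if state is HostState.FAILED:
        return True
    if state is HostState.ISOLATED:
        return isolation_response in {"power_off", "shut_down"}
    return False
```

For example, a host with no network heartbeat that is still writing datastore heartbeats is never classified as failed, so its virtual machines are left running unless it is isolated and the isolation response calls for powering them off or shutting them down.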

Virtual Machine Component Protection

VMCP provides protection against datastore accessibility failures that can affect a virtual machine running on a host in a vSphere HA cluster. When a datastore accessibility failure occurs, the affected host can no longer access the storage path for a specific datastore. You can determine the response that vSphere HA will make to such a failure, ranging from the creation of event alarms to virtual machine restarts on other hosts.
VMCP can be enabled only on vSphere HA clusters whose hosts all run ESXi 6.0 or later. Clusters that contain hosts from an earlier release cannot enable VMCP, and such hosts cannot be added to a cluster that has VMCP enabled.

Proactive HA Failures

A Proactive HA failure occurs when a host component fails, which results in a loss of redundancy or a noncatastrophic failure. However, the functional behavior of the VMs residing on the host is not yet affected. For example, if a power supply on the host fails, but other power supplies are available, that is a Proactive HA failure.

If a Proactive HA failure occurs, you can automate the remediation action in the vSphere Availability section of the vSphere Client. The VMs on the affected host can be evacuated to other hosts, and the host is placed in either Quarantine mode or Maintenance mode.