How VMware HA Works | Deep Dive

Overview of HA (High Availability): –

  • When you create an HA cluster for the first time, virtual machines are configured with the cluster default settings:
    • VM Restart Priority
    • Host Isolation Response
    • VM Monitoring
  • A master host is elected when the cluster is first created. All other hosts are slaves.
  • The master host is responsible for monitoring connectivity with the slave hosts.
  • The master host also handles the different failure scenarios that can occur:
    • A host becomes network isolated.
    • A host fails (hardware or other problem).
    • A host loses its connection to the master host.
  • For the host isolation response, there are three options:
    • Leave running (default)
    • Shutdown (requires VMware Tools)
    • Power off

Components of HA: –

  • FDM
  • Hostd
  • vCenter

FDM (Fault Domain Manager): –

  • It communicates host resource information, VM state, and HA properties to the other hosts in the cluster.
  • It also handles the heartbeat mechanism.
  • It handles VM placement and VM restarts.
  • HA has no dependency on DNS; it works with IP addresses. This is an improvement introduced with FDM.
  • FDM talks directly to hostd and vCenter.
  • FDM is not dependent on VPXA.
  • You can check the FDM log, fdm.log, in /var/log/.

HOSTD Agent: –

  • It is an agent that is installed on the ESXi host.
  • It is responsible for many tasks, such as powering on virtual machines.
  • If HOSTD is unavailable or not running, the host will not participate in any FDM-related process.
  • FDM relies on HOSTD for information about the VMs that are registered to the host, and it manages VMs through the HOSTD API.
  • FDM is dependent on HOSTD. If HOSTD is not operational, FDM halts all functions and waits for HOSTD to become operational.

Use of vCenter in HA: –

  • Deploying and configuring the HA agent.
  • Communicating cluster configuration changes.
  • Protecting VMs.
  • Pushing out the FDM agent to the ESXi hosts.
  • Communicating configuration changes in the cluster to the hosts.
  • HA leverages vCenter to retrieve information about the status of VMs.

Fundamental Concepts of HA: –

  • Master/Slave Agent

  • Heart-beating

  • Isolated vs Network Partitioned

  • VM Protection

  • Component Protection

Understand Master Host: –

  • The HA architecture includes the concept of master and slave HA agents.
  • There is only one master in an HA cluster, except during a network partition scenario.
  • The master is responsible for monitoring the health of VMs.
  • It restarts any VM that fails.
  • Slaves pass information to the master.
  • The HA agent also implements the VM/App monitoring feature, which allows it to restart a virtual machine in case of an OS failure or restart a service in case of an application failure.
  • The master agent keeps track of the VMs it is responsible for and takes action when appropriate.
  • The master claims responsibility for a VM by taking ownership of the datastore on which the VM's configuration file is stored.
  • The master is responsible for exchanging state information with vCenter Server.
  • It sends information to and receives information from vCenter when required.
  • The master host initiates the restart of VMs when a host has failed.

What if Master Fails?

An HA election occurs when HA is first enabled on a VMware cluster and whenever the master host:

  • Fails.
  • Becomes network partitioned or isolated.
  • Is disconnected from vCenter Server.
  • Is put into maintenance or standby mode.

The HA election takes about 15 seconds to promote a slave to master. It runs over UDP.

The election is decided by the number of connected datastores: the host connected to the highest number of datastores wins.

If two or more hosts have the same number of datastores, the host with the highest MOID gets preference. The MOIDs are compared lexically, character by character, not numerically. Take two hosts with MOID values 99 and 100: the first character 9 (of 99) is greater than the first character 1 (of 100), so in this example 99 is the larger MOID.
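The election rule above can be sketched in a few lines of Python. This is only an illustration of the comparison logic described here, not how FDM implements it; the `Candidate` type and `elect_master` function are hypothetical names, and the MOID values are the ones from the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    moid: str             # managed object ID, compared as a string
    datastore_count: int  # number of datastores the host is connected to


def elect_master(candidates):
    """Pick the host with the most datastores; break ties on the lexically largest MOID."""
    # Tuples compare element by element: datastore count first, then the MOID string.
    return max(candidates, key=lambda c: (c.datastore_count, c.moid))


hosts = [Candidate("99", 4), Candidate("100", 4), Candidate("7", 3)]
winner = elect_master(hosts)
print(winner.moid)  # 99, because "99" > "100" when compared as strings
```

Comparing the MOIDs as strings rather than integers is what makes 99 beat 100.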

When a master is elected, it tries to acquire ownership of all the datastores that it can access directly, or that it can access by proxying requests to one of the slaves connected to it over the management network.

On a regular storage architecture, it does this by locking a file called "protectedlist".

The naming format and location of this file are as follows.

/<root of datastore>/.vSphere-HA/<cluster-specific-directory>/protectedlist

The cluster-specific directory is named as follows.

<UUID of vCenter Server>-<number part of the MOID of the cluster>-<random 8-character string>-<name of the host running vCenter Server>
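To make the naming scheme concrete, here is a minimal sketch that joins the four parts into a directory name. The function name and all sample values (UUID, MOID number, random string, host name) are made up for illustration; real values come from your own vCenter Server.

```python
def cluster_directory_name(vc_uuid: str, cluster_moid_number: int,
                           random_suffix: str, vc_host_name: str) -> str:
    """Join the four parts of the cluster-specific directory name with hyphens."""
    if len(random_suffix) != 8:
        raise ValueError("the random part is expected to be 8 characters long")
    return f"{vc_uuid}-{cluster_moid_number}-{random_suffix}-{vc_host_name}"


# Hypothetical sample values, shortened for readability.
name = cluster_directory_name("0f1e2d3c", 26, "a1b2c3d4", "vc01")
print(name)  # 0f1e2d3c-26-a1b2c3d4-vc01
```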

Understand Slave Host: –

  • It monitors the state of its virtual machines and informs the master host.
  • It monitors the health of the master by monitoring its heartbeats.
  • It sends heartbeats to the master so that the master can detect an outage.

Local Files for HA: –

When HA is configured on a host, the host stores specific information about its cluster locally.

  • clusterconfig
    • This file is not human readable.
    • It contains the configuration details of the cluster.
  • vmmetadata
    • This file is also not human readable.
    • It contains the compatibility information matrix for every HA-protected VM and lists all the hosts with which each VM is compatible.
    • The metadata includes custom properties, descriptions, tags, owner, cost center, etc. for a virtual machine.
  • fdm.cfg
    • Configuration settings, logging details, and syslog details are stored in this file.
  • hostlist
    • A list of the hosts participating in the cluster, including hostname, IP address, MAC address, and heartbeat datastores.

Understand Heartbeating: –

Heartbeating is the mechanism HA uses to check whether a host is alive.

There are two types of heartbeat.

  1. Network Heartbeat
  2. Datastore Heartbeat

Network Heartbeat: –

  • It is used by HA to determine whether an ESXi host is alive.
  • Slaves send network heartbeats to the master, and the master sends them to the slaves.
  • By default, a heartbeat is sent every second.
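As a rough illustration of how periodic heartbeats let one side notice that another has gone quiet, here is a small tracker sketch. It is not FDM's implementation: the `HeartbeatTracker` class, the host names, and the 3-second timeout are assumptions chosen for the example, and timestamps are passed in explicitly to keep it deterministic.

```python
class HeartbeatTracker:
    """Remember when each host was last heard from and report the silent ones."""

    def __init__(self, timeout_seconds: float):
        self.timeout_seconds = timeout_seconds  # hypothetical threshold, not an HA default
        self.last_seen = {}

    def record(self, host: str, now: float) -> None:
        # Called whenever a network heartbeat arrives from `host`.
        self.last_seen[host] = now

    def silent_hosts(self, now: float) -> list:
        # Hosts whose last heartbeat is older than the timeout.
        return sorted(h for h, t in self.last_seen.items()
                      if now - t > self.timeout_seconds)


tracker = HeartbeatTracker(timeout_seconds=3)
tracker.record("esx01", now=0)
tracker.record("esx02", now=0)
for second in range(1, 5):
    tracker.record("esx01", now=second)  # esx01 keeps sending one heartbeat per second

print(tracker.silent_hosts(now=4))  # ['esx02']
```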

Datastore Heartbeat: –

  • It adds an extra level of resiliency and prevents unnecessary restart attempts.
  • Datastore heartbeating enables a master to determine the state of a host that is not reachable via the management network.
  • By default, two datastores are selected per host. You can increase this number with the following advanced option. Valid values range from 2 to 5, and the default is 2.
das.heartbeatDsPerHost
  • The selection process prefers VMFS datastores over NFS datastores.
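The two rules above (a count between 2 and 5, and VMFS preferred over NFS) can be sketched as follows. This is a deliberately simplified model: the function names and datastore names are hypothetical, and any other selection criteria HA may apply are ignored here.

```python
def validate_heartbeat_ds_per_host(value: int = 2) -> int:
    """Accept only values in the documented range of 2 to 5."""
    if not 2 <= value <= 5:
        raise ValueError("value must be between 2 and 5")
    return value


def pick_heartbeat_datastores(datastores, count: int = 2):
    """Return `count` datastore names, listing VMFS datastores before NFS ones."""
    count = validate_heartbeat_ds_per_host(count)
    # sorted() is stable, so datastores of the same type keep their original order.
    ranked = sorted(datastores, key=lambda ds: 0 if ds[1] == "VMFS" else 1)
    return [name for name, _type in ranked[:count]]


# Hypothetical inventory of (name, type) pairs.
inventory = [("nfs01", "NFS"), ("vmfs01", "VMFS"), ("vmfs02", "VMFS")]
print(pick_heartbeat_datastores(inventory))  # ['vmfs01', 'vmfs02']
```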

Network Isolated vs Network Partitioned: –

Isolated

  • A host is isolated when it does not observe any HA management traffic on the management network and it cannot ping the configured isolation address.
  • The master treats a host as isolated only when the host informs the master, via the datastore, that it is isolated.

Partitioned

  • A host is partitioned when it is operational but the master cannot reach it over the management network.
  • In a network partition, there will be multiple masters.
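The difference between these states, as seen from the master, can be summarized in a small decision function. This is a simplified sketch based only on the rules in this article; the `HostState` enum, the function name, and its three boolean inputs are hypothetical, and real HA performs additional checks.

```python
from enum import Enum


class HostState(Enum):
    ALIVE = "alive"
    ISOLATED = "isolated"
    PARTITIONED = "partitioned"
    FAILED = "failed"


def classify_host(network_heartbeat: bool, datastore_heartbeat: bool,
                  reports_isolated: bool) -> HostState:
    """Classify a slave from the master's point of view."""
    if network_heartbeat:
        return HostState.ALIVE          # reachable over the management network
    if not datastore_heartbeat:
        return HostState.FAILED         # no sign of life on either channel
    if reports_isolated:
        return HostState.ISOLATED       # host declared itself isolated via the datastore
    return HostState.PARTITIONED        # running, but unreachable over the management network


print(classify_host(False, True, False).value)  # partitioned
```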