Knowledge
- Identify the three admission control policies for HA
- Identify heartbeat options and dependencies
Skills and Abilities
- Calculate host failure requirements
- Configure customized isolation response settings
- Configure HA redundancy in a mixed ESX/ESXi environment
- Configure HA related alarms and monitor an HA cluster
- Create a custom slot size configuration
- Understand interactions between DRS and HA
- Create an HA solution that ensures primary node distribution across sites
- Analyze vSphere environment to determine appropriate HA admission control policy
- Analyze performance metrics to calculate host failure requirements
- Analyze Virtual Machine workload to determine optimum slot size
- Analyze HA cluster capacity to determine optimum cluster size
Tools
- vSphere Availability Guide
- Product Documentation
- vSphere Client
Notes
Configure customized isolation response settings
- The isolation response is the action HA takes when a host is isolated from the heartbeat network. The response is either power off, leave powered on (the default), or shut down.
- HA will try to restart the affected Virtual Machines, by default up to five times. This is configurable with the parameter das.maxvmrestartcount.
- The default value for isolation failure detection is 15 seconds. This is configurable with the parameter das.failuredetectiontime.
- At that point a restart is initiated by one of the primary hosts. The isolation response is actually initiated 1 second before the failure detection time.
Read the excerpt below from Duncan Epping’s blog on why it is important to understand these parameters and what effect configuring isolation response settings can have on your environment.
The default value for isolation/failure detection is 15 seconds. In other words, the failed or isolated host will be declared dead by the other hosts in the HA cluster on the fifteenth second, and a restart will be initiated by one of the primary hosts. For now let’s assume the isolation response is “power off”. The “power off” (isolation response) will be initiated by the isolated host 1 second before the das.failuredetectiontime. A “power off” will be initiated on the fourteenth second and a restart will be initiated on the fifteenth second.
Does this mean that you can end up with your VMs being down and HA not restarting them?
Yes, when the heartbeat returns between the 14th and 15th second the “power off” could already have been initiated. The restart however will not be initiated because the heartbeat indicates that the host is not isolated anymore.
How can you avoid this?
Pick “Leave VM powered on” as an isolation response. Increasing das.failuredetectiontime will also decrease the chances of running into issues like these.
Basic design principle: Increase “das.failuredetectiontime” to 30 seconds (30000) to decrease the likelihood of a false positive.
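The race Duncan describes comes down to simple arithmetic on das.failuredetectiontime. The sketch below just illustrates that timeline; the function name and return structure are illustrative, not anything vSphere exposes.

```python
# Illustrates the isolation-response race described above: the isolated
# host fires the isolation response 1 second before the failure
# detection time, and a primary host initiates restarts at the
# detection time itself.

def isolation_timeline(failure_detection_ms: int) -> dict:
    """Return the key moments (in seconds) of an HA isolation event."""
    detection_s = failure_detection_ms / 1000
    return {
        "isolation_response_at": detection_s - 1,  # e.g. "power off" fires here
        "restart_initiated_at": detection_s,       # a primary restarts the VMs
    }

# Default 15000 ms: power off at second 14, restart at second 15.
# If the heartbeat returns between those two moments, the power off has
# already happened but no restart follows: the VM stays down.
print(isolation_timeline(15000))  # {'isolation_response_at': 14.0, 'restart_initiated_at': 15.0}

# The design principle above (30000 ms) widens the window before any
# action is taken, reducing the chance of a false positive.
print(isolation_timeline(30000))  # {'isolation_response_at': 29.0, 'restart_initiated_at': 30.0}
```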
A couple of additional parameters I’ve found necessary:
- das.isolationaddressX - Used to configure multiple isolation addresses.
- das.usedefaultisolationaddress - Set to true/false; used in the case where the default gateway is not pingable, in which case set this to false in conjunction with configuring another address for das.isolationaddress.
Configure HA redundancy in a mixed ESX/ESXi environment
- Redundant management networks are recommended for HA and there are two options to choose from, Network Redundancy Using NIC Teaming or Network Redundancy Using a Secondary Network.
Create a custom slot size configuration
A slot is a logical representation of the memory and CPU resources required to power on any virtual machine in the cluster.
Slot size is calculated using the highest CPU reservation and the highest memory reservation of any given VM, with a default of 256 MHz for CPU and 0 MB + memory overhead for memory if no reservations are specified.
Set das.slotCpuInMHz or das.slotMemInMB to manually lower the slot size in cases where one VM with a large reservation produces an unreasonably large slot size.
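The default calculation and the effect of the two advanced options can be sketched as below. This is a minimal illustration of the rules stated above, not a VMware API; the function and data-structure names are made up.

```python
# Sketch of the default HA slot-size calculation: the slot is the
# largest CPU reservation and the largest (memory reservation +
# overhead) across all VMs, with a 256 MHz CPU default when no
# reservation is set. das.slotCpuInMHz / das.slotMemInMB act as caps,
# so a single large reservation cannot inflate the slot for everyone.

DEFAULT_CPU_MHZ = 256  # default CPU component when no reservation exists

def slot_size(vms, cpu_cap_mhz=None, mem_cap_mb=None):
    """vms: list of dicts with cpu_reservation_mhz, mem_reservation_mb,
    mem_overhead_mb. Returns (cpu_mhz, mem_mb) for one slot."""
    cpu = max((vm["cpu_reservation_mhz"] or DEFAULT_CPU_MHZ) for vm in vms)
    mem = max(vm["mem_reservation_mb"] + vm["mem_overhead_mb"] for vm in vms)
    if cpu_cap_mhz is not None:      # das.slotCpuInMHz
        cpu = min(cpu, cpu_cap_mhz)
    if mem_cap_mb is not None:       # das.slotMemInMB
        mem = min(mem, mem_cap_mb)
    return cpu, mem

vms = [
    {"cpu_reservation_mhz": 0,    "mem_reservation_mb": 0,    "mem_overhead_mb": 100},
    {"cpu_reservation_mhz": 2000, "mem_reservation_mb": 8192, "mem_overhead_mb": 150},
]
print(slot_size(vms))              # (2000, 8342): one big VM dictates the slot
print(slot_size(vms, 500, 1024))   # (500, 1024): the advanced options tame it
```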
Understand interactions between DRS and HA
In terms of functionality today, DRS and HA, while separate features, do work together to provide availability and performance. When a host goes down, HA performs a failover and restarts the affected virtual machines to restore availability. It is then DRS’s job to balance those machines across the cluster.
Create an HA solution that ensures primary node distribution across sites
The first 5 hosts that join the HA cluster are automatically selected as primary nodes.
You can manually view which nodes are primary with this command
cat /var/log/vmware/aam/aam_config_util_listnodes.log
Re-election of a primary will only occur when a primary is placed in maintenance mode, disconnected or removed from the cluster, or when a cluster is reconfigured for HA.
If all primary nodes fail simultaneously, no HA-initiated restart of VMs occurs.
In order to design an HA solution that ensures a primary is always available, the placement of your hosts is crucial. For each cluster, never put more than four hosts in a location that could be a single point of failure, for example a chassis of blades. If you have 10 blades and two chassis, separate the blades between the two chassis, and additionally make sure that no more than four blades from any one cluster are in each chassis.
Read more about this concept here: http://www.yellow-bricks.com/2009/02/09/blades-and-ha-cluster-design/
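The placement rule above can be sanity-checked mechanically. The sketch below assumes a simple host-to-chassis mapping you would build yourself; nothing here is a vSphere data structure.

```python
# Checks the design rule above: since an HA cluster has at most five
# primary nodes, no single failure domain (e.g. a blade chassis) should
# hold more than four hosts from the same cluster, so at least one
# primary survives the loss of any one domain.

from collections import Counter

MAX_HOSTS_PER_DOMAIN = 4  # with 5 primaries, at least one lands elsewhere

def placement_violations(host_to_chassis):
    """host_to_chassis: host name -> chassis name, for one HA cluster.
    Returns the chassis that hold too many hosts from this cluster."""
    counts = Counter(host_to_chassis.values())
    return sorted(c for c, n in counts.items() if n > MAX_HOSTS_PER_DOMAIN)

bad = {f"esx{i:02d}": "chassis-a" for i in range(1, 6)}        # 5 in one chassis
ok = {f"esx{i:02d}": ("chassis-a" if i <= 4 else "chassis-b")  # 4 + 4 split
      for i in range(1, 9)}

print(placement_violations(bad))  # ['chassis-a']: all primaries could die together
print(placement_violations(ok))   # []: rule satisfied
```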
Analyze vSphere environment to determine appropriate HA admission control policy
You have three choices for HA admission control policies
Host failures cluster tolerates
- The host with the most slots is taken out of the equation, and then the next largest if the number of host failures to tolerate is set higher than 1.
- Your reserved resources should be equal to or larger than your largest host so that all VMs on that host can be restarted.
- Tends to be very conservative as largest reservation dictates the slot size.
Percentage of cluster resources reserved as failover spare capacity
- Instead of using slot sizes, sets a percentage of resources to be left unused for HA purposes
- Tends to be a more realistic view of reservations as it uses actual reservations vs. slot size.
- More flexible.
Specify a failover host
- Designates a single host specifically for failover purposes. You may end up with a lot of reserved capacity for failover, but you will also get only a single host for use as a failover server.
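For the percentage-based policy, a back-of-the-envelope sketch of choosing the number: the reserved percentage should at least cover the largest host, so its VMs can be restarted elsewhere. The function name and capacity units below are illustrative assumptions.

```python
# Sketch: pick a failover-capacity percentage large enough that the
# biggest host in the cluster can fail and its VMs can still be
# restarted on the remaining hosts.

import math

def min_failover_percentage(host_capacities):
    """host_capacities: per-host capacity in any consistent unit
    (e.g. GHz or GB). Returns the smallest whole percentage to reserve."""
    total = sum(host_capacities)
    largest = max(host_capacities)
    return math.ceil(largest / total * 100)

# Four equal hosts: reserving 25% covers the loss of any one of them.
print(min_failover_percentage([100, 100, 100, 100]))  # 25
# Unbalanced cluster: the big host drives the percentage up.
print(min_failover_percentage([200, 100, 100, 100]))  # 40
```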
Analyze Virtual Machine workload to determine optimum slot size
If you are manually specifying the memory and CPU values for the slot size, make sure the slot size is representative of typical workloads.
Analyze HA cluster capacity to determine optimum cluster size
How many hosts in your cluster?
How many host failures can you tolerate?
What is the resource utilization of your virtual machines?
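The three questions above can be combined into a simple N+1 sizing sketch. This assumes identical hosts and a single aggregate demand figure, both simplifications for illustration.

```python
# Sketch: smallest cluster size that still carries the VM load after a
# given number of host failures (identical hosts assumed).

import math

def min_cluster_size(total_vm_demand, host_capacity, tolerated_failures):
    """total_vm_demand and host_capacity in the same units; returns the
    minimum host count for N + tolerated_failures availability."""
    needed = math.ceil(total_vm_demand / host_capacity)
    return needed + tolerated_failures

# 900 units of VM demand on 200-unit hosts, tolerating one failure:
# 5 hosts carry the load, so a 6-host cluster is needed for N+1.
print(min_cluster_size(900, 200, 1))  # 6
```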
Other Relevant Articles and Links Related to this Section
- http://searchsystemschannel.techtarget.com/generic/0,295582,sid99_gci1515486,00.html
- http://www.dailyhypervisor.com/2009/03/31/vmware-ha-cluster-sizing-considerations/
- http://www.yellow-bricks.com/vmware-high-availability-deepdiv/
- http://www.b3rg.nl/vcdx/section-4-business-continuity-and-data-protection/objective-4.2-configure-advanced-ha-deployments.htm
- http://geeksilver.wordpress.com/2010/10/04/vcap-dca-section-4-%E2%80%93-manage-business-continuity-and-protect-data-objective-4-1/