We tend to think of our physical servers as a sea of memory and CPU cores, but virtualization adds contention, with multiple cores working in a shared memory space. Unless the virtualization consultant designing your infrastructure has a firm grasp of Non-Uniform Memory Access (NUMA) node sizes and their impact on large virtual machines, that contention can lead to performance issues.
Nehalem chips onward have memory handling inside the CPU
Intel changed its processor microarchitecture starting with the Nehalem chip, moving memory management inside the CPU rather than leaving it in a separate chip on the Northbridge. In this architecture, a particular RAM dual inline memory module (DIMM) is attached to just one CPU socket. To get to RAM attached to another CPU socket, the RAM pages must be requested over the interconnect bus that joins the CPU sockets; the remote socket then accesses the DIMM and returns the data.
Access to remote DIMMs is slower than access to local DIMMs since it crosses an additional bus, making for non-uniform memory access (NUMA). The combination of the CPU cores in one socket and the RAM that is local to that socket is called a NUMA node; the physical server's BIOS passes this information to the ESXi server at boot time. In the four-socket host example below, each NUMA node has eight CPU cores and 32 GB of RAM.
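To make the arithmetic concrete, here is a minimal Python sketch that works out the per-node resources for an evenly populated host. The helper function and the host figures are illustrative assumptions matching the example below, not anything vSphere provides.

    # Back-of-the-envelope NUMA node sizing, assuming cores and DIMMs are
    # spread evenly across sockets (hypothetical helper, not a vSphere API).
    def numa_node_size(sockets, cores_per_socket, total_ram_gb):
        """Return (cores, GB of RAM) available in each NUMA node."""
        return cores_per_socket, total_ram_gb / sockets

    cores, ram_gb = numa_node_size(sockets=4, cores_per_socket=8, total_ram_gb=128)
    print(f"Each NUMA node: {cores} cores, {ram_gb:.0f} GB RAM")
    # Each NUMA node: 8 cores, 32 GB RAM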
VMware vSphere tries to jockey VMs into a single NUMA node
Let's use the following example: A modern virtualization host might have four sockets, each with eight cores, and 16 RAM DIMMs of eight gigabytes each. The whole host has 32 cores and 128 GB of RAM. This sounds like a great place to run a VM with four vCPUs and 40 GB of RAM or one with ten vCPUs and 24 GB of RAM, but both of those VM configurations make life difficult for the VMkernel and lead to potentially odd performance.
The vSphere hypervisor has been NUMA-aware since ESX 3, and it tries to keep a virtual machine inside a single NUMA node to provide the best and most consistent performance. A large VM might not fit inside a single NUMA node. A VM configured with four vCPUs and 40 GB of RAM wouldn't fit in the example NUMA node; it would need to be spread across two NUMA nodes, becoming a NUMA-wide VM. Most of its RAM would be in one NUMA node, but some would be in another node and, thus, slower to access. All of its vCPUs would be inside the first node -- its home NUMA node -- so every vCPU would have the same speed of access to each page of RAM. The VM with ten vCPUs and 24 GB of RAM is also too wide; although its RAM fits into its home NUMA node, two of its vCPUs need to be scheduled on another node. For the vCPUs on the non-home node, all RAM is remote, so they will run slower, although the VM won't know why.
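A quick way to see why both example VMs are wide is to check each resource against the node size. The following Python sketch is illustrative only; the labels are my own shorthand, not VMkernel terminology.

    # Rough fit check of a VM configuration against one NUMA node
    # (8 cores and 32 GB from the example host; labels are illustrative).
    NODE_CORES = 8
    NODE_RAM_GB = 32

    def numa_fit(vcpus, ram_gb):
        cpu_fits = vcpus <= NODE_CORES
        ram_fits = ram_gb <= NODE_RAM_GB
        if cpu_fits and ram_fits:
            return "fits inside one NUMA node"
        if cpu_fits:
            return "wide: some RAM lands on a remote node"
        if ram_fits:
            return "wide: some vCPUs are scheduled on a remote node"
        return "wide: both vCPUs and RAM span nodes"

    print(numa_fit(vcpus=4, ram_gb=40))   # wide: some RAM lands on a remote node
    print(numa_fit(vcpus=10, ram_gb=24))  # wide: some vCPUs are scheduled on a remote node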
Knowing your application workload and architecture is crucial here. If the wide VM can be split into two smaller VMs, each of which fits into a NUMA node, then you will likely get better performance. At the very least, you will get more consistent results, which will help when troubleshooting.
If your application is NUMA aware, even better. With vNUMA, vSphere can create a VM that is itself NUMA aware. The VM is divided into virtual NUMA nodes, each of which is placed onto a different physical NUMA node. Although the virtual machine is still spread across two NUMA nodes, the operating system and application inside the VM are aware of the NUMA split and can optimize their use of resources. Not every business-critical application is NUMA aware: Microsoft Exchange is not, for example, while Microsoft SQL Server is.
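As a rough illustration of the idea -- not the VMkernel's actual placement algorithm -- the sketch below splits a wide VM's resources into equal virtual NUMA nodes, each small enough to sit on one physical node.

    # Simplified vNUMA-style split of a wide VM into virtual NUMA nodes.
    # Assumes an even split; the real scheduler weighs far more factors.
    import math

    NODE_CORES = 8  # physical cores per NUMA node in the example host

    def vnuma_layout(vcpus, ram_gb):
        nodes = math.ceil(vcpus / NODE_CORES)
        base, extra = divmod(vcpus, nodes)
        return [{"vcpus": base + (1 if i < extra else 0), "ram_gb": ram_gb / nodes}
                for i in range(nodes)]

    print(vnuma_layout(vcpus=10, ram_gb=24))
    # [{'vcpus': 5, 'ram_gb': 12.0}, {'vcpus': 5, 'ram_gb': 12.0}]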
VMware Design Experts and Consultants need to be aware of hardware configuration
Knowing the hardware is important. That means knowing the NUMA node size of the physical servers and fitting your VMs to that node size. It is also important to keep clusters consistent, with the same NUMA node size on all the hosts in the cluster, since a VM that fits in a NUMA node on eight-core CPUs may not fit so well on a host with six-core CPUs. This also affects the number of vCPUs you assign to VMs; if you have several VMs with more than two vCPUs, make sure that multiple VMs fit into the NUMA core count. Six-core NUMA nodes suit two- and three-vCPU VMs well, but multiple four-vCPU VMs per NUMA node might not perform so well, since there isn't space for two of them at the same time on the node. Four-vCPU virtual machines fit NUMA node sizes of four and eight cores far better.
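The core-count arithmetic behind that advice is simple enough to sketch. The numbers below only count cores and ignore memory, hyperthreading and everything else the scheduler weighs.

    # How many VMs of a given vCPU count can share one NUMA node at a time,
    # counting cores only (memory and scheduler behaviour ignored).
    def vms_per_node(node_cores, vm_vcpus):
        return node_cores // vm_vcpus

    for node_cores in (6, 8):
        for vm_vcpus in (2, 3, 4):
            print(f"{node_cores}-core node, {vm_vcpus}-vCPU VMs: "
                  f"{vms_per_node(node_cores, vm_vcpus)} at a time")
    # A 6-core node takes only one 4-vCPU VM at a time; an 8-core node takes two.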
Like so many things that impact vSphere design, awareness is the key to avoiding problems. Small VMs don't need to know or care about NUMA but the large -- and therefore critical -- VMs in your enterprise may need NUMA awareness to perform adequately. Design your large VMs to suit the NUMA architecture of your hosts, and make sure your HA and DRS clusters have a consistent NUMA node size.