Ask a cloud engineer why their virtual machine is underperforming and you'll typically hear the usual suspects: insufficient vCPU allocation, memory pressure, network bottlenecks, or storage I/O saturation. What rarely gets named, despite being one of the most pervasive and measurable culprits in multi-socket server environments, is NUMA misalignment: specifically, the performance penalty that accumulates when a virtual machine's workload is forced to repeatedly cross NUMA boundaries to fetch data from memory that isn't local to the CPU executing the thread.

NUMA boundary crossings are not a niche edge case. They are a structural feature of nearly every modern server platform, and they become a critical performance variable the moment a VM is sized, configured, or migrated in a way that misaligns its virtual topology with the physical hardware beneath it. Understanding the mechanics of this problem, and how hypervisors either mitigate or amplify it, is essential for anyone engineering cloud infrastructure for performance-sensitive workloads.

What NUMA Means at the Hardware Level

Non-Uniform Memory Access (NUMA) is an architectural design used in all modern multi-socket server platforms. Rather than providing a single shared memory bus for all CPUs, each physical CPU socket has its own dedicated bank of local DRAM and its own memory controllers. When a CPU core needs to read data from its own locally attached memory, the operation is fast and efficient. When that same core needs data residing in the memory bank attached to a different socket, the request must traverse an inter-socket interconnect, such as Intel's UPI (Ultra Path Interconnect) or AMD's Infinity Fabric, to reach the remote memory controller.

This traversal has a quantifiable cost. Research benchmarking AMD EPYC Rome and Intel Cascade Lake processors found that remote-socket memory access on EPYC Rome introduces round-trip latencies in the range of 200 to 218 nanoseconds, depending on the specific cross-node path, compared to substantially lower local access times. On Intel's Xeon Skylake architecture, the "memory directory" feature, optimized to minimize local latency at the expense of remote latency, can push cross-socket access times into the 800-nanosecond range under contention, a significant amplification of what should be a predictable hardware operation.

For individual memory accesses, a few hundred nanoseconds may sound inconsequential. But in applications that issue millions of memory operations per second (relational databases, in-memory caches, real-time analytics engines, machine learning inference), the aggregate cost of systematically hitting remote NUMA memory is not measured in nanoseconds. It is measured in reduced query throughput, elevated CPU utilization, and degraded application-tier response times.
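A back-of-envelope model makes the aggregate cost concrete. The figures below (access rate, remote fraction, and the two latencies) are illustrative assumptions roughly in line with the ranges above, not measurements of any specific system:

```python
# Back-of-envelope model of aggregate NUMA-remote access cost.
# All inputs are illustrative assumptions, not benchmark results.

def remote_access_overhead_ms(dram_accesses_per_sec: float,
                              remote_fraction: float,
                              local_latency_ns: float,
                              remote_latency_ns: float) -> float:
    """Extra memory stall time, in milliseconds per second of wall clock,
    paid because remote_fraction of DRAM accesses incur the remote
    latency instead of the local one."""
    extra_ns_per_access = remote_latency_ns - local_latency_ns
    return dram_accesses_per_sec * remote_fraction * extra_ns_per_access / 1e6

# A thread issuing 5 million cache-missing accesses per second, with 20%
# landing on a remote node (assumed 90 ns local, 210 ns remote):
print(remote_access_overhead_ms(5_000_000, 0.20, 90, 210))  # → 120.0
```

Under these assumptions the thread loses 120 ms of every second to remote-access stalls, roughly 12% of a core, which is precisely the kind of overhead that surfaces as elevated CPU utilization rather than an explicit error.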

How NUMA Boundaries Are Crossed in Virtualized Environments

In a physical server, NUMA boundary crossings occur when the operating system's thread scheduler places a thread on a CPU core whose local memory doesn't contain that thread's working set. Modern NUMA-aware operating systems manage this through policies like Linux's first-touch allocation and automatic NUMA balancing, which attempt to migrate pages toward the socket that most frequently accesses them.

In a virtualized environment, the problem has an additional layer: the hypervisor must map virtual CPUs (vCPUs) and VM memory onto physical NUMA resources. When this mapping is done well, the VM's workload remains NUMA-local. When it is done poorly, or when the VM is sized or configured in a way that prevents good placement, virtually every memory access the VM issues may cross a physical NUMA boundary without the guest operating system having any visibility into it.

There are several specific ways this misalignment occurs in practice.

VM Sizing That Exceeds a Single NUMA Node

Every physical server has a fixed number of CPU cores and a fixed amount of RAM per NUMA node. On a dual-socket AMD EPYC 9654 host with SMT enabled, for example, each socket has 96 physical cores and 192 logical threads. The physical NUMA boundary is at the socket level. A VM configured with more vCPUs than a single NUMA node can accommodate will inherently span multiple NUMA nodes.

When a VM spans two physical NUMA nodes, the hypervisor must distribute both its vCPUs and its memory allocation across both nodes. This is not inherently catastrophic; it can be managed through virtual NUMA (vNUMA) topology exposure, where the hypervisor presents the underlying NUMA structure to the guest OS, allowing NUMA-aware applications to optimize their own thread and memory placement. But when the vNUMA topology exposed to the guest does not accurately reflect the physical topology, or when it is not exposed at all, the guest OS has no information on which to base good scheduling decisions. It places threads and allocates memory without NUMA awareness, and the result is a workload that continuously crosses physical NUMA boundaries.

vNUMA Misconfiguration and Topology Mismatch

VMware's vSphere platform has evolved its vNUMA handling significantly over time, but configuration pitfalls remain common. In vSphere 6.5 and later, the hypervisor automatically determines an optimal vNUMA topology based on the underlying physical host. However, this automatic determination considers only the compute dimension; it does not account for memory sizing. A VM configured with memory that exceeds a single physical NUMA node's capacity, but with a vCPU count that falls within one node's bounds, will be assigned a single vNUMA node. The guest OS sees a single NUMA domain, believes it has local memory access, and makes no effort to balance allocations, while physically much of its memory resides on a remote node.
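Both sizing pitfalls, too many vCPUs for one node and too much memory for one node, reduce to the same check. The helper below is a hypothetical sizing aid; the host capacities are assumed parameters, not defaults from any hypervisor:

```python
from math import ceil

def numa_nodes_required(vcpus: int, mem_gib: int,
                        cores_per_node: int, mem_gib_per_node: int) -> int:
    """Minimum number of physical NUMA nodes a VM must span, taking
    BOTH the vCPU dimension and the memory dimension into account."""
    return max(ceil(vcpus / cores_per_node), ceil(mem_gib / mem_gib_per_node))

# A 48-vCPU, 512 GiB VM on a host with 96 cores and 384 GiB per node:
# it fits by vCPU count, but its memory footprint forces two nodes.
print(numa_nodes_required(48, 512, 96, 384))  # → 2
```

Checking the memory dimension explicitly, as this helper does, is exactly what the automatic vNUMA sizing described above omits.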

Similarly, enabling vCPU hot-add on a VMware VM disables vNUMA entirely. The guest OS then sees all vCPUs as belonging to a single NUMA domain, regardless of the physical topology below. For workloads like SQL Server, which is deeply NUMA-aware and uses NUMA topology to partition its internal memory objects and thread schedulers, this misconfiguration can introduce significant CPU wait contention. Some practitioners have reported CPU cost increases of up to 30% when vNUMA is inadvertently disabled through hot-add configuration.

Sub-NUMA Clustering and Cluster-on-Die Configurations

Modern server processors, particularly AMD EPYC and Intel Xeon, support BIOS-level settings that subdivide each physical socket into smaller NUMA domains. AMD's NPS (NUMA Nodes Per Socket) setting can divide a single physical package into 1, 2, or 4 NUMA nodes. Intel's Sub-NUMA Clustering (SNC) similarly divides each socket into two, three, or four clusters, depending on the processor generation.

These configurations are designed to improve NUMA locality for workloads that fit within the smaller subdomains, as the effective local memory latency decreases when fewer cores share a memory controller. However, for larger VMs whose vCPU and memory requirements span several of these subdomains, SNC and NPS settings that divide the NUMA topology finely can actually worsen performance. VMware's own guidance notes that SNC configurations can constrain performance for large workloads by creating more NUMA boundaries for transactions to cross, effectively multiplying the number of NUMA hops a large VM's operations must traverse.
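The effect of finer BIOS partitioning on VM spanning is simple arithmetic. The core counts below are illustrative assumptions:

```python
from math import ceil

def domains_spanned(vcpus: int, cores_per_socket: int, nps: int) -> int:
    """NUMA domains a VM's vCPUs span under a given AMD NPS setting,
    which divides each socket into `nps` equal domains."""
    cores_per_domain = cores_per_socket // nps
    return ceil(vcpus / cores_per_domain)

# The same 64-vCPU VM on a 96-core-per-socket host:
print(domains_spanned(64, 96, 1))  # NPS1: one 96-core domain  → 1
print(domains_spanned(64, 96, 4))  # NPS4: four 24-core domains → 3
```

The VM that was fully NUMA-local under NPS1 straddles three domains under NPS4, which is why the finer settings are best reserved for fleets of small, node-fitting VMs.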

vMotion Migration to Hosts with Different Physical Topologies

vNUMA topology is locked at VM power-on time and cannot be changed while the VM is running. When a vMotion live migration moves a VM to a host with a different physical NUMA layout (different core counts per socket, different memory per node), the vNUMA topology presented to the guest OS no longer matches the physical host. The guest OS continues making NUMA-aware decisions based on the topology it was presented at boot, but those decisions now reflect a phantom architecture. Memory allocation choices that were NUMA-optimal on the original host become NUMA-suboptimal, or actively NUMA-remote, on the destination.

This is a particularly insidious form of NUMA misalignment because it introduces no obvious error, no warning in guest OS logs, and no visible configuration change. The VM simply begins performing worse after migration, and diagnosing the root cause requires correlating migration events with performance metrics at a level of granularity that most monitoring stacks don't capture by default.

The Performance Impact: What the Numbers Show

The performance consequences of NUMA boundary crossings scale with workload characteristics. Memory-bound workloads, those that issue frequent random memory accesses across large working sets, are hit hardest, because each access that misses local cache and requires DRAM is potentially a NUMA-remote access.

For database workloads, NUMA misalignment typically manifests as elevated latch wait times, increased CMEMTHREAD contention in SQL Server, higher buffer pool miss rates, and reduced query throughput under concurrent load. For in-memory key-value stores and caching layers, NUMA-remote memory access can directly inflate operation latency. For multithreaded analytical workloads, cross-NUMA memory traffic saturates inter-socket interconnect bandwidth, creating a shared bottleneck that degrades all threads, regardless of whether they themselves are making remote accesses.

Tools like esxtop on VMware ESXi expose a NUMA locality metric (the N%L value in the memory view) that shows what percentage of a VM's memory is being accessed locally. Values below 80% are a reliable indicator of NUMA misalignment, meaning at least one in five memory accesses the VM makes is crossing a physical NUMA boundary. On heavily consolidated hosts with VMs of varying sizes, NUMA locality values below 60% are not uncommon and correlate strongly with elevated application latency and CPU utilization.
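A monitoring pipeline that scrapes these locality percentages can apply the 80% rule of thumb mechanically. The sketch below assumes you already have per-VM locality values in hand; the sample data is invented:

```python
def flag_low_locality(vm_locality: dict, threshold: float = 80.0) -> list:
    """Return, sorted by name, the VMs whose NUMA locality percentage
    (share of memory accesses served from the local node) is below
    the threshold."""
    return sorted(name for name, pct in vm_locality.items() if pct < threshold)

samples = {"db01": 94.0, "cache02": 71.5, "api03": 58.0}
print(flag_low_locality(samples))  # → ['api03', 'cache02']
```

Alerting on the trend rather than a single sample avoids false positives from transient page migrations after a VM starts or moves.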

Strategies to Minimize NUMA Boundary Crossings

Right-Sizing VMs to Fit Physical NUMA Boundaries

The most effective strategy is also the most straightforward: size VMs so their vCPU count and memory allocation fit within a single physical NUMA node. This eliminates the need for cross-node placement entirely. For hosts where VMs must span NUMA nodes, the next best approach is to ensure the vNUMA topology exposed to the guest is accurate and symmetrical, distributing vCPUs and memory evenly across vNUMA nodes that map cleanly to their physical counterparts.

On dual-socket AMD EPYC 9654 hosts with four physical NUMA nodes across the two sockets, a large VM requiring 144 vCPUs is optimally configured with four vNUMA nodes of 36 vCPUs each, with two vNUMA nodes fitting per physical socket. This symmetry ensures that intra-socket NUMA hops are minimized and cross-socket NUMA traffic is balanced rather than concentrated.
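In vSphere, this split can be steered explicitly through the documented `numa.vcpu.maxPerVirtualNode` advanced setting. The fragment below applies it to the example above; treat the specific value as workload- and host-specific, not universal:

```ini
# .vmx advanced settings for the 144-vCPU example:
# cap each vNUMA node at 36 vCPUs so four symmetric nodes are created.
numa.vcpu.maxPerVirtualNode = "36"
```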

Disable CPU Hot-Add for Performance-Critical VMs

For database VMs, high-throughput API servers, or any workload where NUMA-aware scheduling in the guest is important, vCPU hot-add should be disabled. The convenience of adding CPUs without a reboot is not worth the cost of collapsing the guest's NUMA topology into a single flat domain. This is not a niche optimization; it is a documented best practice for any vNUMA-dependent workload.
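On VMware this corresponds to leaving CPU hot-plug unchecked in the VM's settings, or equivalently to the documented `.vmx` key:

```ini
# Keep vCPU hot-add disabled so the hypervisor can expose vNUMA.
vcpu.hotadd = "FALSE"
```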

Use NUMA-Aware Monitoring to Validate Placement

Regular monitoring of NUMA locality metrics is underutilized in most cloud and virtualized environments. Building NUMA locality checks into routine performance reviews, using esxtop on ESXi hosts, numastat in Linux guest VMs, or equivalent tooling on other hypervisors, provides early warning before NUMA misalignment becomes a crisis. In SQL Server environments, periodic checks on NUMA node topology in Server Properties and on CMEMTHREAD wait accumulation serve as effective proxies for NUMA health.
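Inside a Linux guest, numastat's per-node counters reduce to a single health number. The parser below runs against an embedded sample (the counter values are invented); in practice you would feed it the live command output:

```python
SAMPLE = """\
                           node0           node1
numa_hit              1200000000       980000000
numa_miss               45000000        82000000
numa_foreign            82000000        45000000
interleave_hit             12000           11800
local_node            1199000000       979000000
other_node              46000000        83000000
"""

def numa_miss_ratio(numastat_text: str) -> float:
    """Fraction of allocations that landed on a node other than the one
    the kernel preferred: numa_miss / (numa_hit + numa_miss), summed
    across all nodes."""
    totals = {}
    for line in numastat_text.splitlines():
        parts = line.split()
        if parts and parts[0] in ("numa_hit", "numa_miss"):
            totals[parts[0]] = sum(int(v) for v in parts[1:])
    return totals["numa_miss"] / (totals["numa_hit"] + totals["numa_miss"])

print(f"{numa_miss_ratio(SAMPLE):.1%}")  # → 5.5%
```

Note that these counters track allocation-time placement, not runtime access locality, so they complement rather than replace hypervisor-side metrics like esxtop's locality percentage.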

Leverage Modern Hypervisor NUMA Features

VMware vSphere 8's vTopology framework introduced dynamic vNUMA adjustment that automatically reconfigures the guest's virtual NUMA topology when CPU or memory resources are hot-added, eliminating one of the primary causes of topology staleness after configuration changes. Similarly, the enhanced DRS (Distributed Resource Scheduler) in vSphere 8 incorporates NUMA-aware workload placement, using memory locality as a factor in both initial placement decisions and live migration choices. For organizations running performance-sensitive workloads, upgrading to hypervisor versions that support these NUMA-aware placement features is a meaningful operational improvement.

Choose Infrastructure with Transparent NUMA Characteristics

Not all cloud infrastructure is created equal when it comes to NUMA visibility and control. Public cloud providers typically abstract hardware topology entirely: the guest VM has no insight into the physical NUMA architecture of the host, and users have limited ability to influence vCPU-to-NUMA mapping. For workloads where NUMA locality is a performance prerequisite, infrastructure that provides transparent hardware topology and dedicated compute resources is a meaningful advantage. Platforms like AceCloud's high-performance cloud infrastructure are designed to give operators meaningful control over compute resource placement, enabling the kind of NUMA-aligned VM configurations that make the difference between predictable low-latency performance and unexplained throughput degradation.

NUMA in the Age of High-Core-Count Processors

As AMD EPYC and Intel Xeon platforms push toward 96, 128, and 192 cores per socket, the NUMA complexity within a single physical package has increased substantially. AMD's chiplet architecture introduces intra-socket NUMA effects through its I/O die design, where cores on different chiplets (CCDs) have varying latency paths to different memory controllers. The latest AMD EPYC 9005 series (Turin) handles this relatively gracefully (worst-case unloaded cross-node latencies within a single socket are under 140 nanoseconds), but optimal performance still requires matching workload placement to cache and memory topology at a finer granularity than was necessary with monolithic CPU designs.

Intel's Xeon 6 takes a different architectural approach, maintaining a more monolithic view of on-die memory access, but individual cores can still see substantially elevated latency (over 180 nanoseconds) when accessing memory controllers at the far end of the die. Both architectures reward NUMA-aware placement and penalize randomized memory allocation.

For cloud architects, this increasing intra-socket NUMA complexity means that even single-socket servers are no longer NUMA-flat. A VM placed entirely on a single socket may still experience meaningful NUMA effects depending on which chiplet its vCPUs land on and where its memory pages are allocated. Choosing dedicated cloud infrastructure where hardware topology is transparent, and where VM placement can be deliberately aligned to cache and memory boundaries, is increasingly a prerequisite for deterministic application performance, not an optional configuration refinement.

The Bottom Line

NUMA boundary crossings are a fundamental performance variable in any virtualized multi-socket server environment. They are often invisible to application developers, frequently misconfigured by platform administrators, and systematically under-monitored by standard observability tooling. Yet their impact on memory latency, CPU efficiency, database throughput, and application response time is measurable, consistent, and directly addressable through correct VM sizing, vNUMA configuration, hypervisor feature utilization, and infrastructure selection.

As server processors grow more complex with higher core counts and multi-chiplet designs, NUMA topology awareness becomes more critical, not less. Teams that treat NUMA alignment as a foundational infrastructure concern rather than a tuning footnote are the ones that achieve the consistent, predictable performance that modern cloud applications demand.

Optimizing NUMA topology in virtualized environments is one of the highest-leverage infrastructure changes available to teams experiencing unexplained performance variability, and one of the most consistently overlooked.