Azure Local – Design the infrastructure


I have seen many Azure Local designs run into issues that could have been prevented by following a few best practices. In this article I share some of the things I have come across that should never have been built the way they were. I have also included drawings of minimal designs based on my experience.

Pre-deployment

Break-glass accounts

Azure Local hardens the built-in Administrator account during cluster creation. It renames the account and changes the password as part of its security hardening procedure — and the new credentials are not documented anywhere. If you later lose network connectivity to the domain (for example during an update where a NIC driver fails), you will not be able to log in locally because the original credentials no longer work.

The solution is to create at least two local administrator accounts on each node immediately after OS installation, before the cluster is deployed. Test them, document them, and store the credentials in a password vault. Do not reuse them across nodes: both the account names and the passwords should be unique per node.

net user <nodename>BreakGlass1 "YourSecurePassword14+" /add
net localgroup Administrators <nodename>BreakGlass1 /add

net user <nodename>BreakGlass2 "AnotherSecurePassword14+" /add
net localgroup Administrators <nodename>BreakGlass2 /add

Do this before joining the domain or starting the cluster deployment. Without break-glass accounts, a lockout scenario can result in a full rebuild.
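The same accounts can also be created with the built-in LocalAccounts cmdlets, which avoids putting the password on the command line. This is a minimal sketch, assuming Windows PowerShell 5.1 on the node. The BG1/BG2 suffixes are placeholder names I chose to stay within the 20-character limit for local account names, not a required naming scheme.

```powershell
# Sketch: create two per-node break-glass accounts and confirm group membership.
# The "BG1"/"BG2" suffixes are illustrative placeholders.
foreach ($suffix in 'BG1', 'BG2') {
    # Computer names are at most 15 characters, so this stays within 20 characters
    $name     = '{0}-{1}' -f $env:COMPUTERNAME, $suffix
    $password = Read-Host -Prompt "Password for $name" -AsSecureString

    New-LocalUser -Name $name -Password $password -Description 'Break-glass account'
    Add-LocalGroupMember -Group 'Administrators' -Member $name
}

# Verify that both accounts exist, are enabled, and are local administrators
Get-LocalUser -Name ('{0}-BG*' -f $env:COMPUTERNAME) | Select-Object Name, Enabled
Get-LocalGroupMember -Group 'Administrators'
```

After creating the accounts, sign in once with each of them at the console to confirm the stored credentials actually work.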

Password requirements

Azure Local requires passwords of 14–72 characters containing at least 3 of: uppercase, lowercase, numbers, and special characters. Make sure both the local administrator password and LCM User passwords meet these requirements from the start.
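To check a candidate password against these rules before deployment, something like the following can be used. It is a rough sketch of the length and character-class rule as described above. The function name Test-AzLocalPassword is my own, and Microsoft's validation may restrict which special characters are accepted, so treat a pass here as necessary rather than sufficient.

```powershell
# Sketch: checks length (14-72) and that at least 3 of 4 character classes are present.
# Test-AzLocalPassword is a hypothetical helper, not part of any Azure Local module.
function Test-AzLocalPassword {
    param([string]$Password)

    $lengthOk = ($Password.Length -ge 14) -and ($Password.Length -le 72)

    # Count how many of the four character classes appear in the password
    $classes = 0
    if ($Password -cmatch '[A-Z]')        { $classes++ }   # uppercase
    if ($Password -cmatch '[a-z]')        { $classes++ }   # lowercase
    if ($Password -match  '[0-9]')        { $classes++ }   # numbers
    if ($Password -match  '[^A-Za-z0-9]') { $classes++ }   # special characters

    return ($lengthOk -and ($classes -ge 3))
}

Test-AzLocalPassword 'Short1!'                   # False (too short)
Test-AzLocalPassword 'correct-horse-Battery-9'   # True
```

Only run this in a session you trust, since the password is passed as plain text.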

Active Directory

At some point we will likely not need Active Directory at all. Deployment with local identity and Azure Key Vault is now generally available, so that option already exists. For now, though, let's keep Active Directory in our deployment scope. Source: https://learn.microsoft.com/en-us/azure/azure-local/deploy/deployment-local-identity-with-key-vault?view=azloc-2604

Azure Local joined to the production domain

I see it all the time: Azure Local has been joined to the same domain as production, where all kinds of member servers and clients are also joined. (Client devices should not be in any domain; they should be joined to Entra ID and only reach domain member servers via Cloud Kerberos Trust.) This is not a recommended approach. Yes, if your domain is fully configured with LAPS, tiering, Kerberos armoring, older protocols disabled and so on, you would have a “safe” domain that you could join your Azure Local nodes to. However, I would still not do it. I always prefer a dedicated management Active Directory for Azure Local, because I can then segment this management domain completely from all other areas of my environment. In some industries this is referred to as “out-of-band”. This management Active Directory should only be reachable from the Azure Local nodes' management network and from a secured instance of Azure Bastion.

Active Directory for Azure Local running on top of stack

I believe this speaks for itself: running the Active Directory that Azure Local is joined to on top of the same stack is bound to fail at some point, because the stack then depends on domain controllers that only run when the stack itself is healthy. That is why I like running the management Active Directory in Azure. There is no need for extra hardware for the management domain, just a few small VMs in Azure.

Active Directory DNS Servers

When deploying Azure Local 23H2 and newer, we need to be aware that the DNS servers we specify for the MOC IP range are very hard to change afterwards (it requires redeployment of the MOC). If possible, deploy a private DNS resolver in Azure, use Azure Firewall with DNS Proxy, or use another DNS proxy method. In some cases I just use the IP addresses of the Active Directory DNS servers directly, but I always make sure the customer understands this deployment option and that they cannot simply swap the IP addresses of those DNS servers later without a large MOC redeployment job.

Networking

One Azure Local stack per management VLAN

You may run into serious trouble if you deploy multiple Azure Local stacks on the same management VLAN. Azure Local stacks must also always have separate VLANs for storage when using the storage configuration with a storage switch between nodes.

Compute VLANs can be shared between stacks.

You would think that sharing one large management network across all your stacks would not cause any issues as long as no IP addresses overlap.

From a security perspective, a separate VLAN for each Azure Local management network is recommended anyway, but there is another issue. Say you install Azure Local on one stack and later deploy a new Azure Local stack on one of its nodes (for example because problems force a total redeployment, where you build a new stack on the first node, move the workloads, and then reinstall the remaining nodes into the new stack). In that case the MOC ARB can reuse the same MAC address, which causes an unstable network and loss of access from Azure to the Azure Local stacks.

Azure Local MOC starts its MAC address assignment with the four hex characters 02-ec. The rest of the MAC address is calculated, which should ensure that no MAC address is ever reused.

However, I have seen a real-life example of that not holding true when performing the scenario I described earlier in this section.
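If two stacks already share a management VLAN, one way to look for this problem is to collect the VM MAC addresses from a node in each stack and compare them. The sketch below is only that: Get-MocStyleMac and Get-MacOverlap are helper names I made up, and it assumes the Hyper-V module is available on the nodes for the collection step.

```powershell
# Sketch: find 02-EC prefixed MAC addresses that appear in two stacks.
# Both helper functions are hypothetical and only normalise and compare strings.
function Get-MocStyleMac {
    param([string[]]$MacAddress)
    $MacAddress |
        ForEach-Object { ($_ -replace '[-:]', '').ToUpper() } |   # strip separators, unify case
        Where-Object { $_ -like '02EC*' } |
        Sort-Object -Unique
}

function Get-MacOverlap {
    param([string[]]$StackA, [string[]]$StackB)
    $b = Get-MocStyleMac $StackB
    Get-MocStyleMac $StackA | Where-Object { $b -contains $_ }
}

# Collection step, run on a node in each stack (requires the Hyper-V module):
# $macs = (Get-VMNetworkAdapter -VMName *).MacAddress

# Example with made-up addresses
Get-MacOverlap -StackA '02-EC-01-00-00-01', '00155D000001' -StackB '02ec01000001', '02EC01000002'
# -> 02EC01000001
```

Any address returned by the comparison is in use on both stacks and is worth investigating.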

Conclusion: use one management VLAN per Azure Local stack. Use subnetting to segment your network into smaller ranges.

You can start with a 10.0.0.0/8 IP range.

If you have an Azure Local stack with 2 nodes, you need at least the following number of IP addresses:

  • 1 for network address (standard first reserved IP)
  • 1 for default gateway
  • 1 for broadcast traffic (standard last reserved IP)
  • 1 per node
  • 6 addresses for Azure Local infrastructure

Consider leaving some IP addresses spare. The list above adds up to 11 addresses for 2 nodes, so I would go with a /28 subnet (16 addresses).

So my first subnet would be:

  • 10.0.0.0/28

And my next subnets would be:

  • 10.0.0.16/28
  • 10.0.0.32/28
  • 10.0.0.48/28
  • 10.0.0.64/28

And so on.
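The arithmetic behind this can be sketched in a few lines. The address counts come from the list above. The variable names are my own, and the subnet listing is deliberately simple: it only walks the last octet, so it is meant for small subnets like these.

```powershell
# Addresses needed on the management subnet for a given node count
$nodes    = 2
$required = 3 + $nodes + 6    # network + gateway + broadcast, one per node, six for infrastructure

# Find the smallest subnet (longest prefix) that still holds $required addresses
$prefix = 32
while ([math]::Pow(2, 32 - $prefix) -lt $required) { $prefix-- }
$size = [int][math]::Pow(2, 32 - $prefix)

"Required: $required, subnet: /$prefix ($size addresses)"
# -> Required: 11, subnet: /28 (16 addresses)

# First five management subnets, one per Azure Local stack
$subnets = 0..4 | ForEach-Object { '10.0.0.{0}/{1}' -f ($_ * $size), $prefix }
$subnets
```

With 16 addresses per subnet and 11 in use, each stack has 5 addresses left over for growth.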

RDMA (Remote Direct Memory Access) support on network adapter cards is essential when working with Storage Spaces Direct (S2D) in Azure Stack HCI or Azure Local environments because it enables low-latency, high-throughput communication between nodes with minimal CPU overhead. RDMA allows data to be transferred directly between the memory of different servers without involving the CPU, significantly improving performance for storage traffic. This is critical for S2D, which relies heavily on east-west network traffic to mirror and distribute storage across cluster nodes. Without RDMA, performance suffers due to increased latency and CPU load, reducing the efficiency and scalability of the system.

RDMA protocol: iWARP vs RoCE

If you decide to use RDMA, you need to choose between two protocols:

  • RoCE (RDMA over Converged Ethernet) — UDP-based. Requires careful QoS configuration on your Top-of-Rack switches, especially Data Center Bridging (DCB) with Priority Flow Control (PFC). RoCE is sensitive to packet loss, so QoS planning is critical during high network activity such as failovers and backup/restore operations.
  • iWARP — TCP-based. More forgiving on switch configuration since TCP handles congestion control natively. Easier to deploy but may have slightly different performance characteristics.

The important point is that your switches must support the protocol you choose. Make sure to confirm this with your hardware vendor before ordering network equipment. If your switches do not support RoCE or iWARP, you will be limited to non-RDMA storage traffic which significantly impacts performance.
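On the node side, a quick way to see what the adapters report is the built-in NetAdapter cmdlets. This is only a starting point: it shows whether RDMA is enabled per adapter, not whether the adapter is set to RoCE or iWARP, which is usually a vendor-specific advanced property.

```powershell
# RDMA state per network adapter as reported by the driver
Get-NetAdapterRdma | Select-Object Name, InterfaceDescription, Enabled

# What SMB (and therefore Storage Spaces Direct traffic) considers RDMA capable
Get-SmbClientNetworkInterface | Select-Object FriendlyName, RdmaCapable, Speed
```

If an adapter you planned to use for storage is missing from the first list or shows as not capable in the second, sort that out with the hardware vendor before deployment.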

Lack of RSS support (Mandatory)

RSS (Receive Side Scaling) support on network adapter cards is important when using Storage Spaces Direct (S2D) in Azure Local because it ensures efficient processing of high-volume network traffic across multiple CPU cores. S2D generates substantial east-west traffic between cluster nodes, and without RSS, all incoming traffic would be handled by a single core, creating a bottleneck. RSS distributes network processing across multiple cores, improving throughput and reducing latency. This parallel processing is essential for maintaining consistent performance and maximizing the benefits of S2D in high-performance, hyper-converged infrastructure environments.
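RSS can be checked in the same way. The adapter name in the commented line is a placeholder, not a name Azure Local assigns.

```powershell
# RSS state per network adapter
Get-NetAdapterRss | Select-Object Name, InterfaceDescription, Enabled

# SMB's view of RSS capability on each interface
Get-SmbClientNetworkInterface | Select-Object FriendlyName, RssCapable

# Enable RSS on an adapter that supports it but has it turned off ('NIC1' is a placeholder)
# Enable-NetAdapterRss -Name 'NIC1'
```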

My go-to network adapter card

When possible, I go for Mellanox. Mellanox has good support for Azure Local. I have often used ConnectX-2 LX and ConnectX-4 LX.

Switchless networking for storage

In some setups I have a storage intent without a switch. This is referred to as switchless and is supported for up to 4 nodes (check the requirements if you are aiming for 4 nodes, because that requires 6 storage NICs per node).

However, at the time of writing, Microsoft does not support expanding a stack that started as a 2-node switchless setup to 3 or 4 nodes. This can be a key reason to include a supported storage switch from the start, because we can then go for 2 or 4 storage NICs per node and keep the option to add more nodes in the future.

Compute intent with DHCP IP addresses

I often see DHCP enabled on compute network VLANs. That can be valid for workloads running on top of the Azure Local stack. However, there is no need to assign any static or DHCP IP addresses on the untagged VLAN going to the NICs in the compute intent. You could set up a closed untagged VLAN on the compute intent with a static IP address on each node, just so the failover cluster recognizes the network as compute, but it is not required. I like to configure only tagged VLANs on the compute intent, because it is only used for workloads running on top of the Azure Local stack.

Static IP addresses on management network

Azure Local cluster validation requires static IP addresses on the management network. DHCP will cause deployment failures. Configure static IPs using SCONFIG (Option 8) during the initial OS setup on each node.

While configuring networking, also make sure to disable all disconnected network adapters. Enabled but disconnected NICs can cause validation failures during cluster deployment. If you have physical ports that are not cabled, disable the adapter in the OS before starting the deployment.
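Disabling the unused ports can be done in one pass with the NetAdapter cmdlets. The sketch below previews the change first with -WhatIf, which I would recommend so you do not disable a port that is only temporarily unplugged.

```powershell
# Preview: physical adapters that are enabled but have no link
Get-NetAdapter -Physical |
    Where-Object Status -eq 'Disconnected' |
    Disable-NetAdapter -WhatIf

# When the preview looks right, run it for real:
# Get-NetAdapter -Physical | Where-Object Status -eq 'Disconnected' | Disable-NetAdapter -Confirm:$false

# Confirm the result
Get-NetAdapter -Physical | Select-Object Name, InterfaceDescription, Status
```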

Software-Defined Networking (SDN)

SDN enables zero-trust micro-segmentation using Network Security Groups on Azure Local. It is a powerful feature, but it adds significant operational complexity.

There is one critical design constraint to be aware of: SDN is a one-way door. You can choose to enable SDN after the initial deployment — it does not need to be part of day one. But once deployed, you cannot remove it. There is no supported way to uninstall SDN from an Azure Local cluster.

My recommendation is to defer SDN unless the organization is ready for the additional operational overhead. You can always add it later when the team has more maturity with the platform.

Storage

Disk controller configuration

Storage Spaces Direct requires data disks to be presented in JBOD, HBA, or pass-through mode — not hardware RAID. If the data disks are behind a RAID controller, S2D will not be able to manage them and deployment will fail.

However, the OS disk should have RAID 1 protection for redundancy. This means you typically need two separate controllers:

  • Controller 1: OS disk in RAID 1 mode (using a BOSS card or a dedicated RAID controller)
  • Controller 2: Data disks in HBA/JBOD mode for Storage Spaces Direct

You can verify that data disks are correctly presented by checking BusType:

Get-PhysicalDisk | Select-Object FriendlyName, BusType, MediaType, Size

If BusType shows RAID, Storage Spaces Direct will not work with those disks. It should show SAS, SATA, or NVMe.
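To narrow the output down to the problem disks, the same cmdlet can be filtered. A small sketch: note that the OS disk may legitimately show up in the first list if it sits behind a RAID 1 controller, which is expected.

```powershell
# Disks presented through a RAID controller (not usable as S2D data disks)
Get-PhysicalDisk |
    Where-Object BusType -eq 'RAID' |
    Select-Object FriendlyName, SerialNumber, BusType, Size

# Disks that Windows considers eligible for a storage pool
Get-PhysicalDisk -CanPool $true |
    Select-Object FriendlyName, BusType, MediaType, Size
```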

HINT: This is primarily relevant for non-certified hardware or test rigs. All certified Azure Local hardware configurations include BOSS cards or equivalent solutions for the OS disk out of the box.

Cluster Shared Volumes

Microsoft recommends configuring one CSV per node in the cluster and dividing workloads across these CSVs. This is also the default when deploying a new stack. So if our stack consists of 4 nodes, we should have 4 CSVs, and in production we should make sure that each node owns one CSV so that ownership is spread across the nodes.
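Ownership tends to drift after updates and node restarts, so it is worth checking now and then. The sketch below lists the current owners and moves CSVs round-robin across the nodes that are up. The round-robin assignment is my own simplification, assuming one CSV per node and the FailoverClusters module on a cluster node.

```powershell
# Requires the FailoverClusters module; run on a cluster node
$nodes = @(Get-ClusterNode | Where-Object State -eq 'Up' | Sort-Object Name)
$csvs  = @(Get-ClusterSharedVolume | Sort-Object Name)

# Current ownership
$csvs | Select-Object Name, OwnerNode, State

# Round-robin: CSV number i is assigned to node number (i mod node count)
for ($i = 0; $i -lt $csvs.Count; $i++) {
    $target = $nodes[$i % $nodes.Count].Name
    if ($csvs[$i].OwnerNode.Name -ne $target) {
        Move-ClusterSharedVolume -Name $csvs[$i].Name -Node $target
    }
}
```

Moving CSV ownership is an online operation, but I would still do it outside peak hours.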

Cache and capacity storage

Some customers go for all-flash SSD, but we can gain more speed and a better price-to-capacity ratio if we use a tiered model.

We can use a combination of NVMe, SSD, and HDD in Azure Local (Storage Spaces Direct). This is a common and recommended approach for building a hybrid storage configuration that balances performance and capacity.

Here’s how it typically works:

  • NVMe drives are used as the cache tier (also called performance tier).
  • SSD and/or HDD drives are used as the capacity tier (also called capacity devices).

Benefits:

  • NVMe Cache Tier: Accelerates read/write performance by buffering data before it’s written to slower disks.
  • HDD Capacity Tier: Offers large storage capacity at lower cost per TB.
  • SSD Capacity Tier (optional): Used when you want faster capacity storage but without the high cost of all-NVMe.

Key Considerations:

  • All nodes in the cluster should have a consistent drive layout.
  • NVMe cache is especially beneficial in write-intensive or latency-sensitive workloads.
  • S2D automatically manages tiering between cache and capacity layers.

Typical Example:

A 2-tier configuration might look like:

  • 2 × NVMe drives (cache)
  • 4 × HDDs or SATA SSDs (capacity)

This setup is fully supported and optimized in Azure Local for performance and efficiency.
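To see how the drives in a node are actually classified, the physical disks can be grouped by bus type, media type, and usage. This is a sketch based on my understanding that cache devices are reported with the usage Journal once Storage Spaces Direct is enabled. Verify this against your own environment.

```powershell
# Count drives per bus type, media type and usage
# (after S2D is enabled, cache devices are expected to show Usage = Journal)
Get-PhysicalDisk |
    Group-Object BusType, MediaType, Usage |
    Sort-Object Name |
    Select-Object Count, Name
```

Running this on each node and comparing the output is a simple way to confirm that all nodes have the same drive layout.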

Cluster sizing

Single-node clusters

Azure Local supports single-node clusters, but they are not recommended for production workloads. Azure Local is designed with the assumption of 24/7 uptime and redundancy — similar to how Azure datacenters operate. A single-node cluster breaks this model:

  • No redundancy during updates or hardware failures
  • No failover capability for virtual machines
  • The Azure Portal experience degrades during maintenance windows since the only node is being serviced

Single-node clusters are suitable for testing, labs, and proof-of-concept environments. For production workloads, deploy a minimum of 2 nodes to maintain service availability during updates and failures.

Azure resource design

Subscription and resource group strategy

When planning the Azure-side resource structure for Azure Local, keep in mind that Microsoft limits you to one custom location per subscription and one custom location per Azure Local cluster. Resources also cannot be moved between subscriptions or resource groups after creation.

My recommendation is:

  • One subscription per Azure Local cluster — each subscription represents a datacenter site
  • One resource group per cluster — contains all cluster resources

This approach gives clean separation between sites, simplified billing per location, and easy decommissioning — if a cluster is retired, you can clean up the entire resource group or subscription without affecting other environments.

The key takeaway is to plan your resource group and subscription strategy upfront. Reorganizing later is not supported.

Designs

Design 1: 2-node setup with storage switch

Design 2: 2-node setup with switchless storage