
Azure Local - Kubernetes - Part 5 - Fix: Arc Data Controller stuck deploying due to insufficient memory


Intro

This article is part of a series.

After deploying an Azure Arc Data Controller on my AKS Arc cluster on Azure Local (see Part 4), the data controller was stuck in a DeployingController state for over 2 hours. The Azure Portal kept showing that the resource was “currently being created” with no progress.

In this article I will walk through how I identified the root cause — insufficient memory on the worker node — and how I fixed it by scaling the node pool.

The symptoms

After completing the data controller deployment wizard in the Azure Portal, the resource appeared in the Azure Arc data controllers list but the status remained Deploying. Even after waiting 2 hours, nothing changed.

Checking the data controller status from the CLI confirmed it was stuck:

az resource show \
  --name arcdc-k8s-azhcickj4 \
  --resource-group rg-ckj-azl-lab-westeurope \
  --resource-type "Microsoft.AzureArcData/dataControllers" \
  --query "properties.k8sRaw.status" -o json

The state showed DeployingController instead of Ready.

Identifying the root cause

I started by checking the pods in the arc-data-services namespace:

kubectl get pods -n arc-data-services

This revealed two problematic pods:

  • controldb-0 — stuck in Pending (0/1 ready) for over 24 hours
  • control-4wvzg — had restarted 143 times in a crash loop, because it depends on controldb
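On a busier namespace, pods like these can be picked out by filtering the listing. The following is a minimal sketch, not part of the original troubleshooting: `unhealthy_pods` is a hypothetical helper name, the restart threshold of 5 is an arbitrary default, and the filter assumes the default column order of `kubectl get pods --no-headers` (NAME, READY, STATUS, RESTARTS, AGE).

```shell
# Reads `kubectl get pods --no-headers` output on stdin and prints
# "name status restarts" for every pod that is not fully ready or has
# restarted more often than the threshold (first argument, default 5).
# Hypothetical helper for illustration only.
unhealthy_pods() {
  threshold="${1:-5}"
  awk -v max="$threshold" '
    $3 == "Completed" { next }          # finished jobs are not a problem
    {
      split($2, ready, "/")             # READY column looks like "0/1"
      if (ready[1] != ready[2] || $4 + 0 > max)
        print $1, $3, $4
    }'
}
```

It would be used as `kubectl get pods -n arc-data-services --no-headers | unhealthy_pods 10`.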

The other pods (bootstrapper, metricsdb-0, etc.) were all running fine. The PVCs were also all Bound, so storage was not the issue:

kubectl get pvc -n arc-data-services

To find out why controldb-0 could not be scheduled, I described the pod:

kubectl describe pod controldb-0 -n arc-data-services

The Events section at the bottom showed the scheduling failure:

0/2 nodes are available: 1 Insufficient memory, 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }

This told me two things:

  1. The control plane node will not accept workloads because it has a NoSchedule taint (this is expected and correct).
  2. The worker node does not have enough unreserved memory (allocatable memory minus the requests of pods already scheduled there) for the controldb pod’s 4Gi memory request.

I confirmed this by checking the worker node’s allocatable resources:

kubectl describe node <worker-node-name>

The worker node (VM size Standard_A4_v2) had only ~6Gi allocatable memory. Between the bootstrapper, control pod, and other system pods already running, there was not enough left for controldb’s 4Gi request.
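The scheduler's decision comes down to simple arithmetic: allocatable memory minus the memory already requested by pods on the node must be at least the new pod's request. A rough sketch of that check follows. `to_mib` and `pod_fits` are hypothetical helper names, and only whole-number quantities with the binary suffixes Ki, Mi, Gi (or plain bytes) are handled, which is an assumption and does not cover every Kubernetes quantity format.

```shell
# Convert a Kubernetes memory quantity to whole MiB.
# Handles integer values with Ki/Mi/Gi suffixes or plain bytes only.
to_mib() {
  case "$1" in
    *Gi) echo $(( ${1%Gi} * 1024 )) ;;
    *Mi) echo "${1%Mi}" ;;
    *Ki) echo $(( ${1%Ki} / 1024 )) ;;
    *)   echo $(( $1 / 1048576 )) ;;   # plain bytes
  esac
}

# pod_fits ALLOCATABLE ALREADY_REQUESTED POD_REQUEST
# Succeeds (exit 0) when the pod request fits into what is left.
pod_fits() {
  free=$(( $(to_mib "$1") - $(to_mib "$2") ))
  [ "$free" -ge "$(to_mib "$3")" ]
}
```

With the ~6Gi allocatable from above and a hypothetical 3Gi already requested by other pods, `pod_fits 6Gi 3Gi 4Gi` fails, which matches the Insufficient memory event. The real inputs are listed under Allocatable and Allocated resources in the `kubectl describe node` output.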

The fix — scale the node pool

Since my cluster had only a single Standard_A4_v2 worker node (4 vCPUs, 8 GiB RAM, ~6Gi allocatable), the simplest fix was to add a second worker node to spread the workload:

az aksarc nodepool scale \
  --name nodepool1 \
  --cluster-name k8s-azhcickj4 \
  --resource-group rg-ckj-azl-lab-westeurope \
  --node-count 2

After the second node joined the cluster, the Kubernetes scheduler automatically placed controldb-0 on the new node. Within 5–15 minutes, controldb was running, the control pod stopped crash-looping, and the data controller state transitioned from DeployingController to Ready.

No manual pod restarts or other intervention was needed — the scheduler handled everything once capacity was available.
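Instead of re-running the status check by hand, the wait can be scripted with a small polling loop. This is a sketch under assumptions: `wait_for_state` is a hypothetical helper, and it expects the wrapped command to print nothing but the state string.

```shell
# wait_for_state EXPECTED ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND repeatedly until its output equals EXPECTED.
# Returns 0 on a match, 1 if all attempts are used up.
# Hypothetical helper for illustration only.
wait_for_state() {
  expected="$1"; attempts="$2"; delay="$3"
  shift 3
  i=0
  while [ "$i" -lt "$attempts" ]; do
    state=$("$@" 2>/dev/null) || state=""   # treat command errors as "not yet"
    if [ "$state" = "$expected" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

A possible invocation, polling every 30 seconds for up to 15 minutes, is `wait_for_state Ready 30 30 az resource show --name arcdc-k8s-azhcickj4 --resource-group rg-ckj-azl-lab-westeurope --resource-type "Microsoft.AzureArcData/dataControllers" --query "properties.k8sRaw.status.state" -o tsv`. The `.state` sub-path in the query is an assumption about where the state string sits inside the status object; adjust it to whatever the status query shown earlier returns.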

Final remark

The Azure Arc Data Controller has significant memory requirements. The controldb pod alone requests 4Gi of memory. On a Standard_A4_v2 worker node with ~6Gi allocatable, there is not enough room for controldb alongside the other Arc data services pods.

If you are planning to run Arc data services on AKS Arc, make sure your worker nodes have at least 16 GiB of RAM, or run multiple smaller workers so the scheduler can spread the load. In a lab environment, scaling the node pool to 2 nodes is the quickest path forward.