When your Kubernetes cluster needs more capacity, use Kubespray’s scale.yml playbook to add a worker node. This guide covers the standard procedure plus real-world troubleshooting.
[01] Environment
| Item | Value |
|---|---|
| Kubespray | inventory/mycluster/ based |
| Kubernetes | v1.28.0 |
| CRI | containerd 1.7.22 |
| CNI | Calico |
| OS | Ubuntu 22.04 (Jammy) |
| Ansible | Python venv environment |
1-1. Existing Cluster
| Role | Hostname | IP |
|---|---|---|
| control-plane + etcd | k8s-master | 192.168.1.91 |
| worker | k8s-worker1 | 192.168.1.92 |
| worker | k8s-worker2 | 192.168.1.93 |
| worker (special HW) | node-a | 192.168.1.111 |
| worker (special HW) | node-b | 192.168.1.113 |
We’re adding k8s-worker3 (192.168.1.94).
[02] Overall Flow
```text
[1] Add the node to inventory
        ↓
[2] SSH public key + NOPASSWD sudo
        ↓
[3] Ansible connectivity test (ping)
        ↓
[4] Run scale.yml
        ↓
[5] Verify with kubectl get nodes
```
[03] Step 1 — Add the Node to Inventory
Edit inventory/mycluster/inventory.yaml and add the entry in both all.hosts and children.kube_node.hosts.
```yaml
all:
  hosts:
    k8s-master:
      ansible_host: 192.168.1.91
    k8s-worker1:
      ansible_host: 192.168.1.92
    k8s-worker2:
      ansible_host: 192.168.1.93
    k8s-worker3:                    # ← add
      ansible_host: 192.168.1.94    # ← add
  children:
    kube_node:
      hosts:
        k8s-worker1:
        k8s-worker2:
        k8s-worker3:                # ← add
```
If you add the node under all.hosts but forget kube_node.hosts, the host is defined but does not belong to the worker group, so scale.yml will skip it.
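Group membership can be checked before running anything. The sketch below assumes the output format of ansible's --list-hosts option (a "hosts (N):" header followed by one indented hostname per line); host_in_list is a hypothetical helper name, not part of Ansible or Kubespray.

```shell
# Succeeds if HOST appears in a host list read from stdin.
# Expects the output of `ansible <pattern> --list-hosts`.
host_in_list() {
  awk -v host="$1" '$1 == host { found = 1 } END { exit !found }'
}

# Usage on the control node (paths as used in this guide):
#   ../venv/bin/ansible -i inventory/mycluster/inventory.yaml kube_node --list-hosts \
#     | host_in_list k8s-worker3 && echo "in kube_node" || echo "NOT in kube_node"
```

Reading the list from stdin keeps the check independent of where the inventory lives, so the same helper works for any group.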
[04] Step 2 — SSH Public Key + NOPASSWD sudo
Kubespray connects over SSH and escalates with become: yes, so both key-based SSH login and passwordless sudo must work on the new node.
4-1. Register SSH Public Key
```shell
ssh-copy-id user@192.168.1.94
```
Verify:
```shell
ssh 192.168.1.94 'hostname'
# Should print hostname without password prompt
```
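Configure NOPASSWD sudo on the new server. sudo still asks for the password this one time, so ssh needs -t to allocate a terminal for the prompt:

```shell
# -t gives sudo a terminal to read the password from
ssh -t 192.168.1.94 \
  "echo 'user ALL=(ALL) NOPASSWD:ALL' | sudo tee /etc/sudoers.d/user && \
   sudo chmod 440 /etc/sudoers.d/user"
```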
Verify:
```shell
ssh 192.168.1.94 'sudo -n whoami'
# → should print 'root'
```
sudo -n runs non-interactively: it fails instead of prompting for a password. If you see “sudo: a password is required”, NOPASSWD is not configured.
[05] Step 3 — Ansible Connectivity Test
Before running the playbook, always confirm connectivity with ping.
```shell
cd ~/kubespray/kubespray
../venv/bin/ansible -i inventory/mycluster/inventory.yaml k8s-worker3 -m ping
```
Expected response:
```text
k8s-worker3 | SUCCESS => {
    "ansible_facts": {
        "discovered_interpreter_python": "/usr/bin/python3"
    },
    "changed": false,
    "ping": "pong"
}
```
[06] Step 4 — Run scale.yml
```shell
LOG=~/kubespray/logs/scale-$(date +%Y%m%d-%H%M%S).log
mkdir -p "$(dirname "$LOG")"   # make sure the log directory exists

../venv/bin/ansible-playbook \
  -i inventory/mycluster/inventory.yaml \
  scale.yml \
  -b \
  --limit=k8s-worker3 \
  > "$LOG" 2>&1 &
```
| Option | Description |
|---|---|
| -b | become (sudo) |
| --limit=k8s-worker3 | Apply playbook to the new node only; no impact on existing workers |
| > "$LOG" 2>&1 & | Background execution, save log to file |
Monitor progress:
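The monitoring command itself is not shown here. A minimal sketch, assuming the LOG variable from the previous step is still set in the same shell, is to follow the log with tail. The last_task helper is a hypothetical name and assumes Ansible's default output, where each task starts with a line beginning "TASK [".

```shell
# Print the most recent TASK header found in an Ansible log read from stdin.
# last_task is a hypothetical helper, not part of Ansible or Kubespray.
last_task() {
  grep '^TASK \[' | tail -n 1
}

# Follow the log live (Ctrl-C stops tail, not the background playbook):
#   tail -f "$LOG"
# Or take a one-off snapshot of the task currently running:
#   last_task < "$LOG"
```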
[07] Troubleshooting — APT Version 404 Error
7-1. Error Message
```text
TASK [kubernetes/preinstall : Install packages requirements] ***
fatal: [k8s-worker3]: FAILED! =>
  "msg": "'/usr/bin/apt-get -y ...
    install 'apt-transport-https=2.4.13' ...' failed:
    E: Failed to fetch .../apt-transport-https_2.4.13_all.deb
    404 Not Found"
```
7-2. Root Cause
The Ansible fact cache was holding an outdated package version.
ansible.cfg had fact caching enabled:
```ini
[defaults]
fact_caching = jsonfile
fact_caching_connection = /tmp
fact_caching_timeout = 86400
```
The cached version, apt-transport-https=2.4.13, was no longer in the Ubuntu repository. It had been superseded by 2.4.14, so the download returned 404 Not Found.
```shell
# Checked on the new server: 2.4.13 is gone, 2.4.14 is the newest available
apt-cache madison apt-transport-https | head -3
# apt-transport-https | 2.4.14 | ... jammy-updates
# apt-transport-https | 2.4.5 | ... jammy
```
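The madison check above can be turned into a yes/no test for one specific version. This is a sketch that assumes the pipe-separated "package | version | source" layout shown in that output; version_available is a hypothetical helper name.

```shell
# Succeeds if VERSION is listed in `apt-cache madison <pkg>` output read from stdin.
version_available() {
  awk -F'|' -v want="$1" '
    { v = $2; gsub(/^[ \t]+|[ \t]+$/, "", v) }   # trim the version column
    (v "") == (want "") { found = 1 }            # force a string comparison
    END { exit !found }
  '
}

# Usage on the new server:
#   apt-cache madison apt-transport-https | version_available 2.4.13 \
#     || echo "2.4.13 is no longer offered by the configured repositories"
```

The comparison is forced to strings so that versions such as 1.10 and 1.1 are never treated as equal numbers.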
7-3. Resolution
```shell
# On the control node — delete the fact cache
rm -rf /tmp/k8s-worker3

# On the new server — force APT metadata refresh
ssh 192.168.1.94 'sudo apt-get update'
```
Re-running the same scale.yml command completes successfully in one pass.
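Because the cache directory in this setup is /tmp, a mistyped or empty node name passed to rm -rf is risky. A slightly more defensive sketch follows; clear_fact_cache is a hypothetical helper, and it assumes the jsonfile cache stores one entry per host, named after the inventory hostname, as the /tmp/k8s-worker3 path above suggests.

```shell
# Remove the fact-cache entry for one host.
# CACHE_DIR must match fact_caching_connection in ansible.cfg (/tmp in this guide).
clear_fact_cache() {
  cache_dir=$1
  host=$2
  # Refuse empty or path-like names so rm can never hit the directory itself.
  case $host in
    '' | . | .. | */*) echo "refusing host name: '$host'" >&2; return 2 ;;
  esac
  if [ -e "$cache_dir/$host" ]; then
    rm -rf -- "$cache_dir/$host"
    echo "removed $cache_dir/$host"
  else
    echo "no cached facts for $host in $cache_dir"
  fi
}

# Usage on the control node:
#   clear_fact_cache /tmp k8s-worker3
```

ansible-playbook also accepts --flush-cache, which clears cached facts for every host in the inventory rather than a single node.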
Key lesson: when a procedure that worked before suddenly fails, a stale cache is often the culprit, because what the cache remembers no longer matches reality. The Ansible fact cache, APT metadata, and Docker image digests are all prone to this.
[08] Step 5 — Verify Results
8-1. Ansible PLAY RECAP
```text
PLAY RECAP ************************************************************
k8s-worker3 : ok=490 changed=77 unreachable=0 failed=0 skipped=772
```
Confirm failed=0 and unreachable=0.
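Since the playbook ran in the background with output redirected to a file, the same check can be scripted against the log. A sketch, assuming the recap line format shown above; recap_ok is a hypothetical helper name.

```shell
# Succeeds only if at least one PLAY RECAP host line is read from stdin
# and every such line reports unreachable=0 and failed=0.
recap_ok() {
  awk '
    /unreachable=[0-9]+/ && /failed=[0-9]+/ {
      seen = 1
      if ($0 !~ /unreachable=0([^0-9]|$)/ || $0 !~ /failed=0([^0-9]|$)/) bad = 1
    }
    END { exit ((seen && !bad) ? 0 : 1) }
  '
}

# Usage on the control node:
#   recap_ok < "$LOG" && echo "scale.yml finished cleanly"
```

A log with no recap line at all counts as a failure, which catches runs that were interrupted before finishing.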
8-2. Kubernetes Node Status
```shell
kubectl get nodes -o wide
```
```text
NAME          STATUS   ROLES           AGE    VERSION   INTERNAL-IP
k8s-master    Ready    control-plane   132d   v1.28.0   192.168.1.91
k8s-worker1   Ready    <none>          132d   v1.28.0   192.168.1.92
k8s-worker2   Ready    <none>          132d   v1.28.0   192.168.1.93
k8s-worker3   Ready    <none>          42s    v1.28.0   192.168.1.94    ← new
node-a        Ready    <none>          23h    v1.28.0   192.168.1.111
node-b        Ready    <none>          23h    v1.28.0   192.168.1.113
```
The new node joined successfully as Ready. Calico CNI is deployed too, so pod scheduling works.
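A freshly joined node can stay NotReady for a short while, so a scripted readiness check is handy. The sketch below assumes the NAME and STATUS columns come first, as in the output above; node_ready is a hypothetical helper name.

```shell
# Succeeds if NODE has STATUS exactly "Ready" in
# `kubectl get nodes --no-headers` output read from stdin.
node_ready() {
  awk -v node="$1" '$1 == node && $2 == "Ready" { ok = 1 } END { exit !ok }'
}

# Usage on the master:
#   kubectl get nodes --no-headers | node_ready k8s-worker3 && echo "k8s-worker3 is Ready"
# kubectl can also block until the condition holds:
#   kubectl wait --for=condition=Ready node/k8s-worker3 --timeout=120s
```

The match is exact on purpose: a cordoned node reports "Ready,SchedulingDisabled" and is treated as not ready for workloads.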
8-3. (Optional) Test Pod Scheduling
```shell
kubectl run test-nginx --image=nginx:alpine \
  --overrides='{"spec":{"nodeSelector":{"kubernetes.io/hostname":"k8s-worker3"}}}'

kubectl get pod test-nginx -o wide
# Confirm NODE is k8s-worker3

# Clean up after testing
kubectl delete pod test-nginx
```
[09] Command Reference
```shell
# On the control node
../venv/bin/ansible -i inventory/mycluster/inventory.yaml <node> -m ping
../venv/bin/ansible-playbook -i inventory/mycluster/inventory.yaml scale.yml -b --limit=<node>
rm -rf /tmp/<node>    # delete fact cache

# On the target server
sudo apt-get update   # refresh APT metadata
sudo -n whoami        # verify NOPASSWD sudo

# On the master
kubectl get nodes -o wide
kubectl describe node <node>
kubectl get pods -A -o wide --field-selector spec.nodeName=<node>
```
[10] Summary
| Step | Task | Key point |
|---|---|---|
| 1 | Inventory registration | Add to both hosts + kube_node.hosts |
| 2 | SSH / sudo | ssh-copy-id + NOPASSWD required |
| 3 | Connectivity test | Pre-check with ansible -m ping |
| 4 | scale.yml | Use --limit=<node> to target only the new node |
| 5 | Verification | PLAY RECAP + kubectl get nodes double check |
| (trouble) | APT 404 | Delete fact cache + apt-get update |