On the node that will become the manager, initialize the Docker swarm cluster with “docker swarm init”. The command also prints the command to run on a worker node to join the cluster. If we lose that command, run “docker swarm join-token worker” on the manager node to get it again, then run it on the nodes that should join the cluster as worker nodes.
Run “docker swarm join-token manager” to get the equivalent command for nodes that should join the cluster as manager nodes.
Now let’s run it on the nodes that should join the cluster as manager nodes.
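The steps so far can be sketched as a few shell functions. This is only an illustration: the function names and the RUN variable are hypothetical helpers, the --advertise-addr flag is optional and assumed here for hosts with more than one address, and 2377 is the default swarm management port.

```shell
RUN="${RUN:-}"   # set RUN=echo for a dry run that only prints the commands

# On the first manager. The argument is a placeholder for that node's IP.
swarm_init() { $RUN docker swarm init --advertise-addr "$1"; }

# On a manager: print the join command for a role, "worker" or "manager".
join_token() { $RUN docker swarm join-token "$1"; }

# On the node that should join. Token and manager address come from join_token.
swarm_join() { $RUN docker swarm join --token "$1" "$2:2377"; }
```

With RUN=echo, calling for example join_token manager prints the command instead of running it, which is a safe way to check it before touching the cluster.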
By default, manager nodes also take tasks like worker nodes, but we can disable this by “draining” the node.
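Draining is done with “docker node update”. A minimal sketch follows; the function names and the RUN variable are hypothetical, and the node name is whatever “docker node ls” shows for that manager.

```shell
RUN="${RUN:-}"   # set RUN=echo for a dry run that only prints the commands

# Stop scheduling tasks on the node and move its running tasks elsewhere.
drain_node() { $RUN docker node update --availability drain "$1"; }

# Let the node accept tasks again.
activate_node() { $RUN docker node update --availability active "$1"; }
```

For example, drain_node ip-172-31-41-147 would keep that manager free of service tasks.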
I’ve added one more node as a manager node, as recommended by Docker: https://docs.docker.com/engine/swarm/admin_guide/#add-manager-nodes-for-fault-tolerance
As we can see, ip-172-31-41-147 is a manager node and also the “Leader”, which distributes tasks to the worker nodes.
Scenario 1: Stopping “Leader” nodes one by one.
Stopping the instance ip-172-31-41-147: ip-172-31-41-147 is marked as “Unreachable” and ip-172-31-35-29 is marked as “Leader”.
Bringing the instance back online won’t change the “Leader”, but we can still manage the cluster using ip-172-31-35-29.
Now also stopping ip-172-31-35-29: we are no longer able to run cluster commands from ip-172-31-18-175, because the cluster has lost its quorum.
Recover from losing the quorum: https://docs.docker.com/engine/swarm/admin_guide/#recover-from-losing-the-quorum
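The recovery described in that guide comes down to re-initializing the swarm on a surviving manager with --force-new-cluster. A sketch, assuming it is run on ip-172-31-18-175; the function names and the RUN variable are hypothetical, and the address argument is a placeholder for that node's own IP.

```shell
RUN="${RUN:-}"   # set RUN=echo for a dry run that only prints the commands

# On the surviving manager: create a new single-manager cluster from its state.
force_new_cluster() {
  $RUN docker swarm init --force-new-cluster --advertise-addr "$1"
}

# Check the resulting node list and manager status.
list_nodes() { $RUN docker node ls; }
```

The node keeps its existing knowledge of services and workers, so nothing has to be redeployed.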
After forcing the new cluster, below is the status of the nodes.
We can manage the existing services.
Now I’ve started ip-172-31-35-29 and, as expected, it is not able to run any cluster commands. So I made the node leave the old cluster and join the cluster again as a manager node.
But now there are two IDs for the same node ip-172-31-35-29 (an old one and a new one), so remove the old one.
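The rejoin and cleanup can be sketched as follows. The function names and the RUN variable are hypothetical; the token comes from “docker swarm join-token manager” on the active manager, and the stale node ID from “docker node ls”. The demote step is an assumption based on Docker's rule that a manager entry must be demoted before it can be removed.

```shell
RUN="${RUN:-}"   # set RUN=echo for a dry run that only prints the commands

# On the restarted node: drop its stale membership, then join again as a manager.
rejoin_as_manager() {
  $RUN docker swarm leave --force &&
  $RUN docker swarm join --token "$1" "$2:2377"
}

# On the active manager: remove the old, unreachable entry by its node ID.
remove_stale_node() {
  $RUN docker node demote "$1" &&
  $RUN docker node rm "$1"
}
```

Using the node ID rather than the hostname matters here, because both entries share the hostname ip-172-31-35-29.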
Scenario 2: Stopping “Non-Leader” nodes. We can still manage the cluster after stopping a single manager node, as long as a majority of the managers stays available.
As mentioned in the Docker documentation: “For instance, whether you have 3 or 4 managers, you can still only lose 1 manager and maintain the quorum. If you have 5 or 6 managers, you can still only lose two.”
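The numbers in that quote follow from the majority rule: with N managers the quorum is floor(N/2) + 1, so the cluster tolerates floor((N-1)/2) failed managers. A small sketch of the arithmetic (the function names are illustrative only):

```shell
# Majority needed for N managers: floor(N/2) + 1.
quorum() { echo $(( $1 / 2 + 1 )); }

# Managers that can fail while the rest still form a majority: floor((N-1)/2).
fault_tolerance() { echo $(( ($1 - 1) / 2 )); }

for n in 1 2 3 4 5 6 7; do
  echo "managers=$n quorum=$(quorum "$n") tolerated_failures=$(fault_tolerance "$n")"
done
```

This is why an even number of managers adds no fault tolerance over the odd number just below it.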
Scenario 3: Losing all manager nodes. To restore, we need a backup of /var/lib/docker/swarm from a previous manager node. Reference: https://docs.docker.com/engine/swarm/admin_guide/#recover-from-disaster
I’ve provisioned a new node, ip-172-31-12-118, and got Docker running on it.
Copied the backup and restored it on the new node.
Start the Docker swarm with force new cluster.
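The backup and restore steps can be sketched as below. Several things here are assumptions rather than what was actually run: the host uses systemd (systemctl), the commands run as root, the backup is a tar archive whose path is passed in, and the function names and RUN variable are hypothetical.

```shell
RUN="${RUN:-}"   # set RUN=echo for a dry run that only prints the commands
SWARM_DIR=/var/lib/docker/swarm

# On a healthy manager: stop Docker so the state is consistent, archive it, restart.
backup_swarm() {
  $RUN systemctl stop docker &&
  $RUN tar -czf "$1" -C /var/lib/docker swarm &&
  $RUN systemctl start docker
}

# On the replacement node: restore the archive, then force a new cluster.
restore_swarm() {
  $RUN systemctl stop docker &&
  $RUN rm -rf "$SWARM_DIR" &&
  $RUN tar -xzf "$1" -C /var/lib/docker &&
  $RUN systemctl start docker &&
  $RUN docker swarm init --force-new-cluster --advertise-addr "$2"
}
```

The second argument of restore_swarm is a placeholder for the new node's own address. Since restore_swarm deletes the existing swarm directory, a dry run with RUN=echo is worth doing first.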
Now the cluster can recognize the nodes and existing services, but we still cannot manage them using the new manager node.
On one of the worker nodes, two instances of the “web” service are running.
We cannot join the worker node to the new cluster directly, as it is still part of the old swarm cluster, so it has to leave first. Leaving the swarm also deleted the “web” service instances from the worker node.
As soon as I joined the node to the new cluster as a worker node, the “web” service started again.
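Moving a worker from the old swarm to the restored one can be sketched like this. The function names and the RUN variable are hypothetical; the token comes from “docker swarm join-token worker” on the new manager, and 2377 is the default swarm port.

```shell
RUN="${RUN:-}"   # set RUN=echo for a dry run that only prints the commands

# On the worker: leave the old swarm (this stops its tasks), then join the new one.
move_worker() {
  $RUN docker swarm leave &&
  $RUN docker swarm join --token "$1" "$2:2377"
}

# On the new manager: see where the tasks of the "web" service are running.
web_tasks() { $RUN docker service ps web; }
```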