Anaconda Enterprise (AE) uses a container-based architecture and cluster functionality to provide built-in high availability and fault tolerance for core AE services (e.g., repository, authentication, and UI), core Kubernetes services (e.g., Kubernetes API server, etcd cluster members), and user sessions and deployments.
High availability for all of these services is configured automatically in any AE cluster that contains three or more nodes. Nodes can be added to an existing AE cluster at any time, so you can enable high availability at any point after the initial installation and expand the AE cluster on demand.
High Availability Architecture Diagram
In the above high availability architecture diagram, the AE Master (Node 1) runs core Kubernetes services as well as the containerized AE services that write persistent data to disk. The AE Kubernetes Masters (Nodes 2 and 3) run replicas of the core Kubernetes services, and the AE Workers (Nodes 4, 5, and greater) run containerized AE core services as well as user sessions and deployments.
Note that the containerized AE core services, user sessions, and deployments can run on any available AE Worker node, and the diagram above shows only one example of where those services might be running. All of the load balancing and network traffic routing between nodes is handled internally by AE.
The following sections provide additional details on how to maintain operation of AE in the event of a failure or network issue with any of the AE cluster nodes.
AE Master Node Failure
In the event that the AE Master node fails, a new AE Master node can be provisioned and restored from the most recent AE backup. The mean time to recovery (MTTR) for an AE Master node is 30 minutes. Refer to the AE documentation for more information about the backup and restore procedure.
AE Kubernetes Master Node Failure
In the event that an AE Kubernetes Master node fails, the AE cluster will continue to operate without interruption. The AE cluster can be restored to a healthy state by provisioning a new AE Kubernetes Master node, which restores the original level of redundancy, that is, the original number of replicated Kubernetes API servers and etcd cluster members.
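The reason a cluster keeps operating after losing one Kubernetes Master node comes down to majority arithmetic: etcd stays available for writes only while a strict majority of its members can reach each other. The following is a minimal sketch of that arithmetic, not AE code. The function names are hypothetical, and the three-member figure assumes that one etcd member runs on each of Nodes 1 through 3, as the diagram suggests.

```python
def etcd_quorum(members: int) -> int:
    """Smallest number of members that forms a strict majority."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """Number of members that can fail while a majority remains."""
    return members - etcd_quorum(members)


# Assumption: one etcd member per Kubernetes master node (3 in the diagram).
for n in (1, 3, 5):
    print(f"{n} members: quorum={etcd_quorum(n)}, "
          f"tolerated failures={tolerated_failures(n)}")
```

Under these assumptions, a three-member cluster tolerates one failed member, while a single-member cluster tolerates none, which is consistent with high availability requiring at least three nodes.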
AE Worker Node Failure
In the event that any AE Worker node fails, the AE cluster will continue to operate without interruption, and core AE services and user sessions and deployments will automatically be restarted on another healthy worker node in the AE cluster within 1-2 minutes. Sessions and deployments will continue to run, and network traffic will be redirected automatically and transparently to the newly started container and node.
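To confirm which node has failed, one option is to inspect the node list that Kubernetes reports. The sketch below is an illustration, not an AE tool: it assumes you can obtain the JSON output of `kubectl get nodes -o json` on your cluster (whether `kubectl` is exposed to administrators depends on your AE installation), and the node names in the sample data are hypothetical. The helper `not_ready_nodes` is likewise a name introduced here.

```python
import json
from typing import List


def not_ready_nodes(node_list: dict) -> List[str]:
    """Return names of nodes whose "Ready" condition is not "True"."""
    names = []
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        if not ready:
            names.append(node["metadata"]["name"])
    return names


# Hypothetical, trimmed-down sample in the shape of a Kubernetes NodeList.
SAMPLE = json.loads("""
{
  "items": [
    {"metadata": {"name": "ae-worker-1"},
     "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
    {"metadata": {"name": "ae-worker-2"},
     "status": {"conditions": [{"type": "Ready", "status": "Unknown"}]}}
  ]
}
""")

print(not_ready_nodes(SAMPLE))  # prints ['ae-worker-2']
```

A node whose `Ready` condition is `False` or `Unknown` is treated as unhealthy here, since Kubernetes reports `Unknown` when it has lost contact with the node.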