Manager Node Failover
The Yellowbrick appliance was designed with redundancy in mind and has no single point of failure. Furthermore, the system can sustain multiple component failures and automatically fail over to redundant hardware that is detected to be operational. Depending on which component fails, the failover process may be fully transparent with no downtime, or it may require a small disruption to services (<60 seconds). Most failures are handled transparently with no disruption.
Failover Scenarios
You can inspect the system at any time to check the health of its components. You can use the System Management Console (SMC) or the Yellowbrick command line interface (ybcli) for this purpose.
Should a catastrophic event occur that causes a failover between the manager nodes, the failed manager node is taken out of the cluster and, if it is still responding, put into standby mode. Once the failure condition has been cleared, you can bring that manager node back into the cluster manually. This process is described in this section.
To place the system in maintenance mode, run the following ybcli command:
system maintenance on
To take the system out of maintenance mode, run the following ybcli command:
system maintenance off
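For example, a maintenance pass on the primary manager node might look like the following session sketch. The ssh step and hostname are illustrative; the prompt and commands are the ones shown elsewhere in this section:
ssh ybdadmin@yb100-mgr0.yellowbrickroad.io
ybcli
YBCLI (PRIMARY)> system maintenance on
(perform the maintenance work, then:)
YBCLI (PRIMARY)> system maintenance off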
You can run the following ybcli command at any time to check the status of the HA cluster:
system status
YBCLI (PRIMARY)> system status
Cluster nodes configured: 2
----------------------------
Node 1 (PRIMARY - LOCAL NODE ) : yb100-mgr0.yellowbrickroad.io -> ONLINE
Node 2 (SECONDARY - REMOTE NODE ) : yb100-mgr1.yellowbrickroad.io -> ONLINE
HW Platform : Tinman
Database system running : YES
Database system ready : YES (9/9 active/installed blades)
Database system read-only : NO
...
In this example, the yb100-mgr0 node is currently the primary node running all database resources. Besides yb100-mgr0, the other manager node, with a hostname of yb100-mgr1, is also online and ready to take over. In other words, yb100-mgr0 is the primary manager node and is serving all user requests; yb100-mgr1 is the secondary manager node and will automatically take over resources if yb100-mgr0 becomes unavailable.
If it becomes necessary to manually fail over from one manager node to the other, you can use the ybcli to complete the task.
- Start the ybcli as the ybdadmin user on either the primary or secondary manager node.
- Run the system status command to verify that the secondary manager node is present.
- Run the system failover command to perform a manual failover.
Note: For a manual failover, it can take 2 minutes before the system responds again at the floating IP address.
It is important to log into the primary manager node directly when running the system failover command because the floating IP will briefly be taken down as it moves between the nodes.
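For illustration, the invocation might look like the following sketch, assuming yb100-mgr0 is currently the primary manager node and is reachable over ssh as the ybdadmin user (the hostname is illustrative):
ssh ybdadmin@yb100-mgr0.yellowbrickroad.io
ybcli
YBCLI (PRIMARY)> system status
YBCLI (PRIMARY)> system failover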
On a system with the yb100-mgr0 and yb100-mgr1 manager nodes, failing over to yb100-mgr1 would look like this:
Initiating system failover
Monitoring completion. This can take 2 minutes. Notifications may appear. Standby...
System failover was successful. Yellowbrick database started.
Primary manager node is now: Remote node (yb100-mgr1.yellowbrickroad.io)
WARNING: A SYSTEM NODE ROLE CHANGE WAS DETECTED
Current roles
-------------
LOCAL NODE : SECONDARY (ACTIVE)
REMOTE NODE : PRIMARY (ACTIVE)
Now the yb100-mgr1 node is running all resources. Once the failover is complete, you can fail back to the original node, which in this case was yb100-mgr0. ybcli will automatically determine when it is safe to do so. If the system is not capable of failing over, ybcli reports an error to explain the problem. This may happen if large amounts of data are being replicated between the nodes. Use the system status command to determine when the process is complete.
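If you prefer to wait for completion from a shell rather than re-running system status by hand, a loop along these lines can work. This is a minimal sketch: the ssh target is illustrative, and it assumes your ybcli accepts a command on the invocation line (if your version is interactive-only, run system status from the YBCLI prompt instead):
# Poll the cluster until the database system reports ready again.
# Hostname, user, and the non-interactive "ybcli system status" call are assumptions; adjust for your site.
while ! ssh ybdadmin@yb100-mgr0.yellowbrickroad.io "ybcli system status" | grep -q "Database system ready.*YES"; do
    sleep 10
done
echo "Cluster reports the database system as ready."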