Manager Node Failover
The Yellowbrick appliance is designed with redundancy in mind and has no single point of failure. The system can sustain multiple component failures and automatically fail over to redundant hardware that is still operational. Depending on which component fails, the failover process may be fully transparent with no downtime, or it may require a brief disruption to services (less than 60 seconds). Most failures are handled transparently with no disruption.
Failover Scenarios
You can inspect the health of the system's components at any time, using either the System Management Console (SMC) or the Yellowbrick command line interface (ybcli).
Should a catastrophic event cause a failover between the manager nodes, the failed manager node is taken out of the cluster and, if it is still responding, placed in standby mode. Once the failure condition has cleared, you can manually bring that manager node back into the cluster, as described in this section.
During system maintenance, you should put the cluster into maintenance mode. You can do this either via the SMC or by using the following ybcli command:
system maintenance on
This command stops the cluster from responding to changes in the environment that would otherwise trigger a node failover. It also shuts down the Yellowbrick database stack. When maintenance is complete, you can return the system to normal operating mode by running the following ybcli command:
system maintenance off
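For example, a typical maintenance window run from the ybcli prompt on the primary manager node might look like the following sketch, which uses only the commands above (the prompt format matches the system status example later in this section):
YBCLI (PRIMARY)> system maintenance on
(perform the maintenance work)
YBCLI (PRIMARY)> system maintenance off
YBCLI (PRIMARY)> system status
Running system status afterward confirms that both manager nodes are ONLINE and that maintenance mode is reported as NO.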
HA Cluster Status
Run the following ybcli command at any time to check the status of the HA cluster:
system status
This command returns the status of various resources in the system. All of these resources always run on the same manager node: the primary manager node. For example:
YBCLI (PRIMARY)> system status
Cluster nodes configured: 2
----------------------------
Node 1 (PRIMARY - LOCAL NODE ) : yb100-mgr0.yellowbrickroad.io -> ONLINE
Node 2 (SECONDARY - REMOTE NODE ) : yb100-mgr1.yellowbrickroad.io -> ONLINE
Yellowbrick database running: YES
Yellowbrick database ready : YES
Blade parity : Enabled
Blade parity rebuilding : NO
Maintenance mode : NO
Software update in-progress : NO
Floating system IP : 10.22.110.10/24
In this example, the yb100-mgr0 node is currently the primary node running all database resources. The other manager node, yb100-mgr1, is also online and ready to take over. In other words, yb100-mgr0 is the primary manager node and is serving all user requests; yb100-mgr1 is the secondary manager node and will automatically take over resources if yb100-mgr0 becomes unavailable.
Manual Failover
If it becomes necessary to manually fail over from one manager node to the other, you can use the ybcli to complete the task.
1. Start the ybcli as the ybdadmin user on either the primary or secondary manager node.
2. Run the system status command to verify that the secondary manager node is present.
3. Run the system failover command to perform a manual failover.
Note: For a manual failover, it can take up to 2 minutes before the system responds again at the floating IP address.
It is important to log in to the primary node directly, by its own hostname or IP address rather than the floating IP, when running the system failover command, because the floating IP is briefly taken down as it moves between the nodes.
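For example, assuming SSH access to the manager nodes as the ybdadmin user (the access method here is an assumption; adapt it to your environment):
ssh ybdadmin@yb100-mgr0.yellowbrickroad.io   # the node's own hostname, not the floating IP
ybcli                                        # start the interactive CLI, then run: system failover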
Using the yb100-mgr0 and yb100-mgr1 systems, failing over to yb100-mgr1 would look like this:
YBCLI (PRIMARY)> system failover
Initiating system failover
Monitoring completion. This can take 2 minutes. Notifications may appear. Standby...
System failover was successful. Yellowbrick database started.
Primary manager node is now: Remote node (yb100-mgr1.yellowbrickroad.io)
WARNING: A SYSTEM NODE ROLE CHANGE WAS DETECTED
Current roles
-------------
LOCAL NODE : SECONDARY (ACTIVE)
REMOTE NODE : PRIMARY (ACTIVE)
Now the yb100-mgr1 node is running all resources. Once the failover is complete, you can fail back to the original node, which in this case is yb100-mgr0. The ybcli automatically determines when it is safe to do so. If the system cannot fail over, ybcli reports an error explaining the problem; this may happen, for example, when large amounts of data are being replicated between the nodes.
Use the system status command to determine when the process is complete.
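To fail back once the system reports that it is safe, run the same procedure from the new primary node. A minimal sketch of that session, with output omitted:
YBCLI (PRIMARY)> system status
YBCLI (PRIMARY)> system failover
YBCLI (PRIMARY)> system status
The first system status confirms that both manager nodes are ONLINE; system failover then returns the primary role to yb100-mgr0; the final system status verifies that the failback completed.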