
[NEW] Trigger manual failover on SIGTERM to primary (cluster) #939

Open
zuiderkwast opened this issue Aug 24, 2024 · 7 comments · May be fixed by #1091

Comments

@zuiderkwast
Contributor

The problem/use-case that the feature addresses

When a primary disappears, its slots are not served until an automatic failover happens. That takes about 3 seconds (the node timeout plus a second or so), which is too long for us to be unable to accept writes.

If the host machine is about to shut down for any reason, its processes typically get a SIGTERM and have some time to shut down gracefully. In Kubernetes, this is 30 seconds by default.

Description of the feature

When a primary receives a SIGTERM, let it trigger a failover to one of the replicas as part of the graceful shutdown.

Alternatives you've considered

Our current solution is a wrapper process, a small script, that starts valkey. The wrapper receives the SIGTERM and can handle it. It's a workaround, though; it would be better to have this built in.
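A minimal sketch of such a wrapper in Python (the `request_failover`/`run_wrapped` names, the replica address, and the `valkey-cli` invocation are illustrative assumptions, not part of Valkey): on SIGTERM it first asks a chosen replica to take over, then forwards the signal so the server still shuts down gracefully.

```python
import signal
import subprocess

def request_failover(replica_host: str, replica_port: int) -> None:
    """Ask a chosen replica to take over. Assumes valkey-cli is on PATH
    and that the operator knows a suitable replica's address."""
    subprocess.run(
        ["valkey-cli", "-h", replica_host, "-p", str(replica_port),
         "CLUSTER", "FAILOVER"],
        check=False,
    )

def run_wrapped(cmd: list[str], on_term) -> int:
    """Run cmd (e.g. ["valkey-server", ...]) as a child process. On SIGTERM,
    call on_term() first (e.g. request_failover above), then forward the
    signal so the child can still shut down gracefully."""
    child = subprocess.Popen(cmd)

    def handler(signum, frame):
        on_term()              # trigger the failover before we go down
        child.terminate()      # SIGTERM to the child: graceful shutdown

    previous = signal.signal(signal.SIGTERM, handler)
    try:
        return child.wait()
    finally:
        signal.signal(signal.SIGTERM, previous)
```

Usage would look like `run_wrapped(["valkey-server"], lambda: request_failover("replica.example", 6379))`; the feature proposed in this issue would make such a wrapper unnecessary.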

Additional information

We use cluster mode.

@enjoy-binbin
Member

This is a good idea. In my fork, I have a similar feature (detect a disk error and trigger a failover). The basic idea is that the primary picks the best replica and sends it a CLUSTER FAILOVER (and waits for it). Do you like this approach? If so, I can try it.

@zuiderkwast
Contributor Author

the primary will pick a best replica and send a CLUSTER FAILOVER to it (and wait for it). Do you like this approach

Yes, this is straightforward. I like it.

I have another idea, similar but maybe faster(?) and more complex(?). The idea: the primary first pauses writes, then waits for the replica to replicate everything, and then sends CLUSTER FAILOVER FORCE. This avoids step 1 below. From the docs of CLUSTER FAILOVER:

  1. The replica tells the master to stop processing queries from clients.

  2. The master replies to the replica with the current replication offset.

  3. The replica waits for the replication offset to match on its side, to make sure it processed all the data from the master before it continues.

  4. The replica starts a failover, obtains a new configuration epoch from the majority of the masters, and broadcasts the new configuration.

  5. The old master receives the configuration update: unblocks its clients and starts replying with redirection messages so that they’ll continue the chat with the new master.

And for FORCE:

If the FORCE option is given, the replica does not perform any handshake with the master, that may be not reachable, but instead just starts a failover ASAP starting from point 4. This is useful when we want to start a manual failover while the master is no longer reachable.
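The pause-writes-then-FORCE idea can be sketched as a small simulation (the `pause_writes`/`repl_offset`/`ack_offset` hooks are hypothetical stand-ins; in the server this logic would live in the SIGTERM/shutdown path, not in client code):

```python
import time

def wait_for_offset(get_replica_ack, target_offset: int,
                    timeout: float = 10.0, poll: float = 0.05) -> bool:
    """Poll the replica's acknowledged offset until it catches up with
    the primary's replication offset, or give up after the timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_replica_ack() >= target_offset:
            return True
        time.sleep(poll)
    return False

def shutdown_failover(primary, replica) -> bool:
    """On SIGTERM: pause writes, wait until the replica has replicated
    everything, then tell it to fail over immediately. FORCE skips the
    replica/primary handshake (steps 1-3 above) and jumps to step 4."""
    primary.pause_writes()
    if wait_for_offset(replica.ack_offset, primary.repl_offset()):
        replica.send("CLUSTER", "FAILOVER", "FORCE")
        return True
    return False  # replica never caught up; fall back to a plain shutdown
```

The design point is that the primary, not the replica, decides when the offsets match, which is exactly what makes the plain handshake unnecessary.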

@enjoy-binbin
Member

The primary first pauses writes, then waits for the replica to replicate everything and then sends CLUSTER FAILOVER FORCE.

Yeah, this seems OK to me, and faster.

The difference, I think, is that in one case the replica decides the offset is OK and starts the failover, while in the other the primary tells the replica that it can start the failover.

The CLUSTER FAILOVER one:

  1. The primary detects the SIGTERM and, in serverCron (every 100ms), picks the best replica and sends it CLUSTER FAILOVER.
  2. The replica receives CLUSTER FAILOVER, tells the primary to stop writes (and respond with its offset), and waits, in clusterCron (every 100ms), for its own offset to catch up.
  3. The replica starts the failover.

Message flow: primary serverCron tick, primary sends CLUSTER FAILOVER, replica sends MFSTART, primary sends a PING, replica clusterCron tick, and the failover starts.
100ms + a command + an MFSTART + a PING + 100ms

The CLUSTER FAILOVER FORCE one:

  1. The primary detects the SIGTERM, stops writes, sends REPLCONF GETACK to all replicas, and waits for the responses.
  2. The primary receives the REPLCONF ACK, compares replica->repl_ack_off with primary_repl_offset, and if they match, sends CLUSTER FAILOVER FORCE to that replica.
  3. The replica starts the failover.

Message flow: primary serverCron tick, primary sends REPLCONF GETACK, replica sends REPLCONF ACK, primary sends CLUSTER FAILOVER FORCE, replica clusterCron tick, and the failover starts.
100ms + a command + a command + a command + 100ms
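For a rough comparison of the two breakdowns above: each sequence costs two cron ticks plus three messages, so with a small assumed round-trip time the 100ms cron ticks dominate either way (the numbers below are illustrative assumptions, not measurements):

```python
def worst_case_ms(cron_ticks: int, messages: int,
                  rtt_ms: float = 1.0, cron_ms: float = 100.0) -> float:
    """Rough worst-case delay: each cron tick can add up to cron_ms of
    waiting, and each message costs about one network round-trip."""
    return cron_ticks * cron_ms + messages * rtt_ms

# CLUSTER FAILOVER: serverCron + (FAILOVER, MFSTART, PING) + clusterCron
plain = worst_case_ms(cron_ticks=2, messages=3)
# CLUSTER FAILOVER FORCE: serverCron + (GETACK, ACK, FORCE) + clusterCron
force = worst_case_ms(cron_ticks=2, messages=3)
print(plain, force)  # both about 203ms with a 1ms RTT
```

By this crude model the message counts are the same; the FORCE variant's advantage is that the offset wait happens on the primary, which already has to wait for replicas during a graceful shutdown anyway.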

@zuiderkwast
Contributor Author

In shutdown, we already have a feature to stop writes and wait for replicas before shutting down. These two features will need to be combined. Let's discuss the details in a PR. :) Do you want to implement this?

@enjoy-binbin
Member

Yeah, I can do it this week.

@enjoy-binbin
Member

enjoy-binbin commented Sep 3, 2024

I don't have enough time to write the test right now. I guess you might want to take a look at it in advance, so here is the commit: enjoy-binbin@9777d01

I did some small manual testing locally and it seems to work. I will try to find time to finish the test code and the rest of it.

@zuiderkwast
Contributor Author

Don't worry. I hope we can have it for 8.2 so you have a few months to finish it. 😸

enjoy-binbin added a commit to enjoy-binbin/valkey that referenced this issue Sep 30, 2024
When a primary disappears, its slots are not served until an automatic
failover happens. That takes about n seconds (the node timeout plus a few
seconds), which is too long for us to be unable to accept writes.

If the host machine is about to shutdown for any reason, the processes
typically get a SIGTERM and have some time to shut down gracefully. In
Kubernetes, this is 30 seconds by default.

When a primary receives a SIGTERM or a SHUTDOWN, let it trigger a failover
to one of the replicas as part of the graceful shutdown. This can reduce
some unavailability time. For example the replica needs to sense the
primary failure within the node-timeout before initiating an election,
and now it can initiate an election quickly and win and gossip it.

This closes valkey-io#939.