Trigger manual failover on SIGTERM / shutdown to cluster primary #1091

enjoy-binbin · 2024-09-30T04:16:25Z

When a primary disappears, its slots are not served until an automatic
failover happens. It takes about n seconds (node timeout plus some seconds).
That is too long for us to go without accepting writes.

If the host machine is about to shut down for any reason, the processes
typically get a SIGTERM and have some time to shut down gracefully. In
Kubernetes, this is 30 seconds by default.

When a primary receives a SIGTERM or a SHUTDOWN, let it trigger a failover
to one of the replicas as part of the graceful shutdown. This can reduce
the unavailability time. Without it, the replica has to detect the
primary failure within the node timeout before initiating an election.
With it, the replica can initiate an election right away, win it, and
gossip the result.

This closes #939.

Signed-off-by: Binbin <[email protected]>

codecov · 2024-09-30T04:31:34Z

Codecov Report

Attention: Patch coverage is 95.65217% with 1 line in your changes missing coverage. Please review.

Project coverage is 70.71%. Comparing base (0c49053) to head (df0ef8d).

Files with missing lines	Patch %	Lines
src/server.c	91.66%	1 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable    #1091      +/-   ##
============================================
+ Coverage     70.61%   70.71%   +0.09%     
============================================
  Files           114      114              
  Lines         61717    61737      +20     
============================================
+ Hits          43583    43657      +74     
+ Misses        18134    18080      -54

Files with missing lines	Coverage Δ
src/cluster_legacy.c	`86.19% <100.00%> (-0.10%)`	⬇️
src/config.c	`78.69% <ø> (ø)`
src/server.h	`100.00% <ø> (ø)`
src/server.c	`88.67% <91.66%> (+0.01%)`	⬆️

... and 17 files with indirect coverage changes

zuiderkwast

Nice! Thanks for doing this.

The PR description can be updated to explain the solution. Now it is just copy-pasted from the issue. :)

I'm thinking that doing failover in finishShutdown() is maybe too late. finishShutdown is only called when all replicas already have replication offset equal to the primary (checked by isReadyToShutdown()), or after timeout (10 seconds). If one replica is very slow, it will delay the failover. I think we can do the manual failover earlier.

This is the sequence:

1. SHUTDOWN or SIGTERM calls prepareForShutdown(). Here, we pause clients for writing and start waiting for the replicas' offsets.
2. In serverCron(), we check isReadyToShutdown(), which checks whether all replicas have repl_ack_off == primary_repl_offset. If yes, finishShutdown() is called; otherwise we wait more.
3. finishShutdown().

I think we can send CLUSTER FAILOVER FORCE to the first replica which has repl_ack_off == primary_repl_offset. We can do it in isReadyToShutdown(), I think. (We can rename it to indicate that it does more than check if ready.) Then, we also wait for it to send the failover auth request and for the primary to vote before isReadyToShutdown() returns true.

What do you think?
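
The two checks discussed above (all replicas caught up versus the first replica that has caught up) can be sketched as follows. This is a simplified model, not the Valkey implementation: `replica_t`, `find_caught_up_replica` and `all_replicas_caught_up` are hypothetical names, and only `repl_ack_off` and `primary_repl_offset` are taken from the discussion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of a replica as seen by the primary. */
typedef struct {
    const char *name;
    long long repl_ack_off; /* last offset acknowledged by this replica */
} replica_t;

/* Returns the first replica whose acknowledged offset equals the primary's
 * replication offset, or NULL if no replica has caught up yet. */
static const replica_t *find_caught_up_replica(const replica_t *replicas, size_t n,
                                               long long primary_repl_offset) {
    assert(replicas != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (replicas[i].repl_ack_off == primary_repl_offset) return &replicas[i];
    }
    return NULL;
}

/* Models the check described for isReadyToShutdown(): true only when every
 * replica has acknowledged the primary's replication offset. */
static bool all_replicas_caught_up(const replica_t *replicas, size_t n,
                                   long long primary_repl_offset) {
    assert(replicas != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (replicas[i].repl_ack_off != primary_repl_offset) return false;
    }
    return true;
}
```

With one slow and one fast replica, `find_caught_up_replica` already returns the fast one while `all_replicas_caught_up` is still false, which is the window in which an earlier failover could be triggered.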

zuiderkwast · 2024-10-01T14:38:53Z

src/server.c

+    if (server.auto_failover_on_shutdown && server.cluster_enabled && best_replica) {
+        /* Sending a CLUSTER FAILOVER FORCE to the best replica. */
+        const char *buf = "*3\r\n$7\r\nCLUSTER\r\n$8\r\nFAILOVER\r\n$5\r\nFORCE\r\n";
+        if (connWrite(best_replica->conn, buf, strlen(buf)) == (int)strlen(buf)) {


This is in finishShutdown(), just before the primary does exit().

Is there a risk that the written command has not yet been fully sent to the replica when the primary exits? If not, can we do it in an earlier shutdown stage and make sure the replica has received the command? The replica doesn't send any OK reply to the primary on the replication stream, but maybe the primary could wait for the failover auth request from the replica on the cluster bus before the primary shuts down?

Btw, maybe this code should be in clusterHandleServerShutdown() which is called a few lines below:

/* Handle cluster-related matters when shutdown. */
if (server.cluster_enabled) clusterHandleServerShutdown();

There may be this risk. I am not sure of the details; I thought connWrite would give this guarantee. Waiting for the failover auth request seems like a good idea. Is the waiting time also bounded by shutdown-timeout? In that case, I am worried that during the waiting period the primary could be reconfigured to become a replica. Can this happen?

clusterHandleServerShutdown: right, I forgot about it. It is a good place to hold this logic.
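
The hand-written buffer in the patch is a RESP array of three bulk strings. As a rough sketch of where those bytes come from, the helper below builds the same payload from an argument list. `resp_encode` is a hypothetical name used only for illustration; it is not a Valkey function, and it only handles arguments without embedded NUL bytes.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: encodes argv as a RESP array of bulk strings into buf.
 * Layout: "*<argc>\r\n" followed by "$<len>\r\n<arg>\r\n" for each argument.
 * Returns the number of bytes written (excluding the trailing NUL), or -1 if
 * buf is too small. */
static int resp_encode(char *buf, size_t buflen, int argc, const char **argv) {
    assert(buf != NULL && argv != NULL && argc > 0);
    size_t off = 0;
    int n = snprintf(buf, buflen, "*%d\r\n", argc);
    if (n < 0 || (size_t)n >= buflen) return -1;
    off += (size_t)n;
    for (int i = 0; i < argc; i++) {
        n = snprintf(buf + off, buflen - off, "$%zu\r\n%s\r\n", strlen(argv[i]), argv[i]);
        if (n < 0 || (size_t)n >= buflen - off) return -1;
        off += (size_t)n;
    }
    return (int)off;
}
```

Encoding `{"CLUSTER", "FAILOVER", "FORCE"}` this way yields the same string as the literal in the patch, so the length prefixes `$7`, `$8` and `$5` can be checked mechanically rather than by eye.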

zuiderkwast · 2024-10-01T14:40:29Z

src/server.c

+    }
+
+    if (server.auto_failover_on_shutdown && server.cluster_enabled && !best_replica) {
+        serverLog(LL_WARNING, "Unable to find a replica to perform an auto failover on shutdown.");
+    }


Probably this should be an else to the above, so we avoid repeating the same logic.
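
A minimal sketch of the suggested shape: the shared precondition is tested once and the two outcomes become the branches of an if/else. The enum and function names are hypothetical stand-ins for the real side effects (sending the command and logging the warning), and the parameters mirror the fields used in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the two side effects in the patch. */
typedef enum { ACTION_NONE, ACTION_SEND_FAILOVER, ACTION_WARN_NO_REPLICA } shutdown_action;

/* Checks the shared precondition once, then branches on best_replica. */
static shutdown_action choose_shutdown_action(bool auto_failover_on_shutdown, bool cluster_enabled,
                                              const void *best_replica) {
    if (auto_failover_on_shutdown && cluster_enabled) {
        if (best_replica) {
            return ACTION_SEND_FAILOVER;   /* would send CLUSTER FAILOVER FORCE */
        } else {
            return ACTION_WARN_NO_REPLICA; /* would log the warning */
        }
    }
    return ACTION_NONE;
}

/* Usage example: each combination maps to exactly one action. */
static void choose_shutdown_action_demo(void) {
    int replica = 0; /* any non-NULL pointer stands for a chosen replica */
    assert(choose_shutdown_action(true, true, &replica) == ACTION_SEND_FAILOVER);
    assert(choose_shutdown_action(true, true, NULL) == ACTION_WARN_NO_REPLICA);
    assert(choose_shutdown_action(false, true, &replica) == ACTION_NONE);
}
```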

zuiderkwast · 2024-10-01T14:44:30Z

tests/unit/cluster/auto-failover-on-shutdown.tcl

+
+proc test_main {how} {
+    test "auto-failover-on-shutdown will always pick a best replica and send CLUSTER FAILOVER - $how" {
+        set primary [srv 0 client]
+        set replica1 [srv -3 client]


This function requires some nodes to be primary and some to be replica. Please add some comment about how the cluster needs to be initiated for this function to work.

zuiderkwast · 2024-10-01T14:46:45Z

tests/unit/cluster/auto-failover-on-shutdown.tcl

+        # Wait for the replica2 to become a primary.
+        wait_for_condition 1000 50 {
+            [s -6 role] eq {master}


Can we make sure we wait a shorter time than the node timeout, so we actually test that this is faster than an automatic failover? This is 50 seconds, right? We could reduce it to 10 seconds and set the node timeout of the nodes to, for example, 20 seconds when we start them?

We actually have verify_log_message later to check the logs.
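
The arithmetic behind the question can be sketched as below, assuming the two arguments of `wait_for_condition` are a retry count and a delay in milliseconds (which matches the "50 seconds" reading above). The function names are hypothetical, and the 10 second and 20 second figures are the reviewer's example values, not settings taken from the test file.

```c
#include <assert.h>
#include <stdbool.h>

/* Upper bound, in milliseconds, of a polling loop that retries max_tries
 * times with delay_ms between attempts. */
static long max_wait_ms(long max_tries, long delay_ms) {
    assert(max_tries >= 0 && delay_ms >= 0);
    return max_tries * delay_ms;
}

/* The wait only demonstrates a shutdown-triggered failover if its budget
 * runs out before the node timeout, i.e. before an automatic failover
 * could have started. */
static bool proves_faster_than_auto_failover(long max_tries, long delay_ms, long node_timeout_ms) {
    return max_wait_ms(max_tries, delay_ms) < node_timeout_ms;
}
```

Under these assumptions, `wait_for_condition 1000 50` allows up to 50000 ms, which exceeds a 20000 ms node timeout, while a budget of 200 tries at 50 ms (10000 ms) stays below it.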

zuiderkwast · 2024-10-01T14:49:11Z

src/cluster_legacy.c

+            /* todo: see if this is needed. */
+            /* This is a failover triggered by my primary, let's counts its vote. */
+            if (server.cluster->mf_is_primary_failover) {
+                server.cluster->failover_auth_count++;
+            }


It's not very good that we add some conditional logic about how we are counting the votes. It makes the voting algorithm more complex to analyze.

I understand we need this special case if the primary exits immediately after sending CLUSTER FAILOVER FORCE. Maybe we can avoid it if the primary waits for the failover auth requests and actually votes before it shuts down?

Yes, it is not very good, so I added a todo. I am also not sure about this logic. In fact, we can skip it and the failover will still work, since I think we will still have enough votes. I added this logic just to open the discussion, so that we can see whether we should handle this case.

enjoy-binbin · 2024-10-06T06:34:00Z

The PR description can be updated to explain the solution. Now it is just copy-pasted from the issue. :)

The issue description is good and very detailed, so I copied it. I will update it later.

I'm thinking that doing failover in finishShutdown() is maybe too late. finishShutdown is only called when all replicas already have replication offset equal to the primary (checked by isReadyToShutdown()), or after timeout (10 seconds). If one replica is very slow, it will delay the failover. I think we can do the manual failover earlier.

Yeah, a failover as soon as possible is good, but isn't it true that the primary is only down after it actually exits? So in this case, if a replica is slow and does not get the chance to catch up with the primary, and another replica triggers the failover, the slow replica will need a full sync when it is reconfigured.

I think we can send CLUSTER FAILOVER FORCE to the first replica which has repl_ack_off == primary_repl_offset. We can do it in isReadyToShutdown(), I think. (We can rename it to indicate that it does more than check if ready.) Then, we also wait for it to send the failover auth request and for the primary to vote before isReadyToShutdown() returns true.

So let me sort it out again: you are suggesting that if one replica has already caught up with the offset, we should trigger a failover immediately?

I guess that also makes sense in this case.

zuiderkwast · 2024-10-06T19:45:00Z

if a replica is slow and it does not have the chance to catch up the primary, and then the other replica trigger the failover, so the slow replica will need a full sync when it doing the reconfiguration.

I didn't think about this. The replica can't do psync to the new primary after failover? If it can't, then maybe you're right that the primary should wait for all replicas, at least for some time, to avoid full sync.

So, wait for all, then trigger manual failover. If you want, we can add another wait after that (after "finish shutdown"), so the primary can vote for the replica before exit. Wdyt?

enjoy-binbin · 2024-10-18T09:35:04Z

Sorry for the late reply, i somehow missed this thread.

I didn't think about this. The replica can't do psync to the new primary after failover? If it can't, then maybe you're right that the primary should wait for all replicas, at least for some time, to avoid full sync.

Yes, I think this may happen. If the primary does not flush its output buffer to the slow replica, then during the reconfiguration the slow replica may use an old offset to psync with the new primary, which will cause a full sync. The probability should be small, though, since the primary calls flushReplicasOutputBuffers to write as much as possible before shutdown.

So, wait for all, then trigger manual failover. If you want, we can add another wait after that (after "finish shutdown"), so the primary can vote for the replica before exit. Wdyt?

wait for the vote, i think both are OK. Even if we don't wait, I think the replica will have enough votes. If we really want to, we can even wait until the replica successfully becomes primary before exiting... Do you have a final decision? I will do whatever you think is right.

zuiderkwast · 2024-10-18T09:54:41Z

wait for the vote, i think both are OK. Even if we don't wait, I think the replica will have enough votes. If we really want to, we can even wait until the replica successfully becomes primary before exiting... Do you have a final decision? I will do whatever you think is right.

I'm thinking if there are any corner cases, like if the cluster is too small to have quorum without the shutting down primary...

If it is simple, I prefer to let the primary wait and vote. Then we can avoid the server.cluster->mf_is_primary_failover variable. I don't like this variable and special case. :)

But if this implementation to wait for the vote will be too complex, then let's just skip the vote. I think it's also fine. Without this feature, we wait for automatic failover, which will also not have the vote from the already shutdown primary.
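
The small-cluster corner case can be made concrete with a short sketch. It assumes the usual majority rule for elections, where a replica needs the votes of more than half of the voting primaries. The function names are hypothetical and the model ignores everything except the vote count.

```c
#include <assert.h>
#include <stdbool.h>

/* Majority of the voting primaries (assumed rule: floor(n / 2) + 1). */
static int needed_quorum(int voting_primaries) {
    assert(voting_primaries > 0);
    return voting_primaries / 2 + 1;
}

/* True if a replica can still win the election when the shutting-down
 * primary never votes, i.e. the remaining primaries alone reach quorum. */
static bool can_win_without_old_primary(int voting_primaries) {
    return voting_primaries - 1 >= needed_quorum(voting_primaries);
}
```

Under this model, a cluster with three primaries needs two votes and the two remaining primaries are enough, whereas with one or two primaries the replica cannot reach quorum unless the old primary votes before it exits.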

enjoy-binbin requested a review from zuiderkwast September 30, 2024 04:16

enjoy-binbin force-pushed the shutdown_failover branch from 42dc65e to 6ab8888 Compare September 30, 2024 04:16

fix typo

4b49f03

Signed-off-by: Binbin <[email protected]>

enjoy-binbin added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label Sep 30, 2024

zuiderkwast reviewed Oct 1, 2024

Merge remote-tracking branch 'upstream/unstable' into shutdown_failover

f9ca731

Signed-off-by: Binbin <[email protected]>

add comment in the test

df0ef8d

Signed-off-by: Binbin <[email protected]>
