Address non-monotonicity of steal time and other issues #2390

Open
wants to merge 5 commits into master

Conversation

travisdowns
Contributor

@travisdowns travisdowns commented Aug 15, 2024

This series primarily addresses the problem that on systems with low amounts of steal, steal time appears negative (i.e., the cumulative steal time counter goes down from sample to sample).

This is wrong on the face of it, and it also causes serious problems when the value is used as a metric in Prometheus, since the counter contract (monotonic increase) is violated. The violation triggers spurious "counter reset" detection in Prometheus and hence bogus, very large or very small steal time results in rate() (or similar) queries.

This is addressed in two ways:

  • We make the sleep time calculation more accurate (inaccurate sleep accounting is the underlying cause of negative steal), which reduces the error, and hence the "negativeness", of steal by a couple of orders of magnitude. After this change, steal time often rounds to 0 ms where it did not before.
  • Because the reduction above still does not prevent small negative steal completely, we change the implementation of the metric to effectively clamp steal at 0 during periods where the raw value was negative (a sketch follows below).
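
A minimal sketch of that clamping idea (the struct and names here are illustrative only, modeled on the diff discussed later in this PR, not the exact patch):

#include <algorithm>
#include <chrono>

// Sketch: accumulate only the positive deltas of the raw ("true") steal, so the
// value reported as a metric never decreases even when the raw value dips.
struct steal_clamp {
    std::chrono::nanoseconds _last_true_steal{0};
    std::chrono::nanoseconds _last_mono_steal{0};

    // true_steal = awake time (wall clock) - thread CPU time; may be slightly negative.
    std::chrono::nanoseconds report(std::chrono::nanoseconds true_steal) {
        auto mono_steal = _last_mono_steal +
                std::max(true_steal - _last_true_steal, std::chrono::nanoseconds{0});
        _last_true_steal = true_steal;  // remember the raw value for the next observation
        _last_mono_steal = mono_steal;  // monotonic value exposed as the metric
        return mono_steal;
    }
};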

The individual changes have further details. I am open to splitting commits that may be less popular or require more discussion into a different PR if it makes sense.

@@ -578,6 +578,8 @@ public:

steady_clock_type::duration total_idle_time();
Member

awake-busy
awake-idle
sleeping

Most propitious, we were just discussing this on our own customer issue.

r = io_pgetevents(_polling_io.io_context, 1, batch_size, batch, tsp, active_sigmask);
if (may_sleep) {
_r._total_sleep += sched_clock::now() - before_getevents;
}
Member

How come only one backend gets this adjustment?

Contributor Author

Probably because it's the one we use currently. This change is from a year or more ago and was dusted off. Let me check the other backends.

Contributor Author

Right, it's slowly coming back: updating the other reactor backends was part of why I never upstreamed this in the first place. Doing that now.

Contributor Author

In the last push this is fixed for all 3 reactor backends.

@avikivity
Member

The changes look reasonable, but isn't steal time terminally broken?

thread_cputime_clock measures both kernel and user time. (Wall time) - (sleep time) overlaps that, since some of the sleep time is spent in the sleeping system call. For epoll/linux-aio this would usually be tiny, but for io_uring it could be quite large since the same call is used both to push I/O requests and to sleep. Perhaps we should adjust the io_uring backend to only sleep if it has no events to push.
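
A hypothetical numeric illustration of that overlap (all numbers made up, purely to show how the sign goes negative):

#include <chrono>

using namespace std::chrono_literals;

int main() {
    // Hypothetical 100 ms sample on a host with no real steal (illustrative numbers only).
    auto wall           = 100ms;
    auto actual_sleep   = 40ms;  // time truly descheduled in the blocking call
    auto measured_sleep = 45ms;  // the bracketed interval also caught ~5 ms of syscall/reap work
    auto cpu            = 60ms;  // CLOCK_THREAD_CPUTIME: user + kernel, includes those 5 ms

    auto awake = wall - measured_sleep;  // 55 ms, under-counted (really 60 ms)
    auto steal = awake - cpu;            // 55 ms - 60 ms = -5 ms: spurious negative steal
    (void)actual_sleep; (void)steal;
    return 0;
}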

@travisdowns
Contributor Author

travisdowns commented Aug 19, 2024

thread_cputime_clock measures both kernel and user time. (Wall time) - (sleep time) overlaps that, since some of the sleep time is spent in the sleeping system call. For epoll/linux-aio this would usually be tiny

Yes, exactly: that is the underlying flaw, which this series only improves quantitatively but does not eliminate. Another "bug" is that we can't actually tell whether we are going to sleep in the system call: may_sleep can be true, but we may find IO right away, in which case the getevents call returns immediately and we did not sleep at all, yet all of that work gets counted as sleep; this can repeat at high frequency for some IO patterns. However, in practice steal has been quite useful for us in a variety of incidents, even in its flawed state.

That said, for aio specifically this adjustment of the timing points gave good results in practice. I don't have good numbers here since this is an old change, but based on my long-ago testing, pushing down the sleep measurement reduced negative steal to a very small amount, often rounding to 0 (in ms) over the sample period. If there is almost any "true" steal, it is now almost always enough to overwhelm the negative part. In practice, under load we almost always see significant "true" steal even if nothing else is running, due to the syscall thread which is constantly working to handle EAGAIN IOs :(.

This is why we need the second part of this change, which simply discards negative steal periods (note that this is different from merely preventing the steal counter from going down).

but for io_uring it could be quite large since the same call is used both to push I/O requests and to sleep. Perhaps we should adjust the io_uring backend to only sleep if it has no events to push.

Is that true? Currently it looks like we still use two calls in the uring backend: io_uring_submit to push and io_uring_wait_cqes to wait. There were some changes in this area, I think in order to allow a single syscall, but they were reverted; even with that change, this method still had the separate submit + wait calls.

@travisdowns
Contributor Author

Here are some results for a really bad case for the existing steal time calculation. This is running io_tester with this patch (which prints out steal time metrics periodically) and this config:

- name: steal_printer
  shards: all
  type: steal
  shard_info:
    parallelism: 1
    think_time: 10ms

- name: big_writes
  shards: all
  type: seqwrite
  shard_info:
    parallelism: 1
    reqsize: 1kB
    # shares: 10
    think_time: 0us

and running it like:

ninja io_tester && apps/io_tester/io_tester --duration=10 --conf apps/io_tester/conf.yaml --smp=2 --reactor-backend=linux-aio --idle-poll-time-us=0 --poll-aio=1

Prior to this change, the output is:

Starting evaluation...
0 steal: -42.983 ms awake: 56.688 cpu: 99.671
1 steal: -41.848 ms awake: 57.782 cpu: 99.630
0 steal: -42.842 ms awake: 56.878 cpu: 99.721
1 steal: -43.470 ms awake: 56.266 cpu: 99.737
1 steal: -43.937 ms awake: 55.791 cpu: 99.727
0 steal: -44.238 ms awake: 55.531 cpu: 99.769
1 steal: -42.527 ms awake: 57.136 cpu: 99.663

I.e., every 100 ms we see roughly -40 ms of (negative) steal, because we think we are awake for only ~60 ms. Of course, we know that this workload with 0 think time is actually awake 100% of the time, because as soon as any IO finishes we submit another one.

After the fix, the output is:

Starting evaluation...
0 steal:  0.445 ms awake: 100.129 cpu: 99.684
1 steal:  0.415 ms awake: 100.097 cpu: 99.682
0 steal:  0.374 ms awake: 100.135 cpu: 99.761
1 steal:  0.396 ms awake: 100.126 cpu: 99.729
0 steal:  0.419 ms awake: 100.147 cpu: 99.728
1 steal:  0.414 ms awake: 100.121 cpu: 99.707
0 steal:  0.428 ms awake: 100.151 cpu: 99.723
1 steal:  0.413 ms awake: 100.133 cpu: 99.720
0 steal:  0.442 ms awake: 100.152 cpu: 99.710
1 steal:  0.420 ms awake: 100.119 cpu: 99.699
...

I.e., we correctly identify that we are awake all the time, and steal is low and positive (this is probably real steal; it's running on my noisy laptop).

When I run a stress -c worker thread using 100% of the CPU on shard 0, we get output like this (after the fix):

Starting evaluation...
1 steal:  0.424 ms awake: 100.137 cpu: 99.713
0 steal: 63.933 ms awake: 120.246 cpu: 56.314
1 steal:  0.595 ms awake: 100.165 cpu: 99.570
0 steal: 48.330 ms awake: 106.006 cpu: 57.676
1 steal:  0.619 ms awake: 100.153 cpu: 99.534
0 steal: 63.599 ms awake: 118.001 cpu: 54.402
1 steal:  0.445 ms awake: 100.135 cpu: 99.690
0 steal: 48.325 ms awake: 105.988 cpu: 57.663
1 steal:  0.578 ms awake: 100.156 cpu: 99.577
0 steal: 63.855 ms awake: 117.987 cpu: 54.132
1 steal:  0.528 ms awake: 100.166 cpu: 99.638
0 steal: 48.104 ms awake: 106.009 cpu: 57.906
1 steal:  0.496 ms awake: 100.147 cpu: 99.651
0 steal: 63.986 ms awake: 117.964 cpu: 53.978

I.e., it is correctly showing ~50% steal on shard 0, as we'd expect from a single competing 100% CPU thread. Pre-fix, it varies from 20% to 40% steal.

@travisdowns
Contributor Author

travisdowns commented Aug 20, 2024

For io_uring the benefit is much smaller on that specific test, probably because the overall io_uring behavior is quite different: it does not have the "poll-aio=1" behavior, so it does go to sleep even when disk IO is outstanding, unlike the aio backend by default.

Here's the same output for io_uring (with 1000ms output interval now to provide some more averaging as the numbers bounce around a lot):

pre-fix:

0 steal: -69.508 ms awake: 95.746 cpu: 165.254
1 steal: -82.224 ms awake: 117.205 cpu: 199.429
0 steal: -69.770 ms awake: 97.488 cpu: 167.258
1 steal: -82.169 ms awake: 116.266 cpu: 198.435
0 steal: -70.218 ms awake: 96.821 cpu: 167.039
1 steal: -74.251 ms awake: 103.012 cpu: 177.262
0 steal: -70.684 ms awake: 99.314 cpu: 169.998
1 steal: -66.691 ms awake: 90.682 cpu: 157.372
0 steal: -68.950 ms awake: 95.269 cpu: 164.219
1 steal: -69.599 ms awake: 94.950 cpu: 164.549
0 steal: -73.325 ms awake: 104.437 cpu: 177.763
1 steal: -69.886 ms awake: 95.770 cpu: 165.656
0 steal: -72.801 ms awake: 105.222 cpu: 178.023
1 steal: -71.757 ms awake: 101.039 cpu: 172.796
0 steal: -69.816 ms awake: 100.048 cpu: 169.864
1 steal: -73.340 ms awake: 105.095 cpu: 178.435
0 steal: -69.226 ms awake: 97.243 cpu: 166.469
1 steal: -71.744 ms awake: 98.021 cpu: 169.765

post-fix:

Starting evaluation...
0 steal: -36.019 ms awake: 136.256 cpu: 172.275
1 steal: -38.107 ms awake: 141.260 cpu: 179.367
0 steal: -34.019 ms awake: 128.660 cpu: 162.679
1 steal: -33.000 ms awake: 123.472 cpu: 156.472
1 steal: -33.379 ms awake: 124.124 cpu: 157.503
0 steal: -37.188 ms awake: 134.169 cpu: 171.357
1 steal: -40.808 ms awake: 141.996 cpu: 182.804
0 steal: -33.975 ms awake: 126.294 cpu: 160.270
1 steal: -41.464 ms awake: 143.828 cpu: 185.292
0 steal: -33.172 ms awake: 123.730 cpu: 156.902
1 steal: -34.853 ms awake: 121.107 cpu: 155.960
0 steal: -35.433 ms awake: 130.999 cpu: 166.432
1 steal: -34.941 ms awake: 128.392 cpu: 163.333
0 steal: -35.278 ms awake: 130.750 cpu: 166.028
1 steal: -33.433 ms awake: 126.931 cpu: 160.364
0 steal: -33.842 ms awake: 129.780 cpu: 163.622
1 steal: -31.123 ms awake: 119.651 cpu: 150.773
0 steal: -36.534 ms awake: 135.890 cpu: 172.424

So there is a definite reduction in false steal, from about 6-7% to about 3.5%, but it's still significant for that backend.

The negative steal is about the same for io_uring even without the IO workload (or without the --idle-poll-time-us=0 arg), and the improvement from this fix is about the same.

@travisdowns
Contributor Author

travisdowns commented Aug 20, 2024

I wanted to add that for all backends we will in some cases (when we get all the way to the underlying backend "wait for events" function) make 2 timestamp calls instead of 1, since previously the "after" timestamp was shared with the idle_end logic, but now it is not.

However, in other cases we will take 0 timestamps instead of 1, because in try_sleep we abandon the attempt to sleep in many cases and never get to the sleep location. For aio I expect this latter case to be very common since we don't go to sleep if there is any outstanding IO in the default --poll-aio=1 mode. For other backends it is less clear.

Both of these cases are on the "idle" or "about to sleep" paths though so probably less performance sensitive.

@travisdowns
Contributor Author

@avikivity or @xemul let me know if there's anything I can do to help move this along or also if it's DOA etc.

@avikivity
Member

@avikivity or @xemul let me know if there's anything I can do to help move this along or also if it's DOA etc.

@xemul can you handle it?

@travisdowns
Contributor Author

@xemul any hope that you can take a peek at this one? In our experience this is a big improvement to the accuracy and hence usefulness of the steal time metric.

@xemul
Contributor

xemul commented Oct 14, 2024

Sorry for losing track of this one :( I'll review it shortly

sm::make_counter("awake_time_ms_total", [this] () -> int64_t { return total_awake_time() / 1ms; },
sm::description("Total reactor awake time (wall_clock)")),
sm::make_counter("cpu_used_time_ms", [this] () -> int64_t { return total_cpu_time() / 1ms; },
sm::description("Total reactor thread CPU time (from CLOCK_THREAD_CPUTIME)")),
Contributor

OK, so in these terms:

awake-busy
awake-idle
sleeping

The (existing) cpu_busy_time is awake_busy, and the (new) awake_time_ms_total is awake_busy + awake_idle. Is that correct?

Contributor Author

Yes, exactly.
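
Spelled out with made-up numbers, the bookkeeping being confirmed here is roughly the following (a sketch, not the actual Seastar code; metric names from the hunk above):

#include <chrono>

using namespace std::chrono_literals;

int main() {
    // Hypothetical 100 ms wall-clock interval (illustrative numbers only).
    auto awake_busy = 70ms;                // existing cpu_busy_time
    auto awake_idle = 20ms;                // polling for work without making progress
    auto sleeping   = 10ms;                // blocked in the OS

    auto awake = awake_busy + awake_idle;  // new awake_time_ms_total: 90 ms
    auto wall  = awake + sleeping;         // 100 ms elapsed
    auto cpu   = 90ms;                     // cpu_used_time_ms (CLOCK_THREAD_CPUTIME)

    auto steal = awake - cpu;              // ~0 ms: the reactor got all the CPU it asked for
    (void)wall; (void)steal;
    return 0;
}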

// But what we have here should be good enough and at least has a well defined meaning.
return std::chrono::duration_cast<std::chrono::nanoseconds>(now() - _start_time - _total_sleep) -
std::chrono::duration_cast<std::chrono::nanoseconds>(thread_cputime_clock::now().time_since_epoch());
return total_awake_time() - total_cpu_time();
Contributor

I'd really appreciate it if this change went into the previous (or a separate, whichever is simpler for you) patch. It has nothing to do with updating the way _total_sleep is calculated; it just re-uses the helper methods introduced by the previous change.

Contributor Author

Will do.

Contributor Author

I have moved this to the prior commit in the most recent push.

auto mono_steal = _last_mono_steal + std::max(true_steal - _last_true_steal, 0ns);

_last_true_steal = true_steal;
_last_mono_steal = mono_steal;
Contributor

In other words, the mono_steal is the sum of all positive changes in true_steal, and the "change" is calculated only at times when this function is called, right?

Contributor Author

Yes, exactly. It does introduce an "observation effect" in some sense: the result can differ depending on how often you observe the metric. It's the best I could come up with, though, and it seems to work well in practice, since the observation points are exactly where you care about the value.
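
To make that concrete, here is a small self-contained illustration (made-up numbers; the clamped() helper just mirrors the snippet above) of how the reported value depends on the sampling points:

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <vector>

using namespace std::chrono_literals;

// Apply the clamping rule above to a series of raw ("true") steal samples.
static std::chrono::nanoseconds clamped(const std::vector<std::chrono::nanoseconds>& samples) {
    std::chrono::nanoseconds last{0}, mono{0};
    for (auto s : samples) {
        mono += std::max(s - last, std::chrono::nanoseconds{0});
        last = s;
    }
    return mono;
}

int main() {
    // Raw cumulative steal dips to -3 ms and then recovers to +5 ms.
    std::vector<std::chrono::nanoseconds> raw{0ms, -3ms, 5ms};
    std::printf("sampled at every point: %lld ms\n", (long long)(clamped(raw) / 1ms));  // 8 ms
    std::printf("sampled first and last: %lld ms\n",
                (long long)(clamped({raw.front(), raw.back()}) / 1ms));                 // 5 ms
    return 0;
}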

Contributor

That's unfortunate :( Let's try to keep this "update rate" under some control. I can imagine two ways, neither of which is perfect either.

First -- in addition to this place, update the steal time (or maybe just the mono-steal part) when the reactor stops sleeping.
Second -- arm a (lowres?) timer that updates the steal time, so this getter only reports what has been accumulated so far.

Contributor Author

The main problem I see is that the steal time calculation is quite a bit more expensive than the other time-related calculations, since it involves thread CPU time, which requires a system call.

I do agree that both of these suggestions are more accurate in some sense (basically, the faster the update rate, the more accurate), but the only/primary purpose of steal time is to be reported as a metric, so tying it to the metrics pull rate seems like a reasonable match. In practice the amount of error is very small after this change, at least for the aio backend, so we can be a bit looser with the update strategy. The typical interesting case is sustained non-zero steal, and this approach works fine there.

reactor::_total_sleep, a std::chrono::duration, was left uninitialized
in the constructor, potentially leading to wild values.
@travisdowns
Contributor Author

b6fac76 is purely a rebase to the tip of master.

Conceptually at all wall-clock moments the reactor is in one of three
states:

awake-busy
awake-idle
sleeping

Roughly speaking, awake-busy is when the reactor has tasks to run or IO
to submit or reap, or there are messages waiting for this shard on the
cross-shard queues. This is the time counted as utilized in
"reactor utilization".

awake-idle means that the reactor is still running (or wanting to run
on the CPU) but does not have any work to do _now_; rather, it spends
its time polling the cross-shard queues for new work, checking whether
IO has arrived to reap, etc. From the OS point of view it is still
demanding CPU time just like in the awake-busy state, but in the
internal reactor accounting we know it is not making forward progress
on work, just waiting for work to arrive.

sleeping - after some time in the awake-idle state, and if certain
other conditions are met, the reactor may actually go to sleep. In this
state it uses an OS blocking call like io_getevents with a timeout to
deschedule itself from the CPU and be woken only when additional work
arrives (detected by the OS blocking mechanism). In this state the
reactor is not using additional CPU.

Having a clear view of these states in the reactor metrics is useful,
and this change adds all three metrics, with names similar to the
above. We track sleep time by taking timestamps when we start and stop
"sleep". This accounting is key for things like steal time, which
subtracts sleep time from real time to determine how long the reactor
wanted to run.
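
A rough sketch of the shape this series moves toward (generic; the struct
and function names here are hypothetical, not the actual Seastar code, and
the real change is per reactor backend, as in the linux-aio hunk earlier
in this PR):

#include <chrono>

// Sketch: bracket only the blocking wait itself, and only charge the
// interval to the sleep accumulator when we actually intended to block.
using sched_clock = std::chrono::steady_clock;

struct sleep_tracker {
    sched_clock::duration total_sleep{};

    template <typename BlockingWait>
    auto wait_and_account(bool may_sleep, BlockingWait&& wait) {
        auto before = sched_clock::now();
        auto r = wait();                                 // e.g. a getevents-style call with a timeout
        if (may_sleep) {
            total_sleep += sched_clock::now() - before;  // still counted even if we woke immediately
        }
        return r;
    }
};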

Currently we take the start timestamp far too early, and also in many
cases where we do not actually sleep: we take it before we "reap events
+ maybe sleep", but often we do not sleep yet still count the interval
as sleep time. So significant work can happen in time that is recorded
as "sleep".

One specific effect of this is that steal time can be negative, because
of the way steal time is calculated: inflated sleep time leads to the
CPU time used (rusage) being greater than the reactor awake time (wall
clock based).

After this change, steal time can still be slightly negative but the
effect is reduced by 1 or 2 orders of magnitude.
Six years ago this function was added/emptied, but it hasn't had
anything added since to give it a purpose in life, so it seems like a
good time to remove it.
total_steal_time() can decrease in value from call to call, a violation
of the rule that a counter metric must never decrease.

This happens because steal time is "awake time (wall clock)" minus
"cpu thread time (get_rusage style CPU time)". The awake time is itself
composed of "time since reactor start" minus "accumulated sleep time",
but it can be under-counted because sleep time is over-counted: it is
not possible to determine the true sleep time exactly; we can only take
timestamps before and after a period that we think might involve a
sleep.

Currently, sleep is over-counted even more than the inherent error
described above, as it is measured around a region that includes
significant non-sleep work.

The result is that when there is little to no true steal, CPU time will
exceed measured awake wall clock time, resulting in negative steal.

This change "fixes" this by enforcing that steal time is monotonically
increasing. This occurs at measurement time by checking if "true steal"
(i.e., the old definition of steal) has increased since the last
measurement and adding that delta to our monotonic steal counter if so.
Otherwise the delta is dropped.

While not totally ideal, this leads to a useful metric which mostly
clamps away the error related to negative steal time and, more
importantly, avoids the catastrophic failure of PromQL functions when
they are applied to a non-monotonic counter.

Fixes scylladb#1521.
Fixes scylladb#2374.
@travisdowns
Contributor Author

@xemul all feedback addressed or at least responded to.
