Address non-monotonicity of steal time and other issues #2390

Open
wants to merge 5 commits into master

Conversation

travisdowns
Contributor

@travisdowns travisdowns commented Aug 15, 2024

This series primarily addresses the problem that on systems with low amounts of steal, steal time appears negative (i.e., the cumulative steal time counter goes down from sample to sample).

This is wrong on the face of it, and it also causes serious problems when the value is used as a metric in Prometheus, since the counter contract (monotonic increase) is violated. The violation triggers spurious "counter reset" detection in Prometheus and hence bogus, very large or very small steal time results in rate() (or similar) queries.

This is addressed in two ways:

  • We make the sleep time calculation more accurate (inaccurate sleep accounting is the underlying cause of negative steal), which reduces the error, and hence the "negativeness", of steal by a couple of orders of magnitude. After this change, steal time often rounds to 0 ms where it did not before.
  • Because the reduction above still does not prevent small negative steal completely, we change the implementation of the metric to effectively clamp steal at 0 during periods where the raw value was negative (a sketch follows below).
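
A minimal sketch of that clamping idea (the struct and names here are illustrative only, modeled on the diff discussed later in this PR, not the exact patch):

#include <algorithm>
#include <chrono>

// Sketch: accumulate only the positive deltas of the raw ("true") steal, so the
// value reported as a metric never decreases even when the raw value dips.
struct steal_clamp {
    std::chrono::nanoseconds _last_true_steal{0};
    std::chrono::nanoseconds _last_mono_steal{0};

    // true_steal = awake time (wall clock) - thread CPU time; may be slightly negative.
    std::chrono::nanoseconds report(std::chrono::nanoseconds true_steal) {
        auto mono_steal = _last_mono_steal +
                std::max(true_steal - _last_true_steal, std::chrono::nanoseconds{0});
        _last_true_steal = true_steal;  // remember the raw value for the next observation
        _last_mono_steal = mono_steal;  // monotonic value exposed as the metric
        return mono_steal;
    }
};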

The individual changes have further details. I am open to splitting commits that may be less popular or require more discussion into a different PR if it makes sense.

@@ -578,6 +578,8 @@ public:

steady_clock_type::duration total_idle_time();
Member

awake-busy
awake-idle
sleeping

Most propitious, we were just discussing this on our own customer issue.

r = io_pgetevents(_polling_io.io_context, 1, batch_size, batch, tsp, active_sigmask);
if (may_sleep) {
_r._total_sleep += sched_clock::now() - before_getevents;
}
Member

How come only one backend gets this adjustment?

Contributor Author

Probably because it's the one we use currently. This change is from a year or more ago and was dusted off. Let me check the other backends.

Contributor Author

Right, it's slowly coming back: updating the other reactor backends was part of why I never upstreamed this in the first place. Doing that now.

Contributor Author

In the last push this is fixed for all 3 reactor backends.

@avikivity
Member

The changes look reasonable, but isn't steal time terminally broken?

thread_cputime_clock measures both kernel and user time. (Wall time) - (sleep time) overlaps that, since some of the sleep time is spent in the sleeping system call. For epoll/linux-aio this would usually be tiny, but for io_uring it could be quite large since the same call is used both to push I/O requests and to sleep. Perhaps we should adjust the io_uring backend to only sleep if it has no events to push.
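
A hypothetical numeric illustration of that overlap (all numbers made up, purely to show how the sign goes negative):

#include <chrono>

using namespace std::chrono_literals;

int main() {
    // Hypothetical 100 ms sample on a host with no real steal (illustrative numbers only).
    auto wall           = 100ms;
    auto actual_sleep   = 40ms;  // time truly descheduled in the blocking call
    auto measured_sleep = 45ms;  // the bracketed interval also caught ~5 ms of syscall/reap work
    auto cpu            = 60ms;  // CLOCK_THREAD_CPUTIME: user + kernel, includes those 5 ms

    auto awake = wall - measured_sleep;  // 55 ms, under-counted (really 60 ms)
    auto steal = awake - cpu;            // 55 ms - 60 ms = -5 ms: spurious negative steal
    (void)actual_sleep; (void)steal;
    return 0;
}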

@travisdowns
Contributor Author

travisdowns commented Aug 19, 2024

thread_cputime_clock measures both kernel and user time. (Wall time) - (sleep time) overlaps that, since some of the sleep time is spent in the sleeping system call. For epoll/linux-aio this would usually be tiny

Yes, exactly: that is the underlying flaw, which this series only improves quantitatively but does not eliminate. Another "bug" is that we can't actually tell whether we are going to sleep in the system call: may_sleep can be true, but we may find IO right away, in which case the getevents call returns immediately and we did not sleep at all, yet all of that work gets counted as sleep; this can repeat at high frequency for some IO patterns. However, in practice steal has been quite useful for us in a variety of incidents, even in its flawed state.

That said, for aio specifically this adjustment of the timing points gave good results in practice. I don't have good numbers here since this is an old change, but based on my long-ago testing, pushing down the sleep measurement reduced negative steal to a very small amount, often rounding to 0 (in ms) over the sample period. If there is almost any "true" steal, it is now almost always enough to overwhelm the negative part. In practice, under load we almost always see significant "true" steal even if nothing else is running, due to the syscall thread which is constantly working to handle EAGAIN IOs :(.

This is why we need the second part of this change, which simply discards negative steal periods (note that this is different from merely preventing the steal counter from going down).

but for io_uring it could be quite large since the same call is used both to push I/O requests and to sleep. Perhaps we should adjust the io_uring backend to only sleep if it has no events to push.

Is that true? Currently it looks like we still use two calls in the uring backend: io_uring_submit to push and io_uring_wait_cqes to wait. There were some changes in this area, I think in order to allow a single syscall, but they were reverted; even with that change, this method still had the separate submit + wait calls.

@travisdowns
Contributor Author

Here are some results for a really bad case for the existing steal time calculation. This is running io_tester with this patch (which prints out steal time metrics periodically) and this config:

- name: steal_printer
  shards: all
  type: steal
  shard_info:
    parallelism: 1
    think_time: 10ms

- name: big_writes
  shards: all
  type: seqwrite
  shard_info:
    parallelism: 1
    reqsize: 1kB
    # shares: 10
    think_time: 0us

and running it like:

ninja io_tester && apps/io_tester/io_tester --duration=10 --conf apps/io_tester/conf.yaml --smp=2 --reactor-backend=linux-aio --idle-poll-time-us=0 --poll-aio=1

Prior to this change, the output is:

Starting evaluation...
0 steal: -42.983 ms awake: 56.688 cpu: 99.671
1 steal: -41.848 ms awake: 57.782 cpu: 99.630
0 steal: -42.842 ms awake: 56.878 cpu: 99.721
1 steal: -43.470 ms awake: 56.266 cpu: 99.737
1 steal: -43.937 ms awake: 55.791 cpu: 99.727
0 steal: -44.238 ms awake: 55.531 cpu: 99.769
1 steal: -42.527 ms awake: 57.136 cpu: 99.663

I.e., every 100 ms we see roughly -40 ms of (negative) steal, because we think we are awake for only ~60 ms. Of course, we know that this workload with 0 think time is actually awake 100% of the time, because as soon as any IO finishes we submit another one.

After the fix, the output is:

Starting evaluation...
0 steal:  0.445 ms awake: 100.129 cpu: 99.684
1 steal:  0.415 ms awake: 100.097 cpu: 99.682
0 steal:  0.374 ms awake: 100.135 cpu: 99.761
1 steal:  0.396 ms awake: 100.126 cpu: 99.729
0 steal:  0.419 ms awake: 100.147 cpu: 99.728
1 steal:  0.414 ms awake: 100.121 cpu: 99.707
0 steal:  0.428 ms awake: 100.151 cpu: 99.723
1 steal:  0.413 ms awake: 100.133 cpu: 99.720
0 steal:  0.442 ms awake: 100.152 cpu: 99.710
1 steal:  0.420 ms awake: 100.119 cpu: 99.699
...

I.e., we correctly identify that we are awake all the time, and steal is low and positive (this is probably real steal; it's running on my noisy laptop).

When I run a stress -c worker thread using 100% of the CPU on shard 0, we get output like this (after the fix):

Starting evaluation...
1 steal:  0.424 ms awake: 100.137 cpu: 99.713
0 steal: 63.933 ms awake: 120.246 cpu: 56.314
1 steal:  0.595 ms awake: 100.165 cpu: 99.570
0 steal: 48.330 ms awake: 106.006 cpu: 57.676
1 steal:  0.619 ms awake: 100.153 cpu: 99.534
0 steal: 63.599 ms awake: 118.001 cpu: 54.402
1 steal:  0.445 ms awake: 100.135 cpu: 99.690
0 steal: 48.325 ms awake: 105.988 cpu: 57.663
1 steal:  0.578 ms awake: 100.156 cpu: 99.577
0 steal: 63.855 ms awake: 117.987 cpu: 54.132
1 steal:  0.528 ms awake: 100.166 cpu: 99.638
0 steal: 48.104 ms awake: 106.009 cpu: 57.906
1 steal:  0.496 ms awake: 100.147 cpu: 99.651
0 steal: 63.986 ms awake: 117.964 cpu: 53.978

I.e., it is correctly showing ~50% steal on shard 0, as we'd expect from a single competing 100% CPU thread. Pre-fix, it varies from 20% to 40% steal.

@travisdowns
Contributor Author

travisdowns commented Aug 20, 2024

For io_uring the benefit is much smaller on that specific test, probably because the overall io_uring behavior is quite different: it does not have the "poll-aio=1" behavior, so it does go to sleep even when disk IO is outstanding, unlike the aio backend by default.

Here's the same output for io_uring (with 1000ms output interval now to provide some more averaging as the numbers bounce around a lot):

pre-fix:

0 steal: -69.508 ms awake: 95.746 cpu: 165.254
1 steal: -82.224 ms awake: 117.205 cpu: 199.429
0 steal: -69.770 ms awake: 97.488 cpu: 167.258
1 steal: -82.169 ms awake: 116.266 cpu: 198.435
0 steal: -70.218 ms awake: 96.821 cpu: 167.039
1 steal: -74.251 ms awake: 103.012 cpu: 177.262
0 steal: -70.684 ms awake: 99.314 cpu: 169.998
1 steal: -66.691 ms awake: 90.682 cpu: 157.372
0 steal: -68.950 ms awake: 95.269 cpu: 164.219
1 steal: -69.599 ms awake: 94.950 cpu: 164.549
0 steal: -73.325 ms awake: 104.437 cpu: 177.763
1 steal: -69.886 ms awake: 95.770 cpu: 165.656
0 steal: -72.801 ms awake: 105.222 cpu: 178.023
1 steal: -71.757 ms awake: 101.039 cpu: 172.796
0 steal: -69.816 ms awake: 100.048 cpu: 169.864
1 steal: -73.340 ms awake: 105.095 cpu: 178.435
0 steal: -69.226 ms awake: 97.243 cpu: 166.469
1 steal: -71.744 ms awake: 98.021 cpu: 169.765

post-fix:

Starting evaluation...
0 steal: -36.019 ms awake: 136.256 cpu: 172.275
1 steal: -38.107 ms awake: 141.260 cpu: 179.367
0 steal: -34.019 ms awake: 128.660 cpu: 162.679
1 steal: -33.000 ms awake: 123.472 cpu: 156.472
1 steal: -33.379 ms awake: 124.124 cpu: 157.503
0 steal: -37.188 ms awake: 134.169 cpu: 171.357
1 steal: -40.808 ms awake: 141.996 cpu: 182.804
0 steal: -33.975 ms awake: 126.294 cpu: 160.270
1 steal: -41.464 ms awake: 143.828 cpu: 185.292
0 steal: -33.172 ms awake: 123.730 cpu: 156.902
1 steal: -34.853 ms awake: 121.107 cpu: 155.960
0 steal: -35.433 ms awake: 130.999 cpu: 166.432
1 steal: -34.941 ms awake: 128.392 cpu: 163.333
0 steal: -35.278 ms awake: 130.750 cpu: 166.028
1 steal: -33.433 ms awake: 126.931 cpu: 160.364
0 steal: -33.842 ms awake: 129.780 cpu: 163.622
1 steal: -31.123 ms awake: 119.651 cpu: 150.773
0 steal: -36.534 ms awake: 135.890 cpu: 172.424

So there is a definite reduction in false steal, from about 6-7% to about 3.5%, but it's still significant for that backend.

The negative steal is about the same for io_uring even without the IO workload (or without the --idle-poll-time-us=0 arg), and the improvement from this fix is about the same.

@travisdowns
Contributor Author

travisdowns commented Aug 20, 2024

I wanted to add that for all backends we will in some cases (when we get all the way to the underlying backend "wait for events" function) make 2 timestamp calls instead of 1, since previously the "after" timestamp was shared with the idle_end logic, but now it is not.

However, in other cases we will take 0 timestamps instead of 1, because in try_sleep we abandon the attempt to sleep in many cases and never get to the sleep location. For aio I expect this latter case to be very common since we don't go to sleep if there is any outstanding IO in the default --poll-aio=1 mode. For other backends it is less clear.

Both of these cases are on the "idle" or "about to sleep" paths though so probably less performance sensitive.

@travisdowns
Contributor Author

@avikivity or @xemul let me know if there's anything I can do to help move this along or also if it's DOA etc.

@avikivity
Member

@avikivity or @xemul let me know if there's anything I can do to help move this along or also if it's DOA etc.

@xemul can you handle it?

@travisdowns
Contributor Author

@xemul any hope that you can take a peek at this one? In our experience this is a big improvement to the accuracy and hence usefulness of the steal time metric.

@xemul
Contributor

xemul commented Oct 14, 2024

Sorry for losing track of this one :( I'll review it shortly

sm::make_counter("awake_time_ms_total", [this] () -> int64_t { return total_awake_time() / 1ms; },
sm::description("Total reactor awake time (wall_clock)")),
sm::make_counter("cpu_used_time_ms", [this] () -> int64_t { return total_cpu_time() / 1ms; },
sm::description("Total reactor thread CPU time (from CLOCK_THREAD_CPUTIME)")),
Contributor

OK, so in these terms:

awake-busy
awake-idle
sleeping

The (existing) cpu_busy_time is awake_busy, and the (new) awake_time_ms_total is awake_busy + awake_idle. Is that correct?

Contributor Author

Yes, exactly.
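
Spelled out with made-up numbers, the bookkeeping being confirmed here is roughly the following (a sketch, not the actual Seastar code; metric names from the hunk above):

#include <chrono>

using namespace std::chrono_literals;

int main() {
    // Hypothetical 100 ms wall-clock interval (illustrative numbers only).
    auto awake_busy = 70ms;                // existing cpu_busy_time
    auto awake_idle = 20ms;                // polling for work without making progress
    auto sleeping   = 10ms;                // blocked in the OS

    auto awake = awake_busy + awake_idle;  // new awake_time_ms_total: 90 ms
    auto wall  = awake + sleeping;         // 100 ms elapsed
    auto cpu   = 90ms;                     // cpu_used_time_ms (CLOCK_THREAD_CPUTIME)

    auto steal = awake - cpu;              // ~0 ms: the reactor got all the CPU it asked for
    (void)wall; (void)steal;
    return 0;
}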

// But what we have here should be good enough and at least has a well defined meaning.
return std::chrono::duration_cast<std::chrono::nanoseconds>(now() - _start_time - _total_sleep) -
std::chrono::duration_cast<std::chrono::nanoseconds>(thread_cputime_clock::now().time_since_epoch());
return total_awake_time() - total_cpu_time();
Contributor

I'd really appreciate it if this change went into the previous (or a separate, whichever is simpler for you) patch. It has nothing to do with updating the way _total_sleep is calculated; it just re-uses the helper methods introduced by the previous change.

Contributor Author

Will do.

Contributor Author

I have moved this to the prior commit in the most recent push.

auto mono_steal = _last_mono_steal + std::max(true_steal - _last_true_steal, 0ns);

_last_true_steal = true_steal;
_last_mono_steal = mono_steal;
Contributor

In other words, the mono_steal is the sum of all positive changes in true_steal, and the "change" is calculated only at times when this function is called, right?

Contributor Author

Yes, exactly. It does introduce an "observation effect" in some sense: the result can differ depending on how often you observe the metric. It's the best I could come up with, though, and it seems to work well in practice, since the observation points are exactly where you care about the value.
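
To make that concrete, here is a small self-contained illustration (made-up numbers; the clamped() helper just mirrors the snippet above) of how the reported value depends on the sampling points:

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <vector>

using namespace std::chrono_literals;

// Apply the clamping rule above to a series of raw ("true") steal samples.
static std::chrono::nanoseconds clamped(const std::vector<std::chrono::nanoseconds>& samples) {
    std::chrono::nanoseconds last{0}, mono{0};
    for (auto s : samples) {
        mono += std::max(s - last, std::chrono::nanoseconds{0});
        last = s;
    }
    return mono;
}

int main() {
    // Raw cumulative steal dips to -3 ms and then recovers to +5 ms.
    std::vector<std::chrono::nanoseconds> raw{0ms, -3ms, 5ms};
    std::printf("sampled at every point: %lld ms\n", (long long)(clamped(raw) / 1ms));  // 8 ms
    std::printf("sampled first and last: %lld ms\n",
                (long long)(clamped({raw.front(), raw.back()}) / 1ms));                 // 5 ms
    return 0;
}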

Contributor

That's unfortunate :( Let's try to keep this "update rate" under some control. I can imagine two ways, neither of which is perfect either.

First -- in addition to this place, update the steal time (or maybe just the mono-steal part) when the reactor stops sleeping.
Second -- arm a (lowres?) timer that updates the steal time, so this getter only reports what has been accumulated so far.

Contributor Author

The main problem I see is that the steal time calculation is quite a bit more expensive than the other time-related calculations, since it involves thread CPU time, which requires a system call.

I do agree that both of these suggestions are more accurate in some sense (basically, the faster the update rate, the more accurate), but the only/primary purpose of steal time is to be reported as a metric, so tying it to the metrics pull rate seems like a reasonable match. In practice the amount of error is very small after this change, at least for the aio backend, so we can be a bit looser with the update strategy. The typical interesting case is sustained non-zero steal, and this approach works fine there.

reactor::_total_sleep, a std::chrono::duration, was left uninitialized
in the constructor, potentially leading to wild values.
@travisdowns
Contributor Author

b6fac76 is purely a rebase to the tip of master.

Conceptually at all wall-clock moments the reactor is in one of three
states:

awake-busy
awake-idle
sleeping

Roughly speaking, awake-busy is when the reactor has tasks to run or IO
to submit or reap, or there are messages waiting for this shard on the
cross-shard queues. This is the time counted as utilized in
"reactor utilization".

awake-idle means that the reactor is still running (or wanting to run
on the CPU) but does not have any work to do _now_; rather, it spends
its time polling the cross-shard queues for new work, checking whether
IO has arrived to reap, etc. From the OS point of view it is still
demanding CPU time just like in the awake-busy state, but in the
internal reactor accounting we know it is not making forward progress
on work, just waiting for work to arrive.

sleeping - after some time in the awake-idle state, and if certain
other conditions are met, the reactor may actually go to sleep. In this
state it uses an OS blocking call like io_getevents with a timeout to
deschedule itself from the CPU and be woken only when additional work
arrives (detected by the OS blocking mechanism). In this state the
reactor is not using additional CPU.

Having a clear view of these states in the reactor metrics is useful,
and this change adds all three metrics, with names similar to the
above. We track sleep time by taking timestamps when we start and stop
"sleep". This accounting is key for things like steal time, which
subtracts sleep time from real time to determine how long the reactor
wanted to run.
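
A rough sketch of the shape this series moves toward (generic; the struct
and function names here are hypothetical, not the actual Seastar code, and
the real change is per reactor backend, as in the linux-aio hunk earlier
in this PR):

#include <chrono>

// Sketch: bracket only the blocking wait itself, and only charge the
// interval to the sleep accumulator when we actually intended to block.
using sched_clock = std::chrono::steady_clock;

struct sleep_tracker {
    sched_clock::duration total_sleep{};

    template <typename BlockingWait>
    auto wait_and_account(bool may_sleep, BlockingWait&& wait) {
        auto before = sched_clock::now();
        auto r = wait();                                 // e.g. a getevents-style call with a timeout
        if (may_sleep) {
            total_sleep += sched_clock::now() - before;  // still counted even if we woke immediately
        }
        return r;
    }
};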

Currently we take the start timestamp far too early, and also in many
cases where we do not actually sleep: we take it before we "reap events
+ maybe sleep", but often we do not sleep yet still count the interval
as sleep time. So significant work can happen in time that is recorded
as "sleep".

One specific effect of this is that steal time can be negative, because
of the way steal time is calculated: inflated sleep time leads to the
CPU time used (rusage) being greater than the reactor awake time (wall
clock based).

After this change, steal time can still be slightly negative but the
effect is reduced by 1 or 2 orders of magnitude.
Six years ago this function was added/emptied, but it hasn't had
anything added since to give it a purpose in life, so it seems like a
good time to remove it.
total_steal_time() can decrease in value from call to call, a violation
of the rule that a counter metric must never decrease.

This happens because steal time is "awake time (wall clock)" minus
"cpu thread time (get_rusage style CPU time)". The awake time is itself
composed of "time since reactor start" minus "accumulated sleep time",
but it can be under-counted because sleep time is over-counted: it is
not possible to determine the true sleep time exactly; we can only take
timestamps before and after a period that we think might involve a
sleep.

Currently, sleep is over-counted even more than the inherent error
described above, as it is measured around a region that includes
significant non-sleep work.

The result is that when there is little to no true steal, CPU time will
exceed measured awake wall clock time, resulting in negative steal.

This change "fixes" this by enforcing that steal time is monotonically
increasing. This occurs at measurement time by checking if "true steal"
(i.e., the old definition of steal) has increased since the last
measurement and adding that delta to our monotonic steal counter if so.
Otherwise the delta is dropped.

While not totally ideal, this leads to a useful metric which mostly
clamps away the error related to negative steal time and, more
importantly, avoids the catastrophic failure of PromQL functions when
they are applied to a non-monotonic counter.

Fixes scylladb#1521.
Fixes scylladb#2374.
@travisdowns
Contributor Author

@xemul all feedback addressed or at least responded to.
