Address non-monotonicity of steal time and other issues #2390
base: master
Conversation
@@ -578,6 +578,8 @@ public:
steady_clock_type::duration total_idle_time();
awake-busy
awake-idle
sleeping
Most propitious, we were just discussing this on our own customer issue.
r = io_pgetevents(_polling_io.io_context, 1, batch_size, batch, tsp, active_sigmask);
if (may_sleep) {
    _r._total_sleep += sched_clock::now() - before_getevents;
}
How come only one backend gets this adjustment?
Probably because it's the one we use currently. This change is from a year or more ago and was dusted off. Let me check the other backends.
Right, it's slowly coming back: updating the other reactor backends was part of why I never upstreamed this in the first place. Doing that now.
In the last push this is fixed for all 3 reactor backends.
The changes look reasonable, but isn't steal time terminally broken? thread_cputime_clock measures both kernel and user time. (Wall time) - (sleep time) overlaps that, since some of the sleep time is spent in the sleeping system call. For epoll/linux-aio this would usually be tiny, but for io_uring it could be quite large since the same call is used both to push I/O requests and to sleep. Perhaps we should adjust the io_uring backend to only sleep if it has no events to push.
Yes exactly, that is the underlying flaw here, which is only improved quantitatively here but not eliminated. Another "bug" is that we can't actually even tell if we are going to sleep in the system call: However, for aio specifically this adjustment of the timing points provided good results in practice. I don't have good numbers here since this is an old change, but based on my long-ago testing, pushing down the sleep calculation reduced negative steal to a very small amount, often rounding to 0 (in ms) over the sample period. If there was almost any "true" steal it is now almost always enough to overwhelm the negative steal part. In practice under load we almost always see significant "true" steal even if nothing else is running, due to the syscall thread which is constantly working to handle EAGAIN IOs :(. So this is why we need the second part of this change, which simply discards negative steal periods (note that this is different from simply preventing the steal counter from going down).
Is that true? Currently it looks like we use two calls still in the uring backend:
Here are some results for a really bad case for the existing steal time calculation. This is running
and running it like:
Prior to this change, the output is:
I.e., every 100 ms, we see ~40 ms of steal, because we think we are awake only ~60ms. Of course, we know that this workload with 0 think time is actually awake 100% of the time because as soon as any IO finishes we submit another one. After the fix, the output is:
I.e., we correctly identify that we are awake all the time and steal is low and positive (this is probably real steal; it's running on my noisy laptop). When I run a
I.e., it is correctly showing 50% steal on shard 0, as we'd expect from a single competing 100% CPU thread. Pre-fix, it varies from 20% to 40% steal.
For Here's the same output for pre-fix:
post-fix:
So there is a definite reduction in false steal, from about 6-7% to about 3.5%, but it's still significant for that backend. The negative steal is about the same for
Force-pushed from 1afde33 to 32a9753.
I wanted to add that for all backends we will in some cases (the case where we get all the way to the underlying backend "wait for events" function) make 2 timestamp calls instead of 1, since before the "after" timestamp was shared with the idle_end logic, but now it is not. However, in other cases we will take 0 timestamps instead of 1. Both of these cases are on the "idle" or "about to sleep" paths though, so probably less performance sensitive.
@avikivity or @xemul let me know if there's anything I can do to help move this along, or also if it's DOA etc.
@xemul can you handle it?
@xemul any hope that you can take a peek at this one? In our experience this is a big improvement to the accuracy and hence usefulness of the steal time metric.
Sorry for losing track of this one :( I'll review it shortly |
sm::make_counter("awake_time_ms_total", [this] () -> int64_t { return total_awake_time() / 1ms; },
                 sm::description("Total reactor awake time (wall_clock)")),
sm::make_counter("cpu_used_time_ms", [this] () -> int64_t { return total_cpu_time() / 1ms; },
                 sm::description("Total reactor thread CPU time (from CLOCK_THREAD_CPUTIME)")),
OK, so in these terms:
awake-busy
awake-idle
sleeping
The (existing) cpu_busy_time is awake_busy, the (new) awake_time_ms_total is awake_busy + awake_idle. Is that correct?
Yes, exactly.
src/core/reactor.cc
Outdated
// But what we have here should be good enough and at least has a well defined meaning.
return std::chrono::duration_cast<std::chrono::nanoseconds>(now() - _start_time - _total_sleep) -
       std::chrono::duration_cast<std::chrono::nanoseconds>(thread_cputime_clock::now().time_since_epoch());
return total_awake_time() - total_cpu_time();
I'd really appreciate if this change went to previous (or separate, whichever is simpler for you) patch. It has nothing to do with updating the way _total_sleep is calculated, but is just re-using the helper methods introduced by previous change
Will do.
I have moved this to the prior commit in the most recent push.
auto mono_steal = _last_mono_steal + std::max(true_steal - _last_true_steal, 0ns);

_last_true_steal = true_steal;
_last_mono_steal = mono_steal;
In other words, the mono_steal is the sum of all positive changes in true_steal, and the "change" is calculated only at times when this function is called, right?
Yes, exactly. It does introduce an "observation effect" in some sense: the results may be different depending on how often you observe the metric, but it's the best I could come up with, and it seems to work well in practice since the observation points are where you care about this.
That's unfortunate :( Let's try to keep this "update rate" under some control. I can imagine two ways; neither is perfect either.
First -- together with this place, update steal time (or maybe mono-steal part) when reactor stops sleeping.
Second -- arm a (lowres?) timer that will update the steal time, this getter only reports what had been accumulated so far.
The main problem I see is the steal time calculation is quite a bit more expensive than the other time related calculations, since it involves thread time, so a system call is required.
I do agree that both of these suggestions are more accurate in some sense (basically, the faster the update rate, the more accurate), but the only/primary purpose of steal time is to be reported as a metric, so tying it to the metrics pull rate seems like a reasonable match. In practice the amount of error is very small after this change, at least for the aio backend, so we can be a bit looser with the update strategy. The typical case that's interesting is that you have some sort of sustained non-zero steal and this will work fine there.
reactor::_total_sleep, a std::chrono::duration, was left uninitialized in the constructor, potentially leading to wild values.
Force-pushed from 32a9753 to b6fac76.
b6fac76 is purely a rebase to the tip of master.
Conceptually, at all wall-clock moments the reactor is in one of three states:

awake-busy
awake-idle
sleeping

Roughly speaking, awake-busy is when the reactor has tasks to run or IO to submit or reap, or there are waiting messages for this shard on the cross-shard queues. This is the time counted as utilized in "reactor utilization".

awake-idle means that the reactor is still running (or wanting to run on the CPU), but does not have any work to do _now_; rather, it spends its time polling the cross-shard queues for new work, seeing if IO has arrived to reap, etc. From the OS point of view it is still demanding CPU time just like the awake-busy state, but in the internal reactor accounting we know it is not making forward progress on work, just waiting for work to arrive.

sleeping: after some time in the awake-idle state, and if certain other conditions are met, the reactor may actually go to sleep. In this state it uses an OS blocking call like io_getevents with a timeout to deschedule itself from the CPU and be woken only when additional work arrives (which will be detected by the OS blocking mechanism). In this state the reactor is not using additional CPU.

Having a clear view of these states in the reactor metrics is useful, and this change adds all three metrics with names similar to the above.
We track sleep time by taking timestamps before and after we start and stop "sleep". This accounting is key for things like steal time, which subtracts sleep time from real time to determine the time the reactor wanted to run. Currently we take the start timestamp far too early, and also in many cases where we do not actually sleep: we take it before we "reap events + maybe sleep", but in many cases we do not sleep, yet count it as sleep time. So we can do significant work in time which is being recorded as "sleep". One specific effect of this is that steal time can be negative, because of the way steal time is calculated: inflated sleep time leads to CPU time used (rusage) being greater than reactor awake time (wall clock based). After this change, steal time can still be slightly negative, but the effect is reduced by 1 or 2 orders of magnitude.
Six years ago this function was added/emptied, but nothing has been added since to give it a purpose in life, so it seems like a good time to remove it.
total_steal_time() can decrease in value from call to call, a violation of the rule that a counter metric must never decrease.

This happens because steal time is "awake time (wall clock)" minus "cpu thread time (get_rusage style CPU time)". The awake time is itself composed of "time since reactor start" minus "accumulated sleep time", but it can be under-counted because sleep time is over-counted: it's not possible to determine exactly the true sleep time, only to get timestamps before and after a period you think might involve a sleep. Currently, sleep is even more significantly over-counted than the error described above, as it is measured at a point which includes significant non-sleep work. The result is that when there is little to no true steal, CPU time will exceed measured awake wall clock time, resulting in negative steal.

This change "fixes" this by enforcing that steal time is monotonically increasing. This occurs at measurement time by checking if "true steal" (i.e., the old definition of steal) has increased since the last measurement and adding that delta to our monotonic steal counter if so. Otherwise the delta is dropped. While not totally ideal, this leads to a useful metric which mostly clamps away the error related to negative steal times, and more importantly avoids the catastrophic failure of PromQL functions when used on non-monotonic functions.

Fixes scylladb#1521. Fixes scylladb#2374.
Force-pushed from 8208958 to 8efeeb0.
@xemul all feedback addressed or at least responded to.
This series primarily addresses the problem that on systems with low amounts of steal, steal time appears negative (i.e., the cumulative steal time counter goes down from sample to sample).
This is wrong on the face of it, and it also causes serious problems as a metric in Prometheus, since the counter contract (monotonic increase) is violated. This causes spurious "counter reset" detection in Prometheus and hence bogus, very large or very small steal time results in rate (or similar) queries.

This is addressed in two ways:

1. Sleep time measurement is pushed down to tightly bracket the actual blocking call, so non-sleep work is no longer counted as sleep.
2. The reported steal time is made monotonic by accumulating only the positive deltas of the underlying "true steal" value and discarding negative ones.

The individual changes have further details. I am open to splitting commits that may be less popular or require more discussion into a different PR if it makes sense.