I love a good bug, especially ones that are initially hard to explain but then turn into forehead slapping moments - of course!
There's a bug over on Github called Hysteresis effect on threadpool hill-climbing that is a super interesting read.
Hill climbing is an algorithmic technique when you have a problem (a hill) and then you improve and improve (climb) until you have reached some maximum acceptable solution (reach the top).
Sebastian, the OP of the bug, says there's a "Hysteresis effect" on the threadpool. "Hysteresis is the dependence of the state of a system on its history." Something weird is happening now because something happened before...but what was it?
Sawtooth up and down graphs aren't THAT interesting...but look at the x-axis. This isn't showing minute by minute or even millisecond by millisecond ups and downs like you may have seen before. This x-axis uses months as its unit of measure. Read that again and drink it in.
Things are cool in February until they are bad for weeks and then cool in March and it moves along a cycle. Not strictly the cycle of the moon but close.
He's seeing the number of cores being used changing from month to month when using PortableThreadPool
We have noticed a periodic pattern on the threadpool hill-climbing logic, which uses either
n-cores
orn-cores + 20
with an hysteresis effect that switches every 3-4 weeks.
Did you know (I know because I'm old) that Windows 95, for a time, was unable to run longer than 49.7 days of runtime? If you ran it that long it would eventually crash! That's because there's 86M ms in a day, i.e. 1000 * 60 * 60 * 24 = 86,400,000 and 32 bits is 4,294,967,296 so 4,294,967,296 / 86,400,000 = 49.7102696 days!
Kevin in the Github issues notes this as well:
The whole period of square wave sounds an awful lot like it is around 49.7 days. That is how long it takes
GetTickCount()
to wrap around. On POSIX platforms the platform abstraction layer implements this, and the value returned for that is based not on uptime but on wall clock time, which matches with all machines changing on the same day.
This 49.7 days number is well known as it's how long it take before GetTickCount() overflows/wraps. Kevin goes on to actually give the changeover dates which correspond to the graph!
- Thu Jan 14 2021
- Sun Feb 07 2021
- Thu Mar 04 2021
- Mon Mar 29 2021
- Fri Apr 23 2021
He then finds the code in PortableThreadPool.cs that explains the issue:
private bool ShouldAdjustMaxWorkersActive(int currentTimeMs)
{
// We need to subtract by prior time because Environment.TickCount can wrap around, making a comparison of absolute times unreliable.
int priorTime = Volatile.Read(ref _separated.priorCompletedWorkRequestsTime);
int requiredInterval = _separated.nextCompletedWorkRequestsTime - priorTime;
int elapsedInterval = currentTimeMs - priorTime;
if (elapsedInterval >= requiredInterval)
{
...
He says, and this is all Kevin:
currentTimeMs is Environment.TickCount
, which in this case happens to be negative.
The if clause controls if the hill climbing is even run.
_separated.priorCompletedWorkRequestsTime
and _separated.nextCompletedWorkRequestsTime
start out as zero on process start, and only get updated if the hill climbing code is run.
Therefore, requiredInterval = 0 - 0
and elapsedInterval = negativeNumber - 0
. This causes the if statement to become
if (negativeNumber - 0 >= 0 - 0)
which returns false, so the hill climbing code is not run, and therefore the variables never get updated, and remain zero. The native version of the thread pool code does all math with unsigned numbers which would avoid such a bug, and it's equivalent part is not even quite the same math in the first place.
The easy fix here is probably to use unsigned arithmetic, but alternatively having the two fields get initialized to Environment.TickCount probably also work
Back to me. Fabulous. The fix is to case the results to unsigned ints via (uint).
Before:
int requiredInterval = _separated.nextCompletedWorkRequestsTime - priorTime;
int elapsedInterval = currentTimeMs - priorTime;
After:
uint requiredInterval = (uint)(_separated.nextCompletedWorkRequestsTime - priorTime);
uint elapsedInterval = (uint)(currentTimeMs - priorTime);
What an interesting and insidious bug! Bugs based on a time calculations can often show themselves later when view through a longer lens and scope of time...sometimes WAY longer than you'd expect.
Sponsor: YugabyteDB is a distributed SQL database designed for resilience and scale. It is 100% open source, PostgreSQL-compatible, enterprise-grade, and runs across all clouds. Sign up and get a free t-shirt.
© 2021 Scott Hanselman. All rights reserved.