I wonder if bisect is the optimal algorithm for this kind of case. Checking for the error still existing still takes an average of ≈500 iterations before a fail, checking for the error not existing takes 10,000 iterations, 20 times longer, so maybe biasing the bisect to only skip 1/20th of the remaining commits, rather than half of them would be more efficient.
Basically it calculates the commit to test at each step which gains the most information, under some trivial assumptions. The calculation is O(N) in the number of commits if you have a linear history, but it requires prefix-sum which is not O(N) on a DAG so it could be expensive if your history is complex.
Never got round to integrating it into git though.
That's a cool idea. Would also be interesting to consider the size of the commit - a single 100-line change is probably more likely to introduce a bug than 10 10-line changes.
There's an additional stopping problem here that isn't present in a normal binary search. Binary search assumes you can do a test and know for sure whether you've found the target item, a lower item, or a higher item. If the test itself is stochastic and you don't know how long you have to run it to get the hang, I'd think you'd get results faster by running commits randomly and excluding them from consideration when they hang. Effectively, you're running all the commits at the same time instead of working on one commit and not moving on until you've made a decision on it. Then at any time you will have a list of commits that have hanged and a list of commits that have not hanged yet, and you can keep the entire experiment running arbitarily long to catch the long-tail effects rather than having to choose when to stop testing a single non-hanging commit and move onto the next one.
I can see some interesting approaches here. Given n threads/workers you could divide the search space into n sample points (for simplicity let's divide it evenly) and run the repeated test on each point. When a point hangs, that establishes a new upper limit, all higher search points are eliminated, the workers reassigned in the remaining search space.
Given the uncertainty I can see how this might be more efficient, especially if the variance of the heisenbug is high.
If the factor in one direction is large enough then a linear search becomes more efficient. Say you have 20 commits remaining and the factor is 1,000x more costly to make it easier to picture. You're better off doing a linear search which guarantees you'll spend less than 2,000x searching the space.
That suggests that for a larger search space with a large enough difference, the optimal bisection point is probably not always the midpoint even if you know nothing about the distribution.
Perhaps someone can find the exact formula for selecting the next revision to search?
Each boot updates your empirical distribution. As a trivial example, if you have booted a version 9999 times with no hanging, a later version will likely give you more information per boot.
If they boot it 10,000 times for revisions that don't fail, and ~1,000 times for revisions that do fail, you can reach this number with log2(revisions) about 30.
I didn't mention it in the blog, but Paolo Bonzini was helping me and suggested I run the bootbootboot test for 24 hours, to make sure the bug wasn't latent in the older kernel. I got bored after 21 hours, which happened to be 292,612 boots.
Maybe it would have failed on the 292,613rd boot ...
Thanks for mentioning me, but you really did the work!
But in order to contribute something useful, as a rule of thumb you want to have 10 times as many passes than failures in order to reject a commit. If a bug has taken up to 2500 runs to reproduce, don't consider it a pass until 30000 runs have succeeded.
It's something to do with Poisson distributions. If you have 𝑛 runs before a failed run on average, and
you want to be 𝑃 % certain that a fix (including a revert or moving beyond the bug in a bisect) reduced the failure
rate, you can use the formula −
𝑛 ln (1 − 𝑃
/100) for how long to run, and the factor for 𝑃=99.99 is about 10.
In fact that means that once you had landed on a merge commit it was probably much better to switch to a linear backwards search because it might have fewer passing runs and passing runs are 10-15 times more expensive as failures. Is that what you did?
I've been on a similar quest for hard to reproduce, timing/hardware/... bugs, and if you're facing any kind of skepticism (your own or otherwise) it can be very comforting to have a 10x or even 100x no failure occurred confidence.
It's particularly comforting when the reason for the failure/fix/change in behavior isn't completely understood.
If the bug occurs reasonably often, say usually once every 10 minutes, you can model an exponential distribution of the intervals between the bug triggering and then use the distribution to "prove" the bug is fixed in cases where the root cause isn't clear: https://frdmtoplay.com/statistically-squashing-bugs/