I wonder if bisect is the optimal algorithm for this kind of case. Checking for ...

ajb · on June 14, 2023

There is actually a bayesian version which I wrote: https://github.com/ealdwulf/bbchop

Basically it calculates the commit to test at each step which gains the most information, under some trivial assumptions. The calculation is O(N) in the number of commits if you have a linear history, but it requires prefix-sum which is not O(N) on a DAG so it could be expensive if your history is complex.

Never got round to integrating it into git though.

defen · on June 14, 2023

That's a cool idea. Would also be interesting to consider the size of the commit - a single 100-line change is probably more likely to introduce a bug than 10 10-line changes.

thaliaarchi · on June 15, 2023

The commit that caused OP's bug changed two lines.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...

phist_mcgee · on June 14, 2023

You haven't met the developers at my last company.

muxator · on June 14, 2023

Hidden gem! Thanks!

pacaro · on June 14, 2023

Biasing a binary search would only be beneficial if you know something about the distribution of the search space

electroly · on June 14, 2023

There's an additional stopping problem here that isn't present in a normal binary search. Binary search assumes you can do a test and know for sure whether you've found the target item, a lower item, or a higher item. If the test itself is stochastic and you don't know how long you have to run it to get the hang, I'd think you'd get results faster by running commits randomly and excluding them from consideration when they hang. Effectively, you're running all the commits at the same time instead of working on one commit and not moving on until you've made a decision on it. Then at any time you will have a list of commits that have hanged and a list of commits that have not hanged yet, and you can keep the entire experiment running arbitarily long to catch the long-tail effects rather than having to choose when to stop testing a single non-hanging commit and move onto the next one.

pacaro · on June 14, 2023

I can see some interesting approaches here. Given n threads/workers you could divide the search space into n sample points (for simplicity let's divide it evenly) and run the repeated test on each point. When a point hangs, that establishes a new upper limit, all higher search points are eliminated, the workers reassigned in the remaining search space.

Given the uncertainty I can see how this might be more efficient, especially if the variance of the heisenbug is high.

bgirard · on June 14, 2023

If the factor in one direction is large enough then a linear search becomes more efficient. Say you have 20 commits remaining and the factor is 1,000x more costly to make it easier to picture. You're better off doing a linear search which guarantees you'll spend less than 2,000x searching the space.

That suggests that for a larger search space with a large enough difference, the optimal bisection point is probably not always the midpoint even if you know nothing about the distribution.

Perhaps someone can find the exact formula for selecting the next revision to search?

jwilk · on June 14, 2023

> You're better off doing a linear search which guarantees you'll spend less than 2,000x searching the space.

Almost. If only the last commit is slow, binary search is still faster.

bgirard · on June 14, 2023

> better off

Better off as in expected/average case. Good point, but only marginally better in the worse case.

mortehu · on June 14, 2023

Each boot updates your empirical distribution. As a trivial example, if you have booted a version 9999 times with no hanging, a later version will likely give you more information per boot.

coldtea · on June 14, 2023

Still, why would they need to reboot 292,612 times?

Is that supposed to be the log of the commit messages space?

remram · on June 14, 2023

If they boot it 10,000 times for revisions that don't fail, and ~1,000 times for revisions that do fail, you can reach this number with log2(revisions) about 30.

x86x87 · on June 14, 2023

read the article. they booted so many times to show that it was not reproducing. it's overkill but you don't need to boot 200k times

rwmj · on June 14, 2023

I didn't mention it in the blog, but Paolo Bonzini was helping me and suggested I run the bootbootboot test for 24 hours, to make sure the bug wasn't latent in the older kernel. I got bored after 21 hours, which happened to be 292,612 boots.

Maybe it would have failed on the 292,613rd boot ...

bonzini · on June 15, 2023

Thanks for mentioning me, but you really did the work!

But in order to contribute something useful, as a rule of thumb you want to have 10 times as many passes than failures in order to reject a commit. If a bug has taken up to 2500 runs to reproduce, don't consider it a pass until 30000 runs have succeeded.

It's something to do with Poisson distributions. If you have 𝑛 runs before a failed run on average, and you want to be 𝑃 % certain that a fix (including a revert or moving beyond the bug in a bisect) reduced the failure rate, you can use the formula − 𝑛 ln (1 − 𝑃 /100) for how long to run, and the factor for 𝑃=99.99 is about 10.

In fact that means that once you had landed on a merge commit it was probably much better to switch to a linear backwards search because it might have fewer passing runs and passing runs are 10-15 times more expensive as failures. Is that what you did?

rwmj · on June 15, 2023

> it was probably much better to switch to a linear backwards search

Ha ha, nope! I tested each commit starting at the earliest, and it was the last one in the merge :-(

opello · on June 14, 2023

I've been on a similar quest for hard to reproduce, timing/hardware/... bugs, and if you're facing any kind of skepticism (your own or otherwise) it can be very comforting to have a 10x or even 100x no failure occurred confidence.

It's particularly comforting when the reason for the failure/fix/change in behavior isn't completely understood.

bsilvereagle · on June 14, 2023

If the bug occurs reasonably often, say usually once every 10 minutes, you can model an exponential distribution of the intervals between the bug triggering and then use the distribution to "prove" the bug is fixed in cases where the root cause isn't clear: https://frdmtoplay.com/statistically-squashing-bugs/

quickthrower2 · on June 14, 2023

I think your p value is pretty good here

bonzini · on June 15, 2023

With about 1000 runs to reach a failure I think he has p=0.000001 or something like that.

x86x87 · on June 15, 2023

this is unacceptable :):):) only 21 hours!