That README is light on details. How is this different from selecting some N (and hoping it is high enough) and repeating your test case that many times? You just don't have to select a value for N using this tool?
The paper lists the algorithm (which is relatively simple) but basically it is much more efficient than repeating test cases.
You can see that that must be possible fairly easily. Consider two algorithms:
1. Classic binary search - test each element once and 100% trust the result.
2. Overkill - test each element 100 times because you don't trust the result one bit.
The former will clearly give you the wrong result most of the time, and the latter is extremely inefficiency. There's clearly a solution that's more efficient without sacrificing accuracy in-between.
Skimming the algorithm, it looks like they maintain Bayesian probabilities for each element being "the one" and then test an element 50% probability point each iteration, then update the probabilities accordingly. Basically a Bayesian version of the traditional algorithm.
You do still have to select an N, but it's not as critical that the N gives 100% guarantee of the flaky failure (which can be really difficult or even impossible to achieve). Unlike regular binary search, robust binary search doesn't permanently give up on the left or right half based on just a single result.