Table for detecting significant difference between two engines
(from a CCC post by Joseph Ciarrochi)
This table shows the percentage score needed to conclude that one engine is likely better than the other in
head-to-head competition.
Games | Cutoff = 5% | Cutoff = 1% | Cutoff = 0.1%
   10 |        75   |        85   |          95
   20 |        67.5 |        75   |          80
   30 |        63.3 |        70   |          73.3
   40 |        62.5 |        66.3 |          71.3
   50 |        61   |        65   |          68
   75 |        58.6 |        61.3 |          66
  100 |        57   |        60   |          63
  150 |        55.7 |        58.3 |          60
  200 |        54.8 |        57   |          59.8
  300 |        54.2 |        55.8 |          57.5
  500 |        53.1 |        54.3 |          55.3
 1000 |        52.2 |        53.1 |          54.1
Notes
- Based on 10,000 randomly generated samples per table entry. The values are therefore approximate, though with
such a large sample they should be close to the "true" values (a simulation sketch of this procedure is given
after these notes).
- Alpha (the cutoff) represents the percentage of time that a score at least this high occurs by chance (i.e.,
occurs even though we know the true value to be 0.50, or 50%). Alpha is essentially the probability of incorrectly
concluding that two engines differ in head-to-head competition.
- Traditionally, 0.05 alpha is used as a cut-off, but I think this is a bit too lenient. I would recommend 1%
or 0.1%, to be reasonably confident.
- Draw rate assumed to be 0.32 (based on CEGT 40/40 draw rates). Variations in draw rate will slightly affect the
cut-off levels, but I don't think the difference will be big.
- Engines assumed to play equal numbers of games as white and black.
- Where a particular score fell both above and below the cutoff, the next score above the cutoff was chosen. This
leads to conservative estimates (e.g., for n = 10, a score of 7 occurred both above and below the 5% cutoff;
therefore, 7.5 became the cut-off).
- Type 1 error = concluding an engine is better in head-to-head competition when there is actually no difference.
The chance of making a type 1 error increases with the number of comparisons you make. If you conduct C comparisons,
the odds of making at least one type 1 error are 1 - (1 - alpha)^C (^ = raised to the power of); see the worked
sketch after these notes.
- It is critical that you choose your sample size ahead of time and do not draw any conclusions until you have
run the full tournament. It is statistically incorrect to watch the tournament as it runs, wait until an engine
reaches a cut-off, and then stop the tournament.
- The values in the table assume that you are testing a directional hypothesis, e.g., that engine A does better
than B. If you have no idea which engine might be better, then your hypothesis is non-directional and you must
double the alpha rate. This means that if you select the 0.05 criterion and you have a non-directional hypothesis,
you are in fact using a 0.1 criterion, and if you choose the 0.01 criterion, you are using a 0.02 criterion. I
recommend using at least the 0.01 criterion in these instances, and preferably the 0.001 criterion.
- Even if you get a significant result, it may not generalize well to future tests. One important question is: to
what extent are the openings you used in your test representative of the openings the engine would actually play?
I think there is no way to get a representative sample of opening positions with only, say, ten openings; you
probably need at least 50 different openings. If you are going to use a particular opening book with an engine,
it would be ideal to sample a fair number of different openings from that book.
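For anyone who wants to reproduce or extend the table, here is a minimal Python sketch of the Monte Carlo procedure
described in the notes. It assumes the stated 0.32 draw rate, 10,000 simulated matches per table entry, and a
directional (one-sided) test; the function names and the exact set of match lengths printed at the end are my own
choices for illustration, not part of the original post.

```python
import random
from bisect import bisect_left

DRAW_RATE = 0.32    # assumed draw rate, as stated in the notes
N_TRIALS = 10_000   # simulated matches per table entry

def simulate_null_scores(n_games, draw_rate=DRAW_RATE, n_trials=N_TRIALS, seed=1):
    """Simulate n_trials matches of n_games between two equally strong engines.

    Under the null hypothesis every game is a win, draw or loss for engine A
    with probabilities ((1 - draw_rate) / 2, draw_rate, (1 - draw_rate) / 2).
    Returns a sorted list of A's match scores (1 per win, 0.5 per draw).
    """
    rng = random.Random(seed)
    win_p = (1.0 - draw_rate) / 2.0
    scores = []
    for _ in range(n_trials):
        score = 0.0
        for _ in range(n_games):
            r = rng.random()
            if r < win_p:
                score += 1.0
            elif r < win_p + draw_rate:
                score += 0.5
        scores.append(score)
    scores.sort()
    return scores

def cutoff_percentage(scores, n_games, alpha):
    """Smallest score (as a percentage) reached by chance no more than alpha of the time.

    This is the directional (one-sided) test described in the notes: we only
    ask how often an equal opponent would reach at least this score by luck.
    """
    n_trials = len(scores)
    for s in sorted(set(scores)):
        exceedance = (n_trials - bisect_left(scores, s)) / n_trials
        if exceedance <= alpha:
            return 100.0 * s / n_games
    return 100.0  # only a perfect score is significant at this alpha

if __name__ == "__main__":
    for n in (10, 50, 200):
        scores = simulate_null_scores(n)
        row = [cutoff_percentage(scores, n, a) for a in (0.05, 0.01, 0.001)]
        print(f"{n:5d} games: " + "  ".join(f"{c:5.1f}%" for c in row))
```

Because the scores are simulated rather than computed exactly, the cut-offs will wobble by a few tenths of a percent
from run to run, just as the first note warns.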
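To make the multiple-comparisons note concrete, here is a small sketch of the 1 - (1 - alpha)^C formula; the function
name is my own, chosen for illustration.

```python
def familywise_error(alpha, comparisons):
    """Probability of at least one type 1 error across independent comparisons."""
    return 1.0 - (1.0 - alpha) ** comparisons

# For example, running 10 separate matches at the 5% level gives roughly a
# 40% chance of at least one spurious "significant" result:
print(familywise_error(0.05, 10))   # ~0.401
```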