Two players are playtesting a Magic matchup between deck A and deck B. After 10 games, deck A has won 7 and lost 3.
“All right, the matchup is 70% in my favor. We’re done here,” player A jubilantly says.
“I don’t think so,” player B interjects, tossing a statistics textbook on the table. “Sure, 70% is the best estimate based on our games, but that sample size is way too small to make claims so confidently.”
“Well, how many games do we actually need?” player A questions.
Let’s run the numbers!
The Binomial Distribution
If you test a matchup for n independent games and your game win probability p is constant, then the number of wins in that set of games is described by a binomial distribution. It’s a classic in probability theory, and there are several convenient online calculators available.
Using such a calculator, we can, for example, determine that if your actual probability of winning a single game is 0.5 (i.e., 50%) and you play 10 games, then the probability of winning 7 of them is 0.117 (i.e., 11.7%). The probability of winning at least 7 games is 0.172 (i.e., 17.2%). Hence, lopsided results will regularly happen by chance.
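If you would rather not rely on an online calculator, the same numbers can be reproduced in a few lines. This is a minimal sketch using only the Python standard library; the function names `binom_pmf` and `binom_tail` are my own and not from any particular package.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k wins in n independent games with win probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_tail(k, n, p):
    """Probability of at least k wins in n independent games."""
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))

# An even matchup (p = 0.5) over a 10-game set
print(round(binom_pmf(7, 10, 0.5), 3))   # 0.117
print(round(binom_tail(7, 10, 0.5), 3))  # 0.172
```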
In practice, the value of the game win probability p is unknown. Our aim is to estimate its value (and the uncertainty around that value) by playtesting games and recording the results.
A note on terminology: Throughout the majority of this article, I will consider “games” under the assumption that the game win probability remains constant. That is, there are no play-draw dependencies and no sideboards. Every playtest session can be modeled as a binomial distribution. If you record pre-board and post-board results and are interested in matches, then you can construct a match win probability formula that even takes into account play-draw dependencies, but the determination of its statistics goes beyond the scope of this article. But if you record match results only and substitute “game” for “match” throughout this article, then all insights will remain valid. This is also true if the opposing deck is changed to a fixed metagame of possible matchups.
Hypothesis Testing on a Parameter
Sometimes we are only interested in one thing: Are we favored in the matchup or not? To statistically test this in a one-tailed way (i.e., assuming that we are definitely not unfavored) we can set up a null hypothesis saying that the matchup is even (i.e., p=0.5) and an alternative hypothesis saying that we are favored (i.e., p>0.5). No claims are made regarding the a priori likelihood of these two hypotheses or on how favored we might be.
Then, on the basis that the null hypothesis is true, we can calculate the probability of seeing a result that is at least as extreme as what we observed. This is what statisticians refer to as a p-value, and you may loosely think of it as the probability that “just variance” would produce a result this lopsided in an even matchup. If this probability is low, then it can be interpreted as evidence against the null hypothesis. In our case, we won 7 out of 10 games, and as we already determined, the corresponding probability of winning at least 7 games is 0.172.
Now, is a probability of 0.172 unlikely enough? Based on the customary threshold of 0.05 or lower, the answer would be no—the observation is consistent with the null hypothesis. In other words, since 0.05<0.172, by the usual standards of statistical significance we cannot reject the null hypothesis. It’s akin to assuming a suspect is innocent because guilt hasn’t been proven beyond a reasonable doubt.
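As a rough sketch of this test, the snippet below computes the one-tailed p-value for a 7-3 result and then asks how many wins in a 10-game set would be needed to clear the 0.05 threshold. The names `one_tailed_p_value` and `ALPHA` are illustrative choices of mine.

```python
from math import comb

ALPHA = 0.05  # customary significance threshold

def one_tailed_p_value(wins, games, p0=0.5):
    """P(at least `wins` wins in `games` games) if the true win probability is p0."""
    return sum(
        comb(games, k) * p0**k * (1 - p0)**(games - k)
        for k in range(wins, games + 1)
    )

p_value = one_tailed_p_value(7, 10)
print(f"p-value for 7-3: {p_value:.3f}")           # 0.172
print("reject null hypothesis:", p_value <= ALPHA)  # False

# Smallest number of wins in 10 games that would be significant at ALPHA
min_wins = next(w for w in range(11) if one_tailed_p_value(w, 10) <= ALPHA)
print("wins needed out of 10:", min_wins)           # 9
```

Under these assumptions, even an 8-2 result (p-value of about 0.055) falls just short of the customary threshold.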
TL;DR: A 7-3 result is not enough to conclude we are favored with statistical significance.
Hypothesis Testing on a Difference
Hypothesis testing is also useful when we wish to compare the test results for two different decks. Take, for instance, the Standard win rates reported in the latest B&R update. Although sample sizes weren’t given, we can estimate them, or at least their order of magnitude, based on how many 5-0 trophies are held in the Magic Online competitive Standard league after a certain number of days. Taking some leeway, let’s suppose that B/R Aggro had 5,190 wins (51.9%) over 10,000 matches and that G/b Steel Leaf had 1,114 wins (55.7%) over 2,000 matches.
Is this a significant difference? Given that the sample sizes are large enough for a normal approximation, we can do a z-test in a two-tailed way. The null hypothesis would be that B/R Aggro has the same win rate as G/b Steel Leaf, and the alternative hypothesis would be that they are different. Using one of the many online calculators available, we find that if the null hypothesis were true, then the probability of observing a difference at least as extreme as what we observed (a.k.a. the p-value) is extremely low (roughly 0.002), far below any reasonable threshold.
Hence, we reject the null hypothesis and conclude that there is indeed a statistically significant difference between the two.
But if the sample sizes were 1/10th the size, then a 3.8% win rate difference between B/R Aggro and G/b Steel Leaf over the resulting total of 1,200 games would not be statistically significant. (The corresponding p-value would be 0.33.) This already provides some intuition regarding the extremely large number of games necessary to detect a small difference in win rates, and it also implies that under customary thresholds for statistical significance, you can’t realistically decide the last few cards in your deck purely based on playtest results.
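For readers who want to check these p-values themselves, here is a minimal sketch of a pooled two-proportion z-test using only the standard library. I am assuming the pooled-variance version of the test, which is the common textbook form; an online calculator may use a slightly different variant. The function name and its signature are my own.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(rate_a, n_a, rate_b, n_b):
    """Two-tailed z-test for equal win rates, using the pooled win rate.

    Takes observed win rates and sample sizes; returns (z, p_value).
    """
    pooled = (rate_a * n_a + rate_b * n_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_a - rate_b) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Supposed sample sizes from the text
z_full, p_full = two_proportion_z_test(0.519, 10_000, 0.557, 2_000)
# Same win rates with one tenth of the matches
z_small, p_small = two_proportion_z_test(0.519, 1_000, 0.557, 200)

print(f"full sample:  z = {z_full:.2f}, p = {p_full:.4f}")
print(f"tenth sample: z = {z_small:.2f}, p = {p_small:.2f}")  # p is about 0.33
```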
If you are at the beginning of a playtest session where you hope to detect a certain win rate difference between two decks (against a fixed opposing deck) and can decide how many games to play, then you can calculate a minimum required sample size for each deck. It depends on a lot of factors, but if we desire a 95% confidence level, assume that the win rate of one deck is 0.5, consider a two-tailed hypothesis, use equal sample sizes for each deck, and desire an 80% probability of correctly rejecting a false null hypothesis, then an online calculator whose results matched the ones from my old statistics textbook prescribes 2713 games with each deck (i.e., 5426 total) to detect a 3.8% win rate difference.
To detect an estimated 10% win rate difference, which is already humongous in Magic terms, you should plan to play 387 games with each deck (i.e., 774 total). Hope you have a few weeks available.
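The per-deck figures can be approximated with a standard textbook formula for comparing two proportions with equal sample sizes. The sketch below is an assumption on my part about what such a calculator computes, not a description of any specific tool, and the function name `games_per_deck` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def games_per_deck(p1, p2, alpha=0.05, power=0.80):
    """Approximate games needed with each deck to detect win rates p1 vs p2.

    Two-tailed test at significance level alpha, equal sample sizes,
    and the given probability (power) of rejecting a false null hypothesis.
    Returns an unrounded value; round as you see fit.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return numerator / (p1 - p2) ** 2

print(f"3.8% difference: {games_per_deck(0.5, 0.538):.1f} games per deck")
print(f"10% difference:  {games_per_deck(0.5, 0.6):.1f} games per deck")
```

The unrounded results land just under 2,713 and just over 387, in line with the figures quoted above. Depending on whether a tool rounds to the nearest integer or always rounds up, the second figure may be reported as 387 or 388.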
TL;DR: Given that the difference in win rates between two good decks is usually no more than a few percent, you need many hundreds or even thousands of games to detect such a difference with statistical significance.
Confidence Intervals
The probabilistic abstractions in hypothesis tests can be difficult to interpret. A more informative and more easily presented estimate is a confidence interval: a range of values, based on the observed data, that is likely to contain the unknown population parameter. In the case of a single matchup, the unknown population parameter is the true probability p of winning a game, and the confidence interval will be based on the number of games n, the number of observed wins w, and the specified confidence level. The confidence level (commonly chosen to be 95%) represents the confidence, loosely speaking, that any such interval will contain the true value of p.
For technical reasons, there is no one “perfect” way of defining such a 95% confidence interval. The discrete nature of the binomial distribution complicates things, and it depends on how you interpret the notion of a confidence level. Numerous academic papers have been written on this topic, but for the purpose of this article, I’ll just introduce two of the most commonly used methods.
The first one, based on the central limit theorem, is a normal approximation, and it is given by w/n ± 1.96 * sqrt[w/n*(1-w/n)/n]. It is relatively easy to calculate and, given the matchup values typically seen in Magic, many textbooks would call it a good approximation when the number of games n is 20 or more. But it can be an inaccurate and aberrant estimator when we play fewer than 20 games, which happens all too often in playtesting.
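The normal approximation formula is simple enough to write out directly. This is a minimal sketch with a function name of my own choosing; the 10-0 case at the end illustrates one way the method can misbehave on small samples.

```python
from math import sqrt

def normal_approx_interval(wins, games, z=1.96):
    """Normal-approximation confidence interval: w/n +/- z * sqrt(w/n * (1 - w/n) / n)."""
    p_hat = wins / games
    margin = z * sqrt(p_hat * (1 - p_hat) / games)
    return p_hat - margin, p_hat + margin

low, high = normal_approx_interval(7, 10)
print(f"7-3:  {low:.3f} to {high:.3f}")   # 0.416 to 0.984

# With a perfect record the estimated variance is zero, so the interval
# collapses to a single point, which is clearly too optimistic.
low, high = normal_approx_interval(10, 10)
print(f"10-0: {low:.3f} to {high:.3f}")   # 1.000 to 1.000
```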
The second one, called the Clopper–Pearson method, is based on the actual binomial distribution. Although its intervals are usually unnecessarily wide because the method conservatively guarantees minimum coverage over the parameter space, it is suitable for small sample sizes. Statistical packages or online tools are generally used to run the calculations. Using such an online tool for our example with w=7 and n=10, we can find that the 95% confidence interval ranges from 0.348 to 0.933.
This means that after our 7-3 result, we are at least 95% confident that the true game win probability p lies between 0.348 and 0.933. That’s an interval length of 0.585 units—a pretty darn wide range. And it doesn’t shrink that much if we switch to a more lenient method or a 90% confidence level. If we want to substantially increase precision, we simply need to play more games.
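For the curious, a Clopper–Pearson interval can also be computed without a statistics package. The sketch below finds the two endpoints by bisection on the binomial tail probabilities; statistical packages typically use Beta distribution quantiles instead, which give the same answer. All names here are my own.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(at most k wins in n games) for win probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def _solve_increasing(f, iterations=60):
    """Find p in [0, 1] with f(p) = 0, assuming f is increasing in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(wins, games, confidence=0.95):
    """Exact (Clopper-Pearson) confidence interval for the win probability."""
    tail = (1 - confidence) / 2
    # Lower end: the p for which P(at least `wins` wins) equals tail
    lower = 0.0 if wins == 0 else _solve_increasing(
        lambda p: (1 - binom_cdf(wins - 1, games, p)) - tail)
    # Upper end: the p for which P(at most `wins` wins) equals tail
    upper = 1.0 if wins == games else _solve_increasing(
        lambda p: tail - binom_cdf(wins, games, p))
    return lower, upper

low, high = clopper_pearson(7, 10)
print(f"95% interval after 7-3: {low:.3f} to {high:.3f}")
low90, high90 = clopper_pearson(7, 10, confidence=0.90)
print(f"90% interval after 7-3: {low90:.3f} to {high90:.3f}")
```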
TL;DR: After a 7-3 result, we can be at least 95% confident that the true game win probability lies between 0.348 and 0.933.
What is a Suitable Sample Size for Confidence Intervals?
It depends on the desired confidence level, on a potential prior estimate of p, and on the chosen method for determining intervals. But if we conservatively assume that p and thus w/n is close to 0.5 (which maximizes the required sample size but represents a reasonable initial estimate for non-lopsided matchups), use the normal approximation (which is reasonable when we know the planned sample size will be large), and desire 95% confidence (as usual), then we can solve the corresponding 95% confidence interval length formula for n.
Under the above-stated assumptions, this interval length will be approximately equal to 2*1.96*sqrt(0.5*(1-0.5)/n). If we replace 1.96 by 2 for ease of presentation and specify that we want an interval no wider than L units, the result is strikingly simple: we need a sample size of approximately n=4/(L^2).
What does that mean in practice? Let’s plug in some values. If we want a margin of error of +/- 10%, i.e., a 95% confidence interval for p of no more than L=0.20 units wide, then this formula for n states that we need to play approximately 100 games. If, for instance, we would play those 100 games, win 60, lose 40, and use the normal approximation to determine the resulting interval, we would conclude with 95% confidence that the true game win probability p lies between 0.50 and 0.70. That’s still quite wide, even after playing 100 games!
The number of games required for smaller intervals is staggering. If we want a margin of error of +/- 5%, i.e., a 95% confidence interval no more than 0.10 units wide, then we need to play approximately 400 games. Note that this is similar to the minimum number of games required for one deck to detect a 10% win rate difference with another, as we found earlier.
If we want a margin of error of +/- 2%, i.e., a 95% confidence interval no more than 0.04 units wide, then we need to play approximately 2500 games! To put that into perspective: if you could test 6 games per hour and play 14 hours a day, you would need an entire month to finish that many games.
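To make the rule of thumb concrete, here is a small sketch that solves the interval length formula 2*z*sqrt(p*(1-p)/n) = L for n. The function name is mine; the default of z=2 mirrors the simplification above, and passing z=1.96 shows how little the rounding matters.

```python
def games_for_interval_width(width, z=2.0, p=0.5):
    """Games needed so that 2 * z * sqrt(p * (1 - p) / n) equals `width`."""
    return (2 * z) ** 2 * p * (1 - p) / width ** 2

for width in (0.20, 0.10, 0.04):
    simple = games_for_interval_width(width)           # z rounded to 2: n = 4 / L^2
    exact = games_for_interval_width(width, z=1.96)    # unrounded z
    print(f"L = {width:.2f}: about {simple:.0f} games (or {exact:.0f} with z = 1.96)")

# Time needed for 2500 games at 6 games per hour, 14 hours a day
days = 2500 / (6 * 14)
print(f"{days:.1f} days of playtesting")  # 29.8 days
```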
As a closing remark: if you have information that the value of p is far from 0.5, either from a preliminary sample or from past experience, then you could use your estimate in the corresponding confidence interval length formula and ultimately obtain a smaller required sample size (because variance is maximized for p=0.5). But if I have such strong priors, then I would rather use a different perspective altogether.
TL;DR: If you want a 95% confidence interval for the true game win probability no wider than L units, you should play approximately 4/(L^2) games.
A Bayesian Approach
From a Bayesian perspective, the true game win probability p is considered to be a random variable, with a certain prior distribution. This contrasts with the previous approaches that made no assumption about a distribution for p. If we have such a prior distribution and observe the outcome of a playtest session, then we can use Bayes’ rule to obtain a posterior distribution.
Often, a Beta distribution is used as a prior on p because it has the appealing property that the posterior distribution is in the same family. But it can be hard to interpret the shape parameters of a Beta distribution, and I’ve found it more educational to introduce the concept of Bayesian updating with a simple discrete example.
Let’s suppose that you are interested in the game win probability of your deck in a certain matchup, and you ask 10 members of your playtest team to provide an initial estimate. They can base this on past experience and/or an analysis of the deck lists. Often, Magic players will have at least some rough idea. Suppose that one team member would give 0.40 as their estimate, five would say 0.60, and four would say 0.80.
This would lead to the following prior:
- With probability 10%, p is equal to 0.4
- With probability 50%, p is equal to 0.6
- With probability 40%, p is equal to 0.8
This discrete distribution has an expected value of 0.66, indicating an overall initial belief that we are probably heavily favored in the matchup.
Now suppose that we play a 10-game set and win 7 of them. The likelihood of this happening for each candidate value of p in our prior can be found via the binomial distribution:
- If p is actually 0.4, then we’d go 7-3 with probability 4.2%.
- If p is actually 0.6, then we’d go 7-3 with probability 21.5%.
- If p is actually 0.8, then we’d go 7-3 with probability 20.1%.
On the whole under our prior distribution, we’d go 7-3 with probability 10%*4.2% + 50%*21.5% + 40%*20.1% = 19.2%.
Using all this, we can update to the posterior distribution with Bayes’ rule. Naturally, our belief that the real value of p is 0.6 or 0.8 will increase:
- The probability that p is 0.4, given our prior and the 7-3 result, is 10% * 4.2% / 19.2% = 2.2%.
- The probability that p is 0.6, given our prior and the 7-3 result, is 50% * 21.5% / 19.2% = 55.9%.
- The probability that p is 0.8, given our prior and the 7-3 result, is 40% * 20.1% / 19.2% = 41.9%.
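The whole update above fits in a few lines of code. This is a minimal sketch of the discrete example, with the prior stored as a dictionary; the function names are illustrative choices of mine.

```python
from math import comb

# Prior from the playtest team: candidate win probability -> prior weight
prior = {0.4: 0.10, 0.6: 0.50, 0.8: 0.40}

def likelihood(p, wins, games):
    """Binomial probability of exactly `wins` wins in `games` games."""
    return comb(games, wins) * p**wins * (1 - p)**(games - wins)

def bayes_update(prior, wins, games):
    """Return (posterior, evidence) after observing the given record."""
    joint = {p: weight * likelihood(p, wins, games) for p, weight in prior.items()}
    evidence = sum(joint.values())  # overall probability of the observed record
    posterior = {p: value / evidence for p, value in joint.items()}
    return posterior, evidence

posterior, evidence = bayes_update(prior, wins=7, games=10)
print(f"P(7-3 under the prior) = {evidence:.1%}")  # 19.2%
for p, weight in posterior.items():
    print(f"P(p = {p}) = {weight:.1%}")             # 2.2%, 55.9%, 41.9%

prior_mean = sum(p * w for p, w in prior.items())
posterior_mean = sum(p * w for p, w in posterior.items())
print(f"expected win probability: {prior_mean:.2f} before, {posterior_mean:.2f} after")
```

Because the posterior is just another discrete distribution, it can be fed back in as the prior for the next playtest session.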
At least on a conceptual basis, this is how I generally view playtest results. I find this perspective intuitive and insightful because it adequately combines prior estimates and results from small samples.
You may not be able to draw sweeping conclusions with a 10-game sample, but it does convey some information. Given that we’re not in a courtroom where heavy proof is required to convict someone, I am happy to update my beliefs based on small observations and tiny sample sizes.
Unfortunately, it can be hard to use the Bayesian approach in a collaborative setting because so much relies on your subjective priors, and other people may not accept them. But I like the way of thinking.
TL;DR: If you have prior beliefs on the game win probability before playtesting, perhaps based on an analysis of the deck list, then you can update those beliefs in a Bayesian way after observing the playtest results.
Conclusion
The number of games required for statistical significance in playtesting a single matchup is extremely large, and your sample size is probably insufficient. To get the margin of error regarding the win rate down to +/- 10%, which could correspond to a 95% confidence interval for the true game win probability from 0.50 to 0.70, you would already need to play approximately 100 games. If you find that too wide and want a margin of error of, say, +/- 2%, then you’d better set aside an entire month for non-stop playtesting.
To be fair, statistical limits tend to be strict and only take into account the observed number of wins and losses. If you and your collaborators trust certain prior estimates on the game win probability, as well as playtest observations on how certain cards interact and how the matchup “feels,” then you probably need far fewer games to make confident claims about matchup percentages. And while results are useful, playtesting is never done with the sole aim of estimating the win rate as accurately as possible. Learning how a particular matchup works is more useful. Nevertheless, you have to realize that you’re making deck choices under considerable uncertainty—that’s just life as a Magic player.
I hope you enjoyed this basic analysis in statistical inference. While I took care to ensure accuracy, it’s a tricky topic that other people are more knowledgeable about than I am. If you spot an error, please let me know so that I can correct it. Thanks for reading.