I’ve gotten a lot of positive feedback on my last article, which I really appreciate. Jon Finkel even commented, which was cool and unexpected. The primary criticisms I heard were that a lot of the hands I counted in the Monte Carlo simulation (MCS) were lower quality than the hands I’d get after hitting with the example hand, and that I needed to better distinguish hand quality. For example, my analysis weighted a six-land hand with Tron equally to the example hand hitting Tron, even though the latter is better. This is a key simplification in my analysis, rooted in how I approach mulliganing in general, but I didn’t want to dive too deeply into technical details since my mulligan system wasn’t the focus of the article.

My goal in this article is to better explain my approach to mulliganing, how to use that approach, and why I use it.

Mulliganing

When you’re deciding whether to mulligan, what you’re really trying to figure out is whether you improve your chances of winning by mulliganing rather than keeping. But even after playing the same deck for hundreds or even thousands of games, you’ll probably never see the exact same hand twice, let alone enough times to form a meaningful sample. This variability is why Magic is so replayable, but it makes rigorous statistical reasoning difficult.

Mulliganing itself is also an inherently complicated process. When you mulligan, you lose a resource, get a new set of cards, and get the opportunity to mulligan again. Hypothetically, even if you had a database of every game of Magic ever played, the starting hands in those games, and the outcomes, finding an optimal mulligan policy would still be incredibly difficult. You’d have to work out how often you win with your starting hand, how often you’ll win with the six-card hands you’ll keep, how often you should keep your six-card hands, and so on. And what if you were playing a deck nobody had ever played before?

I grew up reading PVDDR’s articles on the subject, and his favorite format was polling different players for their opinions on difficult hands to show how many different factors can go into deciding when to mulligan. Even the best players still typically agree to disagree on difficult hands. This isn’t because some players are better or worse at mulliganing, but because mulliganing is such an infinitely complicated puzzle. Two of PV’s articles can be found here and here.

Over time, I’ve developed a personal system for mulliganing that is more transparent and verifiable, and leads to more productive discussions.

Good and Bad Draws

When I pick up a deck, I start by developing a mental classification system of “good draws” and “bad draws.” A draw includes a starting hand and a couple of draw steps, and a “good draw” executes the game plan I expect of my deck. This system might sound simplistic, but it should encode your understanding of the format, your deck, and your game plan.

Your standards should be higher for Modern than for Limited, for instance. In Limited, “good draws” for typical decks should be three to six lands and any reasonable assortment of spells. In Modern, a “good draw” should contain some of your best cards and solid support.

Your notion of a “good draw” should change depending on what deck you’re playing as well. You should have higher standards for an aggressive deck than a controlling deck. If your deck doesn’t have many 2-drops, then you should be less insistent on having a 2-drop, but value the draws that do feature one more highly. In Constructed, if you’re playing a deck that features a lot of 1-for-1 exchanges like discard or removal, then most of your draws will be functionally very similar. If you’re playing a synergistic deck, then specific cards distinguish a draw.

This system is useful because it allows you to focus on finding the threshold between good draws and bad draws. When discussing your classifications with your friends, you can point to specific draws and ask whether they would consider those draws “good” or “bad,” then discuss why. If you disagree, then you can identify the class of draws that you disagree about. For example, perhaps your friend thinks that any draw that doesn’t have a 1-drop is unacceptable with an aggressive deck, but you default to keeping a wider range of hands. Then if you find that you win more often with good seven-card hands without a 1-drop than with six-card hands featuring a 1-drop, that’s meaningful evidence. The key is isolating a sufficiently large set of hands that you and your friend disagree on where you can measure the difference.

Tracking these statistics manually is a lot of work, and I don’t suggest it unless you actively enjoy it. But every time you have a draw that you think you and your friend would disagree on, make a note of it and the outcome of the game. Maybe take a screenshot to discuss with your friend later. Add the result to your mental scoreboard. Over time, just by paying attention to the hands you and your friend disagree on, you’ll improve your understanding of the right threshold.

If you respect your friend’s opinion, or consider them a better player than yourself, then you can also just take your friend’s disagreement as a strong signal in itself.

The long-term goal isn’t to find perfect classification systems for particular decks but to be able to better intuit and develop the right threshold for a new deck based on your body of knowledge. At the same time, the best way to train your intuition is to find good classification systems for a wide variety of decks.

Probability of a Good Draw

Once you have a clear sense of what comprises a good draw, you can look at an opening hand and calculate the probability of getting a good draw from that hand. With a well-constructed classification system and deck, you’ll always get a good draw from a lot of seven-card hands. That’s fine. Those are the easy keeps.

With borderline hands, however, a clear understanding of what constitutes a good draw helps you count your outs. With that count, you can calculate the probability that you hit. If the number of outs is constant from draw step to draw step, you can use a hypergeometric calculator with K = the number of outs, N = the number of cards remaining in your library, n = the number of draws, and k = 0. You want 1 minus the probability mass function at k = 0, which is the probability of drawing at least one out.
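If you’d rather script this than use an online calculator, here is a minimal Python sketch of the same calculation. The function name `p_hit` is my own, and it assumes the number of outs stays constant across draws and that a single out is enough to count as a hit.

```python
from math import comb

def p_hit(K: int, N: int, n: int) -> float:
    """Probability of drawing at least one of K outs in n draws from N cards."""
    if not (0 <= K <= N and 0 <= n <= N):
        raise ValueError("need 0 <= K <= N and 0 <= n <= N")
    # Hypergeometric probability of k = 0 outs:
    # all n cards must come from the N - K non-outs.
    p_miss = comb(N - K, n) / comb(N, n)
    return 1 - p_miss

# 15 outs left in a 33-card library, two draw steps.
print(round(p_hit(K=15, N=33, n=2), 4))  # 0.7102
```

Only the standard library is used here; a statistics package would work just as well if you already have one installed.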

At the table, a reasonable approximation is 1 – (1 – K / N)^n, rounding as necessary. For example, if K = 15 and N = 33 (a typical case for a two-lander in Limited), then 1 – K / N = 18/33, which you can round to 1/2. Then for n = 2, your chances of hitting are roughly 1 – (1/2)^2 = 3/4 = 0.75. If you want to go a step further, you can recognize that 1/2 is slightly less than 18/33, so this is a slight overestimate. The actual probability from the hypergeometric distribution is 0.7102.
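To see how far the shortcut drifts from the exact answer, the following sketch compares the two for the example above. The function names are hypothetical, and the same constant-outs assumption applies.

```python
from math import comb

def p_hit_exact(K: int, N: int, n: int) -> float:
    """Exact hypergeometric chance of at least one out in n draws."""
    return 1 - comb(N - K, n) / comb(N, n)

def p_hit_approx(K: int, N: int, n: int) -> float:
    """Table shortcut: treats every draw as if the library never shrinks."""
    return 1 - (1 - K / N) ** n

K, N, n = 15, 33, 2
print(round(p_hit_approx(K, N, n), 4))  # 0.7025, shortcut without rounding 18/33
print(round(1 - (1 / 2) ** n, 4))       # 0.75, shortcut after rounding 18/33 to 1/2
print(round(p_hit_exact(K, N, n), 4))   # 0.7102, exact
```

Without the mental rounding, the shortcut lands slightly below the exact figure, because it ignores that each miss makes the next draw a little more likely to hit. The rounding to 1/2 is what pushes the table estimate above it.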

If the number of outs varies, you’ll have to use reasoning similar to the math outlined in my last article.
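I won’t reproduce that math here, but one general way to handle outs that change from draw to draw is to multiply the chance of missing on each draw and subtract the product from 1. The sketch below is only an illustration of that idea, not the method from the last article; the helper `p_hit_varying` and the sample numbers are hypothetical.

```python
def p_hit_varying(outs_per_draw, library_size):
    """Chance of hitting at least once when the out count differs per draw.

    outs_per_draw[i] is the number of outs for draw i, given that every
    earlier draw missed. The library shrinks by one card per draw.
    """
    p_miss = 1.0
    for i, outs in enumerate(outs_per_draw):
        cards_left = library_size - i
        if not 0 <= outs <= cards_left:
            raise ValueError("outs must be between 0 and the cards left")
        p_miss *= 1 - outs / cards_left
    return 1 - p_miss

# Constant outs reproduce the hypergeometric answer.
print(round(p_hit_varying([15, 15], 33), 4))  # 0.7102
# Hypothetical case: more cards count as outs on the second draw.
print(round(p_hit_varying([15, 19], 33), 4))
```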

As you practice with these calculations, you’ll develop a better sense for what the numbers are for typical numbers of outs, draws, and deck sizes. You could also make a reference spreadsheet, if you’re into spreadsheets. (Here’s an example.)
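If you’d rather generate a reference table than build one by hand, a short script can do it. This is a sketch under my own assumptions: a 33-card library (a 40-card Limited deck after a seven-card hand), an arbitrary selection of out counts, and up to four draws.

```python
from math import comb

def p_hit(K: int, N: int, n: int) -> float:
    """Exact chance of at least one of K outs in n draws from N cards."""
    return 1 - comb(N - K, n) / comb(N, n)

def reference_table(N, outs_options, max_draws):
    """One row per out count: (K, [p_hit for n = 1..max_draws])."""
    return [
        (K, [p_hit(K, N, n) for n in range(1, max_draws + 1)])
        for K in outs_options
    ]

N = 33  # assumed library size
rows = reference_table(N, outs_options=[4, 8, 12, 15], max_draws=4)

# Print a small text table: rows are out counts, columns are draw counts.
print("outs  " + "  ".join(f"n={n}" for n in range(1, 5)))
for K, probs in rows:
    print(f"{K:>4}  " + "  ".join(f"{p:.2f}" for p in probs))
```

Change `N` and `outs_options` to match the deck sizes and out counts you run into most often.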

Mulligan Threshold

Lastly, you’ll want to develop a threshold for hands to mulligan. You’ll mulligan hands that are less likely to lead to a good draw than your threshold, and keep hands that are more likely. The higher your threshold, the pickier you are.

This threshold essentially encodes how valuable a card is, both to the deck you’re piloting and in the format you’re playing. The same principles that determine what constitutes a good hand also apply to deciding mulligan thresholds. Fast formats and decks should have higher thresholds and grindy formats and decks should have lower thresholds, for instance.

A good threshold depends on how selective your classification system is, but useful rules of thumb are t = 0.4 in Limited and t = 0.6 in Constructed.

My last article focused specifically on this part of my process. I was using MCS to better decide what my mulligan threshold for Eldrazi Tron should be. My simulations led me to raise my threshold not just for Eldrazi Tron, but across the board.
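Put together, the keep-or-mulligan rule is a single comparison. Here is a minimal sketch using the rules of thumb above. The names are hypothetical, and I’m assuming that a hand sitting exactly at the threshold is a keep and that drawing one out is what turns the hand into a good draw.

```python
from math import comb

# Rules of thumb from the text; tune them to your own classification system.
THRESHOLDS = {"limited": 0.4, "constructed": 0.6}

def p_good_draw(K: int, N: int, n: int) -> float:
    """Chance of at least one of K outs in n draws from N cards."""
    return 1 - comb(N - K, n) / comb(N, n)

def should_keep(p_good: float, threshold: float) -> bool:
    """Keep when the chance of a good draw meets the threshold (ties keep)."""
    return p_good >= threshold

# Two-lander in Limited: 15 lands left in a 33-card library, two draw steps.
p = p_good_draw(K=15, N=33, n=2)
print(should_keep(p, THRESHOLDS["limited"]))  # True, since 0.71 >= 0.4
```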

Advantages and Disadvantages

Again, the primary disadvantage of my system is that it doesn’t distinguish hand quality, as several commenters (including JFM) noted. Mathematically, I’m essentially using an indicator function as my utility metric: a good draw is worth 1 and a bad draw is worth 0. But this simplification dramatically reduces the computational difficulty while preserving a lot of key complexities. You reduce the problem of projecting win rate to tuning your classification systems well and remaining cognizant of their shortcomings.

Hypothetically, you could use a proper utility function that maps each hand to the real line, corresponding to its quality. This would be the next level. I’m not sure how to best construct a function like that though.
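To make the contrast concrete, here is a small sketch of what changes when a good-or-bad utility is swapped for a graded one. Everything in it is hypothetical: the three outcome labels, their probabilities, and especially the graded values, which are placeholders rather than a proposal for how to score hands.

```python
def expected_utility(outcomes, utility):
    """outcomes: list of (probability, draw) pairs; utility: draw -> float."""
    return sum(prob * utility(draw) for prob, draw in outcomes)

# Hypothetical distribution over how a hand turns out.
outcomes = [(0.5, "strong"), (0.2, "passable"), (0.3, "bad")]

def indicator(draw):
    """Good-or-bad utility: every good draw counts the same."""
    return 1.0 if draw in {"strong", "passable"} else 0.0

def graded(draw):
    """Placeholder graded utility; the values are made up for illustration."""
    return {"strong": 1.0, "passable": 0.6, "bad": 0.0}[draw]

# With the indicator, expected utility is just the chance of a good draw.
print(expected_utility(outcomes, indicator))
# A graded utility gives weaker good draws less credit.
print(expected_utility(outcomes, graded))
```

The hard part is not the arithmetic but choosing defensible values for the graded function.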

Conclusion

My system has two primary parameters: draw quality threshold and mulligan threshold. Once you set those thresholds, the calculations all follow naturally. While my system certainly isn’t perfect, it’s helped me think about my decisions more clearly and guided my discussions with my friends.

Ultimately, was the example hand from the last article a keep or a mulligan? It’s certainly possible that I counted too many weak hands in my simulation and raised my threshold too much. At the same time, I think most commenters significantly underestimated how valuable the opportunity to mulligan again is. The right decision isn’t clear. But I hope it’s clearer why I asked the questions I did.