HPS 0628 Paradox


Probability Theory Refresher

John D. Norton
Department of History and Philosophy of Science
University of Pittsburgh

If you are quite new to probability theory, this refresher might not be enough of an account to serve you well. It is written for those who already have a first exposure to probability theory as a reminder of what they have learned and also to help fix the language used in the other chapters.

Outcome Space

The fundamental arena in which all probability calculations proceed is the outcome space. It is the set of all possible outcomes presented in their most specific details. A common and simple example is the set of all possible die roll outcomes:

1, 2, 3, 4, 5, 6

Probabilities are then assigned to each of these outcomes and to sets of them.

More complicated cases arise when there are multiple dice rolled. The full outcome space for two dice rolls is of size 36:

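(1,1), (1,2), (1,3), (1,4), (1,5), (1,6),
(2,1), (2,2), (2,3), (2,4), (2,5), (2,6),
(3,1), (3,2), (3,3), (3,4), (3,5), (3,6),
(4,1), (4,2), (4,3), (4,4), (4,5), (4,6),
(5,1), (5,2), (5,3), (5,4), (5,5), (5,6),
(6,1), (6,2), (6,3), (6,4), (6,5), (6,6)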
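As a small illustration, the two-dice outcome space can be enumerated mechanically. This is only a sketch in Python; the names `faces` and `two_dice` are illustrative choices, not part of the text above.

```python
from itertools import product

# Faces of a single die.
faces = range(1, 7)

# Each ordered pair (first die, second die) is one most-specific outcome.
two_dice = list(product(faces, repeat=2))

print(len(two_dice))   # 36
print(two_dice[:3])    # [(1, 1), (1, 2), (1, 3)]
```

Treating the pairs as ordered is what makes (3, 4) and (4, 3) count as two distinct outcomes.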

In writing about these outcomes, there are two modes of presentation commonly used. In one, the outcomes are represented as sets or their elements. Thus we might have the set of even die rolls:

{2, 4, 6}

Alternatively, we might represent the same outcome as a proposition, that is, a statement that asserts what happened:

2 or 4 or 6

These two modes are, for our purposes, equivalent and I will switch between them as needed.

What complicates matters is when the outcome set is very large, such as when the outcome may be a single real number in some interval, say, 0 to 1. A probability measure is a special case of the additive measures defined in an earlier chapter. We saw there that we may not always be able to assign a measure to every subset of the outcome space. When this problem arises here, we specify which subsets of the full outcome space have probabilities assigned to them.

Mutual Exclusivity

Two outcomes or sets of outcomes are mutually exclusive when the occurrence of one precludes the occurrence of the other. It is essential that the most specific outcomes of the outcome space be mutually exclusive.

For example, on a die roll, the outcomes two and five are mutually exclusive. Outcomes of an even number, 2 or 4 or 6, and an odd number, 1 or 3 or 5, are mutually exclusive, since if one happens the other cannot.

The outcomes of an even number, 2 or 4 or 6, and a number less than four, 1 or 2 or 3, are not mutually exclusive. Both can happen if two is rolled.

While the idea of mutual exclusivity is straightforward, we shall see that many familiar paradoxes arise from not paying enough attention to it. In particular, we may misidentify probabilities through failure to recognize that two outcomes arise through different numbers of mutually exclusive cases.

For example, in two dice rolls, a sum of two can happen in only one way: a one and a one, (1,1). A sum of seven, however, can happen in six ways, that is, through six different mutually exclusive combinations:

(1,6), (2,5), (3,4), (4,3), (5,2), (6,1)

A roll summing to two has a different probability from one summing to seven.
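The count of mutually exclusive cases behind each sum can be checked by enumeration. The following is a minimal sketch, assuming fair dice so that all 36 ordered pairs are equally probable; the names `ways` and `p_sum` are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Count how many ordered pairs (die 1, die 2) produce each sum.
ways = Counter(a + b for a, b in product(range(1, 7), repeat=2))

def p_sum(total):
    """Probability of a given sum, assuming 36 equiprobable pairs."""
    return Fraction(ways[total], 36)

print(ways[2], ways[7])      # 1 6
print(p_sum(2), p_sum(7))    # 1/36 1/6
```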

It is easy to trip up in simple cases. Rolling a one and a one on two dice throws corresponds to one outcome: (1,1). Rolling a three and a four, however, corresponds to two possible, mutually exclusive outcomes: (3,4) and (4,3).


A probability is an additive measure. It is a special case of the additive measures defined in an earlier chapter. What makes it a special case is that it is a normed additive measure. That is, the measure assigned to the full outcome space is one. This would contrast with unnormed additive measures, such as areas. The total area of an infinite Euclidean plane is infinity, not unity. This extra characteristic of a unit norm is central to probability theory.

The precise conditions that express the additivity of a probability measure "P" are:

Sum rule: If outcomes A and B are mutually exclusive, then
P(A or B) = P(A) + P(B)

Unit norm:
If the entire outcome space is Ω, then P(Ω) = 1

Just as is the case with other additive measures, the summations can be chained, so that any finite number of mutually exclusive outcomes can be "or'ed" together. Again, as with other additive measures, summation can also be defined for a countable infinity of outcomes. It is an added condition that is logically independent of the finite additivity defined above.

Informally, additivity means that we can determine the probability of any outcome by adding up the probabilities of its mutually exclusive components:

P(1 or 3 or 5) = P(1) + P(3) + P(5)
     = 1/6 + 1/6 + 1/6
     = 1/2
P(2 or 4 or 6) = P(2) + P(4) + P(6)
     = 1/6 + 1/6 + 1/6
     = 1/2

Applying additivity again:
P(1 or 2 or 3 or 4 or 5 or 6)
     = P(1 or 3 or 5) + P(2 or 4 or 6) = 1/2 + 1/2 = 1

This last probability is one since it is the probability of the whole outcome space.

These last determinations depend on the values of the most specific outcomes (also called the "atoms") in the outcome space. Dice are constructed and rolled symmetrically with great care to ensure the equality of probability of each individual face:

P(1) = P(2) = ... = P(6) = 1/6
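The additivity calculations can be mirrored in a few lines of code. This is a minimal sketch, assuming a fair die with probability 1/6 on each atom; `P_atom` and `prob` are illustrative names.

```python
from fractions import Fraction

# Assumption: a fair die, so each atom has probability 1/6.
P_atom = {face: Fraction(1, 6) for face in range(1, 7)}

def prob(event):
    """Probability of a set of faces: the sum over its mutually exclusive atoms."""
    return sum((P_atom[face] for face in event), Fraction(0))

odd = {1, 3, 5}
even = {2, 4, 6}
print(prob(odd), prob(even))   # 1/2 1/2
print(prob(odd | even))        # 1
```

Exact fractions are used rather than floating point so that the unit norm comes out as exactly one.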

Nothing requires the probabilities of the most specific outcomes to be equal. That they are commonly so is in part an accident of history and in part a convenience. Probability theory was developed initially for games of chance. Each individual outcome in these games commonly has equal probability. The theory did not employ probability directly, but used a surrogate for it, a simple count of the number of mutually exclusive outcomes comprising the outcome of interest. We continue to use the example of equiprobable die rolls, coin tosses and card deals because of their simplicity.

If instead of using a perfectly symmetric, cubical die, we used an asymmetric die, then the most specific outcomes would no longer be equally probable. For example, consider what happens if we roll an asymmetric die such as the one shown:

The probability of a six is less than the probability of a one. To land showing a six, the die must come to rest, unstably, with the smaller face of one down. To land showing a one, however, it must come to rest with the more stable face of six down.

P(6) < P(1)


The notion of independence is as fundamental as the notion of mutual exclusivity. It is, roughly speaking, its opposite. If two outcomes are independent, then the occurrence of one does not affect the occurrence of the other. A simple and familiar example arises with the tossing of two dice: the outcome of the toss on the first does not affect the outcome of the second toss.

We will see that failure to appreciate just how independence appears leads to some interesting paradoxes in probability. Similarly, failure to recognize its absence can also lead to paradoxes.

Conditional Probabilities and the Product Rule

Independent outcomes

The product rule enables us to compute the probability of a conjunction--the "and"ing--of outcomes. The simplest case arises with independent outcomes:

Product rule for independent outcomes: If outcomes A and B are independent, then P(A and B) = P(A) x P(B)

This rule is routinely applied to easy cases in dice rolls. For outcomes on two different dice:

P(a on die 1 and b on die 2)
        = P(a on die 1) x P(b on die 2)
                 = 1/6 x 1/6 = 1/36

where a and b are any two particular faces.

It is essential here that the two outcomes are independent and that is assured since they are the results of rolls on different dice.
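The product rule for two dice can be checked against a direct count over the 36 equiprobable pairs. This is a sketch only; the faces `a = 3` and `b = 5` are arbitrary illustrative choices, and `prob` is a helper introduced here.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))  # 36 equiprobable pairs

def prob(event):
    """Probability of an event given as a predicate on (die 1, die 2)."""
    favorable = sum(1 for o in outcomes if event(o))
    return Fraction(favorable, len(outcomes))

a, b = 3, 5  # arbitrary faces, chosen only for illustration
p_joint = prob(lambda o: o[0] == a and o[1] == b)
p_first = prob(lambda o: o[0] == a)
p_second = prob(lambda o: o[1] == b)

print(p_joint, p_first * p_second)   # 1/36 1/36
```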

An example that will be important for several of the paradoxes is the probability of a sequence of many coin tosses. Consider the outcome of, say, five tails on successive tosses, followed by a head. Since the tosses are independent, we have:

P(T T T T T H)
= P(T) x P(T) x P(T) x P(T) x P(T) x P(H)
= 1/2 x 1/2 x 1/2 x 1/2 x 1/2 x 1/2 = 1/2^6 = 1/64

Outcomes that are not independent

If the two outcomes under consideration are not independent, then we need a new rule that employs a conditional probability. We interpret the expression "P(A|B)" to designate the probability that A occurs given that B has occurred. It is defined, when P(B) is not zero, as

Conditional probability: P(A|B) = P(A and B)/P(B)

In forming P(A|B) we are moving from one outcome space to another smaller outcome space. That is, we take the original outcome space and remove all those outcomes incompatible with B.

For example, the original outcome space might be the space of single die rolls:

1, 2, 3, 4, 5, 6

We conditionalize on

B=(the outcome is an even number).

The new outcome space is formed by removing all the odd numbered outcomes to leave us with a reduced space:

2, 4, 6

We can compute the conditional probabilities as follows:

P(2 | even) = P(2 | 2 or 4 or 6)
    = P(2 and (2 or 4 or 6)) / P(2 or 4 or 6)
    = P(2) / P(2 or 4 or 6)
                        since 2 and (2 or 4 or 6) = 2
    = (1/6) / (1/2)
    = 1/3

P(1 | even) = P(1 | 2 or 4 or 6)
    = P(1 and (2 or 4 or 6)) / P(2 or 4 or 6)
    = 0 / P(2 or 4 or 6) = 0
          since 1 and (2 or 4 or 6) = contradiction,
                    which has probability zero.
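Conditionalization on a finite outcome space is easy to express with sets. The sketch below assumes a fair die; `prob` and `cond_prob` are illustrative helpers, and the faces 2 and 1 are used only as examples of an even and an odd outcome.

```python
from fractions import Fraction

# Assumption: a fair die, so each atom has probability 1/6.
P_atom = {face: Fraction(1, 6) for face in range(1, 7)}

def prob(event):
    """Probability of a set of faces."""
    return sum((P_atom[f] for f in event), Fraction(0))

def cond_prob(a, b):
    """P(A|B) = P(A and B) / P(B); assumes P(B) > 0."""
    return prob(a & b) / prob(b)

even = {2, 4, 6}
print(cond_prob({2}, even))   # 1/3
print(cond_prob({1}, even))   # 0
```

The set intersection `a & b` plays the role of "A and B": intersecting with B discards every outcome incompatible with B.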

Using conditional probabilities, we arrive at a rule that applies whether or not A and B are independent:

General Product rule:  P(A and B) = P(A) x P(B|A) = P(A|B) x P(B)

To illustrate the rule, consider the probability of
A = outcome is prime = 2 or 3 or 5
B = outcome is even = 2 or 4 or 6

P(A and B) = P(A) x P(B|A)
                 = P(2 or 3 or 5) x P(2 or 4 or 6 | 2 or 3 or 5)
                  = 1/2 x 1/3 = 1/6
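Both factorizations in the general product rule can be checked numerically for the prime and even outcomes. A minimal sketch, assuming a fair die; the helper names are illustrative.

```python
from fractions import Fraction

P_atom = {face: Fraction(1, 6) for face in range(1, 7)}  # assumed fair die

def prob(event):
    return sum((P_atom[f] for f in event), Fraction(0))

def cond_prob(a, b):
    """P(A|B) = P(A and B) / P(B); assumes P(B) > 0."""
    return prob(a & b) / prob(b)

A = {2, 3, 5}   # outcome is prime
B = {2, 4, 6}   # outcome is even

p_joint = prob(A & B)
via_A = prob(A) * cond_prob(B, A)   # P(A) x P(B|A)
via_B = cond_prob(A, B) * prob(B)   # P(A|B) x P(B)

print(p_joint, via_A, via_B)   # 1/6 1/6 1/6
print(prob(A) * prob(B))       # 1/4
```

The last line shows why the simpler rule for independent outcomes would fail here: P(A) x P(B) is 1/4, not 1/6, so A and B are not independent.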

Generalized Sum Rule

Once the product rule is defined, we can generalize the sum rule stated above to cover cases in which the two outcomes involved, A and B, are not mutually exclusive. When mutual exclusivity fails, if we were just to add up the probabilities P(A) and P(B), we would be counting twice in the sum the probabilities of those outcomes in (A and B). For the probability of A and B, P(A and B), would already be summed into each of P(A) and P(B).

Generalized sum rule:
P(A or B) = P(A) + P(B) - P(A and B)

For example, if A = 1 or 2, B = 2 or 3, then
P(A or B) = P(A) + P(B) - P(A and B)
              = P(1 or 2) + P(2 or 3) - P(2)
              = 1/3 + 1/3 - 1/6 = 1/2

The rule subtracts P(A and B) = P(2) from the sum of the two probabilities P(1 or 2) plus P(2 or 3) to correct for the double counting of P(2).
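The correction for double counting can be seen directly by comparing the probability of the union with the corrected sum. This sketch assumes a fair die, and the particular overlapping sets `A` and `B` are illustrative choices.

```python
from fractions import Fraction

P_atom = {face: Fraction(1, 6) for face in range(1, 7)}  # assumed fair die

def prob(event):
    return sum((P_atom[f] for f in event), Fraction(0))

A = {1, 2}
B = {2, 3}   # overlaps A in the single outcome 2

direct = prob(A | B)                           # P(A or B) computed from the union
by_rule = prob(A) + prob(B) - prob(A & B)      # generalized sum rule
naive = prob(A) + prob(B)                      # counts the shared outcome twice

print(direct, by_rule, naive)   # 1/2 1/2 2/3
```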

Total Probability

Conditionalization takes us from a larger outcome space to a smaller outcome space. The rule of total probability allows us to reverse the process and assemble the probabilities in a larger outcome space from those in a smaller outcome space.

The rule applies when we have two mutually contradictory propositions, B and not-B, that exhaust the possibilities. Each is associated with a smaller probability space with probabilities P(.|B) and P(.|not-B). We can combine them for some outcome A according to:

Rule of Total Probability (binary case):
P(A) = P(A|B) x P(B) + P(A|not-B) x P(not-B)

The more general case arises when we have a set of mutually exclusive propositions B1, ... , Bn, whose disjunction exhausts the possibilities:

Rule of Total Probability (general case):
P(A) = P(A|B1) x P(B1) + ... +  P(A|Bn) x P(Bn)
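The general rule can be illustrated on the die. The sketch below assumes a fair die and uses a hypothetical three-cell partition of the faces; the outcome A (a prime number) and the helper names are likewise chosen only for illustration.

```python
from fractions import Fraction

P_atom = {face: Fraction(1, 6) for face in range(1, 7)}  # assumed fair die

def prob(event):
    return sum((P_atom[f] for f in event), Fraction(0))

def cond_prob(a, b):
    """P(A|B) = P(A and B) / P(B); assumes P(B) > 0."""
    return prob(a & b) / prob(b)

A = {2, 3, 5}                           # outcome is prime
partition = [{1, 2}, {3, 4}, {5, 6}]    # mutually exclusive and exhaustive

total = sum((cond_prob(A, b) * prob(b) for b in partition), Fraction(0))
print(total, prob(A))   # 1/2 1/2
```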

Law of Large Numbers

The connection to frequencies

There is a close connection between probabilities and frequencies. All the above rules for probability apply also to frequencies. Consider the sum rule, for example. Imagine that we roll a die 100 times and count up the numbers of ones, twos, threes, fours, fives and sixes. These number counts must add up to 100, the total number of throws.

number(1) + number(2) + number(3)
     + number(4) + number(5) + number(6) = 100

Divide both sides by 100 to convert the number count into the frequency of each outcome using

frequency(1) = number(1) / 100

and similarly for the other faces.

We now have:

frequency(1) + frequency(2)
     + frequency(3) + frequency(4)
           + frequency(5) + frequency(6) = 1

This is just the sum rule. The other rules above can all be recovered by similar reasoning.
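The bookkeeping for counts and frequencies can be tried on simulated rolls. This is only a sketch: the rolls come from Python's pseudo-random generator with an arbitrarily chosen seed, not from a real die.

```python
import random
from collections import Counter
from fractions import Fraction

rng = random.Random(0)   # arbitrary seed, fixed for reproducibility
rolls = [rng.randint(1, 6) for _ in range(100)]

counts = Counter(rolls)
freqs = {face: Fraction(counts[face], 100) for face in range(1, 7)}

print(sum(counts.values()))   # 100
print(sum(freqs.values()))    # 1
```

Whatever the individual counts turn out to be, the frequencies necessarily sum to one, just as the sum rule and unit norm require of probabilities.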

It is tempting to say, then, that the rules of probability are justified by these sorts of frequency counts. The difficulty, however, is that we routinely make assertions about probabilities that go beyond what these actual frequencies provide.

In this case, we also assert that

probability(1) = ... = probability(6) = 1/6.

The frequencies of the different outcomes will rarely conform with this set of equalities. The frequencies of each outcome will in general be different. In the case of 100 rolls, they must be different. Equality would require each number count to be 100/6 = 16.666... Since this value is not a whole number, it cannot be the number count of any outcome.

Large numbers

These last considerations show that we cannot simply equate probabilities with frequencies. However there is clearly some connection and an important one. That connection becomes apparent when we consider very large numbers of trials.

When we have only rolled the die a few times, the frequency of each outcome will likely differ markedly from the probabilistic value of 1/6. Or if we have tossed a fair coin only a few times, the frequency of heads will also differ markedly from the probability of 1/2. However, as we increase the number of throws, the frequency of each outcome will stabilize towards the probability. The more throws there are, the closer the frequency will come.

Here is a plot of the frequency of heads among the first 50 tosses of just one example of repeated tosses of a fair coin:

This stabilization of the frequency towards the probability of 1/2 is an example of the "weak law of large numbers." Loosely speaking, the law says:

(loose version of weak law of large numbers)
If a trial has a probability of p of a successful outcome and we repeat it many times, such that the trials are independent, then we expect the frequency of successful outcomes to approach arbitrarily closely to the probability p.

This is just a loose statement of the weak law of large numbers, since there are two notions in it not fully specified.

First, "approach arbitrarily closely" needs to be pinned down. If the frequency approaches the probability p arbitrarily closely, we mean that, for any small number ε > 0, if we repeat the trials often enough, the frequency will be within ε of the probability p, that is between p-ε and p+ε, and remain so.

Second, "we expect" needs to be made more precise. The difficulty here is that there is no absolute guarantee that the frequency will stabilize to p. In the case of a die roll, for example, there is a 1/6 chance on each roll that the outcome is a six. Thus there is a non-zero probability that every outcome is a six. The probability will be a huge product:

1/6 x 1/6 x 1/6 x 1/6 x 1/6 x 1/6 x 1/6 x 1/6 x 1/6 x...

and it will become very small, very soon; and eventually come arbitrarily close to zero. No matter how small the probability, even if it is zero, this all-sixes outcome remains a possibility. It just becomes something so extremely rare that we should not expect to encounter it.
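How quickly this product shrinks can be seen by computing it for a few run lengths. A minimal sketch, assuming a fair die and independent rolls; `p_all_same_face` is an illustrative name for the probability that every one of n rolls shows one chosen face.

```python
from fractions import Fraction

def p_all_same_face(n_rolls):
    """Probability that n independent fair rolls all show one chosen face."""
    return Fraction(1, 6) ** n_rolls

for n in (1, 5, 10, 50):
    print(n, float(p_all_same_face(n)))
```

For ten rolls the value is already 1/60466176, yet for every finite number of rolls it remains strictly greater than zero.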

What the weak law of large numbers tells us is that these sorts of deviations from the approach to p are of low probability and that probability comes arbitrarily close to zero if we repeat the trials often enough. The deviant cases remain possible but outside the realm of what we should expect to encounter.

Combining these two complications, we arrive at the following for the weak law of large numbers:

(weak law of large numbers)
A trial has a probability p of success. Choose a probability as close to one as you like and an ε > 0 as close to zero as you like, then, if independent trials are repeated often enough, the frequency of successes will arrive at and remain within ε of p with a probability at least as great as the one you chose.
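The stabilization of frequencies can be watched in a simulation. The sketch below tracks the running frequency of heads in pseudo-random tosses of a fair coin; the seed and the number of tosses are arbitrary assumptions, and a single simulated run illustrates the law rather than proving it.

```python
import random

def running_frequency(n_tosses, p=0.5, seed=1):
    """Frequency of successes after each of n_tosses independent trials."""
    rng = random.Random(seed)   # arbitrary fixed seed for reproducibility
    successes = 0
    freqs = []
    for n in range(1, n_tosses + 1):
        if rng.random() < p:    # a success occurs with probability p
            successes += 1
        freqs.append(successes / n)
    return freqs

freqs = running_frequency(100_000)
for n in (10, 100, 1_000, 10_000, 100_000):
    print(n, round(freqs[n - 1], 4))
```

Early frequencies typically wander well away from 1/2; the later ones sit much closer to it.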

Expected value

An important use of probabilities is in guiding decisions. If we have to choose among several possible outcomes in the face of uncertainty, a commonly used measure of the desirability of some selection is the expected value.

Consider for example a simple coin tossing game. It will cost a $1 fee per play. In each play, a single coin is tossed. If it comes up heads, the player wins $1 (in addition to the $1 paid being returned). If it comes up tails, the player loses the fee. Should this game be played? Or not? Or some other game?

The expectation of a single play of the game is defined as

Expected value = Probability (head) x net win amount $1
                           + Probability (tail) x net loss amount(-$1)
                          = 1/2 x $1 - 1/2 x $1
                  = $0

That is, the expected value of the game is $0 per game. Thus, by the criterion of expected value, the game is judged a fair game that gives no advantage to the player or the house offering the game.

The more general rule is for n outcomes:

Expected value
   = Probability (outcome1) x value(outcome1)
       + Probability (outcome2) x value(outcome2)
          + ...
              + Probability (outcomen) x value(outcomen)

A positive expected value is favorable; and the more positive the better. A negative expected value is unfavorable; and the more negative the worse. A zero expected value is neutral and the indication of a fair game in gambling situations.
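The general rule translates directly into a short function. This is a sketch, with `expected_value` and `coin_game` as illustrative names; outcomes are given as (probability, value) pairs and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

def expected_value(outcomes):
    """Sum of probability x value over (probability, value) pairs."""
    total_p = sum(p for p, _ in outcomes)
    if total_p != 1:
        raise ValueError("probabilities of the outcomes should sum to one")
    return sum(p * v for p, v in outcomes)

# The coin tossing game: net win of $1 on heads, net loss of $1 on tails.
coin_game = [(Fraction(1, 2), 1), (Fraction(1, 2), -1)]
print(expected_value(coin_game))   # 0
```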

While decision theorists have approached this criterion from different perspectives, it seems to me that there is only one justification for it:

The expected value is a good estimate of the average return that would be made, were the action to be taken repeatedly.

The coin tossing game illustrates how this works. There is a probability of 1/2 of winning $1; and a probability of 1/2 of losing $1. Over many plays of the game, there will be some wins and some losses. The frequency of wins and losses will eventually come arbitrarily close to the probabilities of 1/2 for each. That is, the weak law of large numbers tells us this will happen with arbitrarily high probability if the game is played long enough. Loosely speaking, we have

Average net profit over very many plays
  = 1/2 x $1 - 1/2 x $1 = $0

Thus, over the long term of many plays, neither player nor house gains on average.

If the schedule of payoffs were different, the situation may change. If, for example, a win paid only a net of $1/2, then the expected value of a single play would be 1/2 x $1/2 - 1/2 x $1 = -$1/4. That would then mean that, over many plays, the house would gain on average $1/4 and the player would lose $1/4 for each play. It would be a poor choice to play the game.

Chuck-a-Luck, Crown and Anchor

This last coin tossing game was fairly transparent. Computing the expected value was likely unnecessary for most of us to see which is a fair game and which not. Things are not always so simple. Then computing the expected value can give useful information.

A game is known as chuck-a-luck in the US and related places; and as crown and anchor in England and related places. It is played with three dice. In the US they are numbered. In England they have the four card suits, a crown and an anchor on their sides.

A player is invited to bet on one of the sides for a fee of $1. In the US game, the player may choose a six. If one of the dice shows a six, the player is paid a net of $1. If two show, the player is paid that twice: a net of $2. If all three dice show a six, the player is paid this amount ten times: a net of $10.

The game seems fair on cursory considerations. Players often reason as:

"There is a one in six chance of a six on just one die. Since there are three dice, I get three opportunities of a one in six chance. That is, there is a one in two chance of a win. So I have even odds: an equal chance of winning $1 and of losing $1. But wait... when a six comes up on one die, there are still two others. And if any of those also show a six , I win there too. If all three, I get ten times as much! Thus the game favors me. I get three ways of winning to the house's one."

Needless to say, this sort of reasoning is just what the house wants the player to think. In practice, the game is slightly favorable to the house. To see it, we need to compute the probabilities and then the expected value.

The calculation of the first of these probabilities already shows where the informal reasoning of the player has failed. The probability of just one six is merely 75/216 = 0.347, that is, roughly a third. An even odds bet on just this outcome is clearly unfavorable.

P(one six only)
=   P(six on the first die
          and not on the second and third)
     + P(six on the second die
          and not on the first and third)
     + P(six on the third die
          and not on the first and second)
=   (1/6)x(5/6)x(5/6)
     + (5/6)x(1/6)x(5/6)
     + (5/6)x(5/6)x(1/6)
= 3 x (1/6)x(5/6)x(5/6) = 75/216

Similar calculations give:

P(two sixes only) = 3 x (1/6)x(1/6)x(5/6) = 15/216

P(three sixes) = (1/6)x(1/6)x(1/6) = 1/216

P(no sixes) = (5/6)x(5/6)x(5/6) = 125/216
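These four probabilities follow a binomial pattern, which gives an independent check on the arithmetic. The sketch below assumes three fair, independent dice; `p_sixes` is an illustrative helper, and results are multiplied by 216 so they can be compared directly with the numerators above.

```python
from fractions import Fraction
from math import comb

def p_sixes(k, n_dice=3):
    """Probability of exactly k sixes among n_dice fair, independent dice."""
    p = Fraction(1, 6)
    return comb(n_dice, k) * p**k * (1 - p)**(n_dice - k)

for k in range(4):
    print(k, p_sixes(k) * 216)   # numerators over 216: 125, 75, 15, 1

print(1 - p_sixes(0))            # 91/216
```

The last line is the probability of at least one six. It is 91/216, roughly 0.42, and not the one in two that the player's informal reasoning supposes.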

We can now combine these probabilities and compute:

Expected value
  = P(one six only)x$1
         + P(two sixes only)x$2
             + P(three sixes)x$10  -  P(no sixes)x$1
= (75x$1 + 15x$2 + 1x$10 - 125x$1) / 216
= -$10/216
= -$0.0463

The calculation makes it easy to see which alteration in the payouts would make the game fair. That is, if the player were paid a net of $20 on a roll of three sixes, then the expected value would rise to $0 and the game would be fair.

With the original payout of a net of $10 on three sixes, the expected value is a net loss of $0.0463. What this means is that in repeated playing of the game, the player loses an average of $0.0463 on each play.

This average loss is small enough that it is hard to notice beneath the normal variations in wins and losses. If a player makes 100 plays, the player will on average only have lost $4.63. That is just the fee for 4.63 plays. For the house, however, this small loss by the players is a gain. Over the longer term, these gains build and keep the game profitable.
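The chuck-a-luck figures can be checked by brute force over all 216 rolls of three dice. This is a sketch under the assumption of fair, independent dice and a bet on six; the function `chuck_a_luck_ev` and its `payouts` mapping (number of sixes to net win in dollars) are illustrative constructions.

```python
from fractions import Fraction
from itertools import product

def chuck_a_luck_ev(payouts):
    """Expected net return per $1 play when betting on six.

    payouts maps the number of sixes (1, 2 or 3) to the net win in dollars;
    with no sixes the player loses the $1 fee.
    """
    rolls = list(product(range(1, 7), repeat=3))   # 216 equiprobable rolls
    total = 0
    for roll in rolls:
        sixes = roll.count(6)
        total += payouts[sixes] if sixes else -1
    return Fraction(total, len(rolls))

print(chuck_a_luck_ev({1: 1, 2: 2, 3: 10}))   # -5/108
print(chuck_a_luck_ev({1: 1, 2: 2, 3: 20}))   # 0
```

The first value, -5/108, is -10/216 in lowest terms, about -$0.0463 per play. The second confirms that a net payout of $20 on three sixes would make the game fair.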

August 13, 2021

Copyright, John D. Norton