HPS 0628 | Paradox | |


Paradoxes From Probability Theory: Independence

John D. Norton

Department of History and Philosophy of Science

University of Pittsburgh

http://www.pitt.edu/~jdnorton

For a compact reminder of probability theory, see Probability Theory Refresher.

While we likely believe that we each have a clear grasp of how independent outcomes behave, it is common for us to be mistaken. We know that the successive throws of a six-sided die and the successive tosses of a fair coin are independent. Yet if we are asked to simulate what a sequence of these independent outcomes would look like, we do poorly at the task. We tend to include too many alternations of, say, heads and tails; and not enough successive heads or successive tails.

Here is a sequence of random heads and tails, resulting from a good simulation of independent coin tosses:

**HTHHHTTHHHTHHHHTHTTTTTTHH...**

It likely looks suspect.
It starts with several long runs of heads, HHH, HHH, and HHHH; and then a
longer run of tails TTTTTT. Six tails in a row can happen with a probability
of 1/2^{6} = 1/64, but that is a small probability. Has this
sequence really been produced by a proper randomization with independent
outcomes, we may wonder.

Yet this sequence is a good simulation of random coin tosses. It arises from taking the fractional part of the decimal expansion of π = 3.1415926535897932384626433... and replacing odd digits with "H" and even digits with "T."
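The digit-to-toss mapping just described can be reproduced in a few lines. What follows is a minimal sketch in Python, not the procedure actually used to prepare the sequence. The names `digits_to_tosses` and `longest_run` are illustrative, and the digits are simply hard-coded from the expansion quoted above.

```python
from itertools import groupby

# First 25 digits of the fractional part of pi, as quoted in the text.
PI_FRACTION_DIGITS = "1415926535897932384626433"

def digits_to_tosses(digits: str) -> str:
    """Map odd digits to 'H' and even digits to 'T'."""
    return "".join("H" if int(d) % 2 == 1 else "T" for d in digits)

def longest_run(tosses: str, face: str) -> int:
    """Length of the longest unbroken run of `face` in the toss sequence."""
    runs = [len(list(group)) for value, group in groupby(tosses) if value == face]
    return max(runs, default=0)

tosses = digits_to_tosses(PI_FRACTION_DIGITS)
print(tosses)
print("longest run of heads:", longest_run(tosses, "H"))
print("longest run of tails:", longest_run(tosses, "T"))
```

Extending `PI_FRACTION_DIGITS` with more digits of π gives a longer sequence in which to look for runs.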

A more vivid failure of our intuitions on independence arises if we have a grid with squares filled in randomly. In the grid below, each cell has a probability of 1/2 of being filled with color (and labeled "1"). The probability of color in each cell is independent of the states of all the other cells.

Nevertheless, a cursory scan of the pattern suggests that the placement of the colored cells is not really random. Rather the placement seems to favor a clumping of colored cells. We imagine that this sort of clumping is a manifestation of failure of independence: once one cell is colored, it seems to be more likely that others around it will be colored.
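Readers who want to see such a grid for themselves can generate one. The sketch below assumes a 12 by 24 grid and an arbitrary random seed; neither is taken from the figure, and the function name `random_grid` is illustrative only.

```python
import random

def random_grid(rows: int, cols: int, p: float = 0.5, seed: int = 0) -> list[list[int]]:
    """Fill each cell independently: 1 (colored) with probability p, else 0."""
    rng = random.Random(seed)  # fixed seed so the same grid can be regenerated
    return [[1 if rng.random() < p else 0 for _ in range(cols)] for _ in range(rows)]

grid = random_grid(rows=12, cols=24)
for row in grid:
    # '#' marks a colored cell, '.' an empty one
    print("".join("#" if cell else "." for cell in row))
```

Changing `seed` gives a fresh grid each time. The apparent clumps move around, but they do not go away.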

These sorts of ordinary difficulties in grasping independence lead to some celebrated paradoxes.

The gambler's fallacy is perhaps the best known example of how ordinary ideas about independence can lead us to mistaken results. In any game of chance, if the successive outcomes are independent, we expect that the frequencies of outcomes will come to match their probabilities. This is an idea made more precise in the laws of large numbers.

We expect that a six will appear in roughly one sixth of repeated tosses of a fair die. We expect a red will appear roughly as often as black on fair roulette spins, where the numbers are coded as red and black.

Outer rim from https://commons.wikimedia.org/wiki/File:Basic_roulette_wheel.svg

Since the 0 and 00 are neither red nor black, the probability of red is the same as the probability of black, 18/38 = 0.474. We expect both red and black to appear with roughly this frequency in repeated spins.

**RRRRRRRRRR**

What if we see a long run of red, say, so that the frequency of reds has risen well above that of blacks? If we think in accord with the gambler's fallacy, we know that the average frequency of red and black must be restored. We would then conclude that there will be relatively more blacks in the spins following. For otherwise--we would fallaciously infer--the averages could not be maintained.

brain image from
https://commons.wikimedia.org/wiki/File:Noun_Brain_1325493.svg

Of course this is a fallacy. A roulette wheel is a simple mechanical device that has no memory of what has happened in the past. It is not thinking and saying:

"Oh, we have had a long run of reds. I had better now produce a lot more blacks to even things out!"

The probability of a black on the next spin is completely unaffected by whatever may have happened in past spins. For each spin outcome is independent of those before and after it.

https://commons.wikimedia.org/wiki/File:Welcome_to_fabulous_las_vegas_sign.jpg

Indeed, if one pays attention to the gambling devices in a casino, the prevalence of the gambler's fallacy becomes mysterious. For casinos go to great trouble to reassure gamblers that the reds and blacks upon which they bet are generated by a system whose operation is completely transparent to the gambler. It is simply a well-balanced wheel onto which a ball is thrown while the wheel spins. It is a mechanism in which past and future spins can have no influence on present spins.

How is it that the averages do come to behave well
in the long run? This long run behavior is the result of the weight
of all the later spins suppressing any deviation in the earlier
spins. To see how this works, consider a greatly simplified example. We
might have an improbable run of 10 reds in a row. It has a probability of
1/2^{10} = 1/1024.

**RRRRRRRRRR**

Then we imagine that it is followed by a well-behaved alternation of red-black-red-...

**RRRRRRRRRR RBRBRBRB...**

These subsequent outcomes will eventually outweigh the improbable run of red.

After 10 added
red-black pairs, the frequency of red drops to

(10+10)/(10 + 20) = 2/3 = 0.667.

After 100 added
red-black pairs, the frequency of red drops to

(10+100)/(10 + 200) = 110/210 = 0.524.

After 1000 added
red-black pairs, the frequency of red drops to

(10+1000)/(10 + 2000) = 1010/2010 = 0.502.

A roulette wheel needs no memory to return frequencies to the averages expected. It just needs patience to allow future spins to outweigh the outliers.
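The dilution of the initial run can be tabulated directly. This is a small sketch of the simplified example only (ten reds followed by perfectly alternating red-black pairs), not a model of a real wheel. The function name `red_frequency` is an illustrative choice.

```python
def red_frequency(initial_reds: int, added_pairs: int) -> float:
    """Frequency of red after a run of reds followed by red-black pairs."""
    reds = initial_reds + added_pairs        # each added pair contributes one red
    total = initial_reds + 2 * added_pairs   # and two spins
    return reds / total

for pairs in (10, 100, 1000, 10_000):
    print(f"{pairs:>6} pairs: frequency of red = {red_frequency(10, pairs):.3f}")
```

The frequency approaches 1/2 without any spin "correcting" for the early run. The run is merely outnumbered.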

Quotes below are from the English translation of Pierre
Simon Laplace, *Philosophical Essay on Probabilities*. 6th French
edition. Trans. F. W. Truscott and F. L. Emory. London: John Wiley and
Sons, 1902.

The gambler's fallacy has long been recognized in the literature on probability. Laplace's 1814 "Essai philosophique sur les probabilités" has a chapter devoted to the common mistakes concerning probabilities. (Ch. XVI "Concerning Illusions in the Estimation of Probabilities.") His versions of the paradox include the following.

Among versions of the paradox is the supposition that, after nine heads are thrown in succession, the tenth will be tails:

"It is, for example, very improbable that at the play of heads and tails one will throw heads ten times in succession. This improbability which strikes us indeed when it has happened nine times, leads us to believe that at the tenth throw tails will be thrown." (p. 162)

Another concerns the drawing of lottery numbers on successive lotteries. The mistaken belief is that a number that hasn't been drawn in the past becomes more probable in the future:

"A similar illusion persuades many people
that one can certainly win in a lottery by placing each time upon the same
number, until it is drawn, a stake whose product surpasses the sum of all
the stakes." (p. 162)

Image source:
https://commons.wikimedia.org/wiki/File:TheBiLLeLottoBallMachine2012.jpg

Another version concerns the gender of children born in a month. Assuming that the genders of different births are independent, akin to different coin tosses, we still expect the ratio of boys to girls to be roughly half-half. Someone eager for a son will view with alarm that the proportion of births in some month is initially skewed toward boys. For surely the later births will be skewed towards girls to even out the averages:

"I have seen men, ardently desirous of having a son, who could learn only with anxiety of the births of boys in the month when they expected to become fathers. Imagining that the ratio of these births to those of girls ought to be the same at the end of each month, they judged that the boys already born would render more probable the births next of girls." (p. 162)

https://commons.wikimedia.org/wiki/File:Twemoji_1f6bc.svg

Laplace also reports what is the reverse of the gambler's fallacy. If one finds an unexpected run of repeated heads in coin tosses or of red in roulette, one might imagine that somehow this run will persist so that one should expect more heads or more reds. Laplace reports the error in the context of lotteries:

"By an illusion contrary to the preceding ones one seeks in the past drawings of the lottery of France the numbers most often drawn, in order to form combinations upon which one thinks to place the stake to advantage. But when the manner in which the mixing of the numbers in this lottery is considered, the past ought to have no influence upon the future. The very frequent drawings of a number are only the anomalies of chance; I have submitted several of them to calculation and have constantly found that they are included within the limits which the supposition of an equal possibility of the drawing of all the numbers allows us to admit without improbability." (pp. 163-64)

The inference might not be mistaken if there were some possibility that the coin tossed is weighted towards heads, or the roulette wheel imbalanced towards red or the numbers drawn in the lottery biased. Then a gambler could use the history of outcomes to estimate the bias in the results; and this would give the gambler an advantage. However the mistake persists--as Laplace points out--even when the mechanism of the drawing is made transparent to the gambler so that the absence of bias is obvious.

https://commons.wikimedia.org/wiki/File:Basketball_clipart_hoop.png

A modern related fallacy is the "hot hand" fallacy. Once a basketball shooter has had some successes in scoring, the player is judged to be "hot" or to have a "hot hand" even though statistically it turns out that these past successes are commonly not continued.

The gambler's paradox can come in many varieties which may not be so obviously the same paradox. Consider a factory with machines that make widgets. These widgets must be made to quite exacting tolerances and these standards strain what the machines can do. They produce in-specification and out-of-specification widgets, each with a probability of 1/2; and the probabilities of each on subsequent runs of the machine are independent.

The factory owner is disturbed by the large number of out-of-specification widgets being produced. He decides on a radical policy to increase the number of in-specification widgets:

Each machine will keep producing widgets until it has produced an in-specification widget. It will then halt production for the day.

The owner reasons that with this policy each machine must produce an in-specification widget each day, while it may not produce any out-of-specification widgets, if the first run of the machine produces an in-specification widget.

The policy may initially seem to be promising until we compute the expected number of each type of widget that each machine will produce in one day.

It is assumed here that a machine can continue to
produce widgets indefinitely until the sought in-specification widget is
produced. While this idealization would require a
supertask in which an infinity of runs is completed to cover all
the possibilities, in practice it is a benign idealization that hardly
affects the analysis. The probability of a large number of
out-of-specification widgets is very small.

The expected number of in-specification widgets is fixed at one by the policy's requirement that the machine halt production with the first in-specification widget.

Expected (in-specification) = 1

The out-of-specification widgets require a more complicated computation. Writing "in" and "out" with obvious meanings, the possible outcomes for a machine on one day are:

outcome | probability | number of out x probability |

in | 1/2 | 0 x 1/2 = 0 |

out, in | 1/4 | 1 x 1/4 = 1/4 |

out, out, in | 1/8 | 2 x 1/8 = 2/8 |

out, out, out, in | 1/16 | 3 x 1/16 = 3/16 |

out, out, out, out, in | 1/32 | 4 x 1/32 = 4/32 |

... | ... | ... |

The expected number of out-of-specification widgets is found by summing the entries in the last column. The resulting infinite series is

0 + 1/4 + 2/8 + 3/16 + 4/32 + 5/64 + ... = 1 = Expectation (out-of-specification)

To see this sum, break up terms like 2/8 = 1/8+1/8, 3/16 = 1/16+1/16+1/16, etc. Then rearrange the sums to produce many series, such that each term has only 1 in the numerator:

0 + 1/4 + 2/8 + 3/16 + 4/32 + 5/64 + ...

= (1/4 + 1/8 + 1/16 + 1/32 + 1/64 + ...)
+ (1/8 + 1/16 + 1/32 + 1/64 + ...)
+ (1/16 + 1/32 + 1/64 + ...)
+ (1/32 + 1/64 + ...)
+ ...

= 1/2 + 1/4 + 1/8 + 1/16 + ... = 1

The result is that, as far as the expectations are concerned, there will be no advantage. The expectation of in-specification and out-of-specification widgets is the same, one. That is, on average, the factory will produce just as many in-specification widgets as out-of-specification widgets.
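The expectation can be checked numerically in two ways: by summing the series term by term, and by simulating the owner's stopping policy. The sketch below assumes the idealized machine of the text (each widget out-of-specification with probability 1/2, independently). The function names, the seed and the number of simulated days are arbitrary illustrative choices.

```python
import random

def expected_out_partial_sum(terms: int = 60) -> float:
    """Partial sum of k * (1/2)**(k+1): k 'out' widgets followed by one 'in'."""
    return sum(k * 0.5 ** (k + 1) for k in range(terms))

def simulate_day(rng: random.Random) -> int:
    """Run one machine until its first in-specification widget; count the rejects."""
    rejects = 0
    while rng.random() < 0.5:  # out-of-specification with probability 1/2
        rejects += 1
    return rejects

rng = random.Random(2024)  # arbitrary seed, for reproducibility only
days = 100_000
average_rejects = sum(simulate_day(rng) for _ in range(days)) / days

print(f"series partial sum:        {expected_out_partial_sum():.6f}")
print(f"simulated rejects per day: {average_rejects:.3f}")
```

Each simulated day yields exactly one in-specification widget, so a simulated average of rejects close to one means the stopping policy has not shifted the balance.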

This result did not need the big computation just made. If we recall that the results of the runs of a machine are independent of each other, then the result is a foregone conclusion. The error is the same as that made in the gambler's fallacy: the past history of independent coin tosses or the past history of independent widget factory productions has no effect on the future production.

The average number of in-specification and out-of-specification widgets produced by each machine on a single run is the same (since each has a probability 1/2). These averages are unaffected by repeating the runs, even with the policy described (because the runs are independent), so that on average the number of in-specification and out-of-specification widgets in the whole factory remains the same.

However there is one difference, not in the averages, but in the rates that give the average. Each machine now produces exactly one in-specification widget per day.

Each machine also produces a variable number of
out-of-specification widgets. Sometimes none, sometimes one, sometimes
two, etc.


These variable numbers average
out to one. The effect is that on some days a machine will produce more
in-specification than out-of-specification widgets and on other days fewer.
But on average the number of in-specification widgets will match the
number of out-of-specification widgets.

In his *Essay*, Laplace gives a version
of the same problem. His version has an infinite urn with equal
numbers of white and black balls. People choose balls at random and use
the factory owner's strategy to try to maximize the number of white balls
drawn.

For another day: Laplace's specification of an
infinity of white and black balls defeats his analysis. For the infinities
are now those that are governed by Cantor's theory, so that the associated
probabilities of white and black become ill-defined. If we number the
balls 1, 2, 3, ..., we could label each odd-numbered ball white and each
even-numbered ball black. Or we could label each power of 10 white, and
the remainder black. These two labelings cannot both give the same probabilities.
Laplace's example can be made to work if we assume a very, very large
number of balls, but still finite in number.

"Thus by imagining an urn filled with an
infinity of white and black balls in equal number, and supposing a great
number of persons each of whom draws a ball from this urn and continues
with the intention of stopping when he shall have extracted a white ball,
one has believed that this intention ought to render the number of white
balls extracted superior to that of the black ones.

Indeed this intention gives necessarily after all the drawings a number of
white balls equal to that of persons, and it is possible that these
drawings would never lead a black ball.

But it is easy to see that this first notion is only an illusion; for if
one conceives that in the first drawing all the persons draw at once a
ball from the urn, it is evident that their intention can have no
influence upon the color of the balls which ought to appear at this
drawing. Its unique effect will be to exclude from the second drawing the
persons who shall have drawn a white one at the first. It is likewise
apparent that the intention of the persons who shall take part in the new
drawing will have no influence upon the color of the balls which shall be
drawn, and that it will be the same at the following drawings.

This intention will have no influence then upon the color of the balls
extracted in the totality of drawings; it will, however, cause more or
fewer to participate at each drawing. The ratio of the white balls
extracted to the black ones will differ thus very little from unity. It
follows that the number of persons being supposed very large, if
observation gives between the colors extracted a ratio which differs
sensibly from unity, it is very probable that the same difference is found
between unity and the ratio of the white balls to the black contained in
the urn." (pp. 168-69)

The problem was already established as a routine
puzzle offered in mid-20th century popular writing, such as George
Gamow's *One Two Three ... Infinity*. New York: Viking Press,
1947, 1961. pp. 213-215.

The birthday problem has been widely recounted as an illustration of how our intuitions about probabilities can be mistaken. The problem concerns the birthdays of people in a group. We are to assume that, for each person, all days in the year are equally probable; and that the probabilities for each person's birthday are independent of the birthdays of other people in the group.

The problem is this:

How large does the group need to be for there to be a probability of 1/2 that at least two of the people share the same birthday?

Since there are 365 days in the year, casual reflection suggests that we need a large group whose size is comparable with 365.

To get a first grasp on the problem, we might ask: how big must the group be for someone to share my birthday? Since each person in the group has only a probability of 1/365 of sharing my birthday, it would seem that we need a large group.

To see that this supposition is right, consider how
large that group can be *without* anyone
sharing my birthday. There is a probability of

364/365

that any one specified person in the group fails to share my birthday. What if we have a group with 365 people? The probability that no-one shares my birthday is

364/365 x 364/365 x ... x 364/365

where the factor of 364/365 is multiplied by itself 365 times. Computing:

The constant e is the base of natural logarithms and
one definition is that it is the limit as N goes to infinity of (1+1/N)^{N}.
The limit of (1-1/N)^{N} gives us 1/e.

(364/365)^{365} =(1 - 1/365)^{365}
≈ 1/e = 0.3679

That is, the probability is only 1-0.3679 = 0.6321 that someone in the group shares my birthday, even when the group has 365 people in it.

It does seem that birthday duplications require large groups.
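The approximation by 1/e can be compared with the exact power. This is a brief sketch; the function name `prob_nobody_shares` is illustrative.

```python
import math

def prob_nobody_shares(group_size: int, days: int = 365) -> float:
    """Probability that none of `group_size` people has one fixed birthday."""
    return ((days - 1) / days) ** group_size

exact = prob_nobody_shares(365)
print(f"(364/365)^365   = {exact:.4f}")
print(f"1/e             = {1 / math.e:.4f}")
print(f"P(someone does) = {1 - exact:.4f}")
```

The exact power and 1/e differ only in the fourth decimal place, so the approximation is harmless here.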

The problem posed here, however, is slightly
different. It does not ask after the chance that someone shares *my*
birthday, but that any two people in the group
share a birthday. It is hard to imagine that this small change would make
much difference to the size of the group needed. It seems quite plausible
that we will need a group of, say, half the size of 365, which is roughly
182. Perhaps that may seem too large. Might 150 or even 100 be large
enough?

The correct answer, to be computed below, is that all these numbers are too large. Let us trace through the calculation to see what the correct answer is.

A simpler, warm up problem gives us the basics of the calculation. If there were just ten days in the year, the calculation would proceed as follows. We consider the people in the group one by one and ask for the probability that there is NO duplication of birthdays.

Person 1 in the group can have any birthday.

If there is to be no duplication, person 2 must have a birthday on any of the nine remaining days.

Probability of no duplication with two people = 9/10

If there is to be no duplication, person 3 must have a birthday on any of the remaining 8 days.

Probability of no duplication with three people = (9/10)x(8/10) = 0.72

Proceeding in this way we find:

Probability of no duplication with four people = (9/10)x(8/10)x(7/10) = 0.504

Probability of no duplication with five people = (9/10)x(8/10)x(7/10)x(6/10) = 0.3024.

The probability of no duplication dips below 0.5 as the group grows from four to five people. That is the point at which the probability of at least one duplication passes 0.5. This result matches the hunch that a group roughly half as large as the number of days in the year is needed to solve the problem.

Following the template of the ten day year problem, we can now calculate the probabilities of no duplication for the longer, 365 day year:

Probability of no duplication with two people = (364/365)

Probability of no duplication with three people = (364/365) x (363/365)

Probability of no duplication with four people = (364/365) x (363/365) x (362/365)

and so on.

The spreadsheet in which these calculations are
carried out is here.

For this problem, the simplest solution comes from
computing these products directly. We find from these calculations that

number in group | probability of no duplication, p | probability of at least one duplication, 1-p |

10 | 0.883 | 0.117 |

20 | 0.589 | 0.411 |

22 | 0.5243 | 0.4757 |

23 | 0.4927 | 0.5073 |

30 | 0.2937 | 0.7063 |

40 | 0.1088 | 0.8912 |

50 | 0.0296 | 0.9704 |

60 | 0.0059 | 0.9941 |

70 | 0.0008 | 0.9992 |

Why "at least two people"?
The second column shows the probability that no one in the group shares a
birthday with anyone else in the group. The third column shows the
probability of the complement of this outcome. It consists of all the
possibilities excluding the case of no shared birthdays. These
possibilities include several people sharing the same birthday; and also
more than one pair of people sharing the same birthdays.

That is, once we have a group with 23 people in it, there is a probability greater than 1/2 that at least two people in the group share a birthday.
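The products are easy to compute by machine. The sketch below is not the spreadsheet mentioned above but a plain Python equivalent under the same assumptions (equally probable days, independent birthdays). The function names `prob_no_duplication` and `smallest_group` are illustrative.

```python
def prob_no_duplication(group_size: int, days: int = 365) -> float:
    """Probability that all birthdays in the group are distinct."""
    p = 1.0
    for k in range(group_size):
        p *= (days - k) / days  # person k+1 must avoid the k days already taken
    return p

def smallest_group(target: float = 0.5, days: int = 365) -> int:
    """Smallest group whose probability of a duplication reaches `target`."""
    n = 1
    while 1 - prob_no_duplication(n, days) < target:
        n += 1
    return n

for n in (10, 20, 22, 23, 30, 40, 50, 60, 70):
    p = prob_no_duplication(n)
    print(f"{n:>3} | {p:.4f} | {1 - p:.4f}")

print("365-day year:", smallest_group())
print("ten-day year:", smallest_group(days=10))
```

The `days` parameter lets the same function cover the ten-day warm-up year as well as the 365-day year.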

This number, 23, is surely surprisingly small! Expressed as a fraction of the total number of days, it is:

23/365 = 0.063

That is, we need a group that has only 6.3% of the size of the number of days in a year to have just over a one in two chance of at least one duplication of birthdays.

The table shows how quickly the probability of at least one duplication grows with increases in the size of the group. Once the group has 60 people, the probability of at least one duplication passes 0.99. That size group is only 60/365 = 0.164 or 16.4% of the number of days in a year.

Why were we so ready to imagine that a much larger number is needed? It is, once again, our natural misappraisal of independence. We expect independently chosen birthdays to be spread out over the year. When we have 10 or 20 people in the group, there are still 355 or 345 dates in the year on which no one in the group has a birthday. It seems that there is plenty of space in the days of the year in which independence can spread the birthdays of new members of the group before there are any duplications.

The mistake is to imagine that independence is somehow trying to spread out the birthdays. Independence has no such brief. As we have seen above, independent results exhibit far more clustering than we intuitively expect. The clustering of these birthdays is another example of it.

How did the "*My*"
Birthday Problem mislead us? It is easy to overlook that, once we
have a group, the "*My*" Birthday Problem repeats many times. If we
have 10 in the group, then there are (10x9)/2 = 45 pairings of people,
each with its own "*My*" Birthday Problem. If there are 20 people
in the group, then there are (20x19)/2 = 190 of these pairings. All it
takes for a single duplication to arise is that *just one* of
these pairings leads to a matched birthday. With this many opportunities,
even an event of low probability individually becomes likely to happen
somewhere.
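The count of pairings, and a rough estimate built on it, can be sketched as follows. The estimate treats every pairing as a separate, independent 1/365 chance of a match. That is a heuristic assumption introduced here for illustration, not the exact calculation given earlier. The function names are illustrative.

```python
import math

def pair_count(group_size: int) -> int:
    """Number of distinct pairs of people in the group."""
    return math.comb(group_size, 2)

def approx_prob_duplication(group_size: int, days: int = 365) -> float:
    """Heuristic: treat every pair as a separate, independent 1/days chance of a match."""
    return 1 - (1 - 1 / days) ** pair_count(group_size)

for n in (10, 20, 23):
    print(f"{n} people: {pair_count(n)} pairs, "
          f"approximate P(duplication) = {approx_prob_duplication(n):.3f}")
```

Even this crude estimate lands close to the exact probabilities in the table, which suggests that the growth in the number of pairings is what drives the result.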

The surprising result in the birthday problem is that it is solved by such a small fraction of the number of days in a year. That fraction would keep decreasing if there were more days in the year than 365. This effect arises in a closely related problem, the lottery ticket number problem.

In some lottery, players choose a number from a large set and, if their chosen number matches that of a random drawing, they win.

What is the probability that at least two players choose the same number?

More specifically, how does this probability depend on the number of players and the size of the set of numbers from which the players can choose?

In real games of this
type, independence fails. Players are more likely to choose familiar
numbers like 123 as opposed to one that has no familiar pattern. That is,
the probability that one player chooses some specific number is increased
if other players have chosen it. It is a popular number.

As in the birthday problem, we assume that a player chooses such that each number has the same probability; and that these probabilities are independent of the numbers chosen by other players.

This problem is structurally just like the birthday problem, except that now the analog of the number of days in a year, the size of the set of numbers available, can grow much larger than 365. The calculation of the probability that there are no duplications in the numbers chosen is the same in form as for the birthday problem.

If we have ten players and a choice of 1,000 possible numbers, the probability of no duplication is:

(999/1000) x (998/1000) x (997/1000)

x (996/1000) x (995/1000) x (994/1000)

x (993/1000) x (992/1000) x (991/1000)

The probability that there is at least one duplication is just one minus this last probability.
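For small numbers of players the product can be evaluated directly. This is a minimal sketch under the stated assumptions of equal probabilities and independent choices. The function name `prob_no_shared_number` is illustrative.

```python
def prob_no_shared_number(players: int, set_size: int) -> float:
    """Probability that all players choose different numbers."""
    p = 1.0
    for k in range(players):
        p *= (set_size - k) / set_size  # player k+1 must avoid k numbers already chosen
    return p

p_none = prob_no_shared_number(players=10, set_size=1000)
print(f"P(no duplication)         = {p_none:.4f}")
print(f"P(at least one duplicate) = {1 - p_none:.4f}")
```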

In a document linked here,
I provide the calculations needed to arrive at the simple formulae. They
proved to be more taxing than I expected. Word
source document here.
The spreadsheet from which the numbers in the table are drawn is here.

Computing products like these ceases to be practical when the number of factors grows very large. Instead, it is possible to use simple formulae that approximate the results very closely. Typical results are shown in the table below.

size of set of numbers | number of players to have probability 0.999 of duplication | number of players as fraction of size of set |

1,000 | 118 | 0.118 |

10,000 | 372 | 0.0372 |

100,000 | 1,176 | 0.01176 |

1,000,000 | 3,717 | 0.003717 |

A probability of duplication of 0.999 is very high and, in most real situations, counts as practical certainty. The number of players needed to secure it, however, proves to be very small. If there are 1,000,000 numbers from which to choose, to secure this high probability we need only 3,717 players. That number is just 0.37% of the size of the set of numbers from which the players can choose.

The approximation
formula that applies when the set of possible numbers is large
is:

number of players n to assure at least one duplication with probability p and a set of numbers of size N

n ≈ √(-2 N ln(1-p))

This means that the number needed to secure some probability p grows with
the square root of N. In the table, with each row, we increase the size of
the set of tickets by a factor of 10. The number of players needed to
retain the probability of 0.999 grew only by a factor of √10 = 3.16.
Correspondingly, the fraction in the third column reduces with each row by
a factor of 1/√10 = 0.316. It follows that this fraction can be made
arbitrarily small merely by making the size of the set of numbers
arbitrarily large.
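The square root formula is simple enough to evaluate for each row of the table. In the sketch below, rounding up to the next whole player is an assumption made here so that the target probability is met. The function name `players_needed` is illustrative.

```python
import math

def players_needed(p: float, set_size: int) -> int:
    """Approximate number of players for duplication probability p (large set_size)."""
    # Rounding up is an assumption of this sketch, not part of the formula itself.
    return math.ceil(math.sqrt(-2 * math.log(1 - p) * set_size))

for N in (1_000, 10_000, 100_000, 1_000_000):
    n = players_needed(0.999, N)
    print(f"{N:>9,} | {n:>5,} | {n / N:.6f}")
```

Each tenfold increase in `N` multiplies the number of players by roughly √10, so the fraction in the last column keeps shrinking.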

The probability computed above should be distinguished from the probability that at least one and possibly more players choose the winning ticket. That probability is much smaller. Its computation is shown below.

For a
set of 1,000 numbers and set of players of size 1,000, the
probability of no-one choosing the winning number is

(999/1000)^{1000} = (1 - 1/1000)^{1000} ≈ 1/e = 0.3679

The probability that at least one player chooses the winning number is
then 1 - 0.3679 = 0.6321. Over half of this probability, however, accrues
to the case in which just one player chooses the winning number. That
probability is

1000 x (1/1000) x (999/1000)^{999} = 0.3681

The term (1/1000) x (999/1000)^{999} is the probability that some
particular player, and no other, chooses the winning number. It is
multiplied by 1000, since there are 1000 players supposed.

This result is insensitive to the use of 1,000 in the calculation. Similar
results follow if we had replaced it with 10,000 or 100,000 and so on.
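The insensitivity to the choice of 1,000 can be seen by repeating the calculation for larger sets. This sketch assumes, as in the text, that the number of players equals the size of the set of numbers. The function names are illustrative.

```python
import math

def prob_no_winner(n: int) -> float:
    """n players, n equally likely numbers: nobody picks the winning number."""
    return (1 - 1 / n) ** n

def prob_exactly_one_winner(n: int) -> float:
    """Exactly one of the n players picks the winning number."""
    return n * (1 / n) * (1 - 1 / n) ** (n - 1)

for n in (1_000, 10_000, 100_000):
    print(f"n = {n:>7,}: P(no winner) = {prob_no_winner(n):.4f}, "
          f"P(exactly one) = {prob_exactly_one_winner(n):.4f}, "
          f"1/e = {1 / math.e:.4f}")
```

Both probabilities settle ever closer to 1/e as `n` grows.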

The paradoxes concerning independence reviewed so far are not so surprising once one reflects on the system considered. Here is one where the result is surprising, I believe, even after all the calculations are reviewed. It concerns our natural tendency to misinterpret independence. We tend to think of outcomes from multiple independent trials to be so free of one another that they spread out in their values. The following paradox shows this tendency in our thinking.

In the US, quintuplet and higher order births are rare. In 2014, there were 47 quintuplet and higher births among 3,988,076 births. In 2016 there were 31 quintuplet and higher births among 3,945,875 births. In 2019, there were 36 quintuplet and higher births among 3,747,540 births.

To keep the numbers simple, let us imagine some country roughly a third the population of the US so that the rate of quintuplet births is 12 per year on average. We will model this probabilistically with two conditions:

• Each quintuplet birth is a probabilistically
random event such that there will be 12 per year on average.

• Quintuplet births are probabilistically independent.

To get a sense of how our intuitions guide us, try to answer the following questions without reading ahead.

1. What is the probability that some year will have 12 quintuplet births, with
exactly one in each of the 12 months of the year?

2. How does the probability of 1. compare with the probability that there will be no
quintuplet births at all in some year?

3. How do the probabilities of 1. and 2. compare with the probability that *all*
quintuplet births in one year (be they 1, 2, ..., 11, 12, 13, ...) occur
in just one month, while the remaining eleven months of the year have no
quintuplet births?

??????????

1. Let us take the first question. Since there are on average 12 quintuplet births a year, we expect there to be one on average each month. Since the numbers can fluctuate, we would not expect exactly one per month for the whole year. However it is plausible that exactly one per month is something that happens often enough not to be remarkable.

Jan | Feb | Mar | Apr |

May | Jun | Jul | Aug |

Sep | Oct | Nov | Dec |

Now let us calculate. In any month, there may be no, one, two or more quintuplet births. It is then not so surprising that the probability of exactly one quintuplet birth in a single month is roughly 1/3. More precisely:

Probability(exactly one quintuplet birth in a month) = 0.3679 = 1/e

For experts: see below for the calculations behind
this and later probability values.

where e is the base of natural logarithms. To have a full year with one quintuplet birth in each month, this outcome must occur 12 times. Since the outcomes are independent, the probability is computed by multiplying together 12 of these probabilities:

Probability(exactly one quintuplet birth in each month for a year) = 0.3679^{12} = (1/e)^{12} = 0.000006144

That is, needless to say, a very small probability. It corresponds to the perfectly uniform year of quintuplet births arising on average once in every 1/0.000006144 = 162,755 years. That is, the occurrence is very, very rare!
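These numbers can be reproduced with a short calculation. The sketch assumes that the number of quintuplet births in a month follows a Poisson distribution with a mean of one. That is the standard model for independent rare events and an assumption of this sketch, which need not match the calculations referred to above. The function name `poisson_pmf` is illustrative.

```python
import math

def poisson_pmf(k: int, mean: float = 1.0) -> float:
    """Probability of exactly k events when the expected number is `mean`."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

p_one = poisson_pmf(1)          # exactly one birth in a month
p_uniform_year = p_one ** 12    # exactly one birth in each of the 12 months

print(f"P(exactly one in a month) = {p_one:.4f}")
print(f"P(one in every month)     = {p_uniform_year:.9f}")
print(f"about once in {1 / p_uniform_year:,.0f} years")
```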

What this shows is that our natural, mistaken
inclination with independence is to think of it as a kind of repellent. If
the quintuplet births are independent, we might imagine that a quintuplet
birth in one month somehow precludes or makes less likely another
quintuplet birth in the same month. That repellent effect would then
spread out the quintuplet births over the year fairly uniformly. This
last calculation, however, shows just how wrong this
repellent picture of independence is. That there has been a
quintuplet birth in some month has no repellent effect on the occurrence
of another quintuplet birth in the same month, or, for that matter, no attractive
effect either. Independence tells us that these births occur without any
connection to others that may or may not occur.
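A small simulation may make the absence of any repellent effect more tangible. The sketch below assumes the same model as the text (independent months, each with a Poisson-distributed number of births of mean one) and counts how many months of a simulated year have no births at all. The sampler uses Knuth's multiplication method; the function names, the seed and the number of simulated years are arbitrary choices made for illustration.

```python
import math
import random

def sample_poisson(mean, rng):
    """Draw one Poisson-distributed count using Knuth's multiplication method."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def simulate_year(rng, months=12, mean_per_month=1.0):
    """Births in each month of one year; the months are sampled independently."""
    return [sample_poisson(mean_per_month, rng) for _ in range(months)]

def mean_empty_months(n_years, seed=0):
    """Average number of months per year with no births at all."""
    rng = random.Random(seed)
    total_empty = 0
    for _ in range(n_years):
        total_empty += sum(1 for births in simulate_year(rng) if births == 0)
    return total_empty / n_years

estimate = mean_empty_months(20_000)
expected = 12 / math.e  # each month is empty with probability 1/e
print(f"simulated: {estimate:.2f}, expected: {expected:.2f}")
```

Since each month is empty with probability 1/e, the expected number of empty months is 12/e, a little over four per year. The births that a repellent picture would spread evenly are instead piled up in the remaining months.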

2. Now consider the probability of no quintuplet births in a year.

Jan | Feb | Mar | Apr |

May | Jun | Jul | Aug |

Sep | Oct | Nov | Dec |

Just as a month can have one quintuplet birth in it, there may be a month with one fewer (i.e. none) with a similar probability. A more precise calculation shows that the probability of none is the same as that of one quintuplet birth in a month, that is, roughly 1/3, or more precisely:

Probability(no quintuplet births in a month) = 0.3679 = 1/e

The calculation then proceeds as before and we find

Probability(no quintuplet births for a year) =
0.3679^{12} = (1/e)^{12} = 0.000006144

That is, while both are very unlikely, the probability of no quintuplet births in a full year is equal to that of exactly one quintuplet birth per month.

The probability of 3 quintuplet births in a month is
1/(3! e) = 0.0613; of 4, 1/(4! e) = 0.0153; and of 5, 1/(5! e)
= 0.0031.

3. Now consider the probability that all the quintuplet births for a full year occur in just one month, while the remaining 11 months have none. "All" here allows that there might be one, two, three, four or more quintuplet births in a single month, although the probabilities of higher numbers rapidly become quite small.

Jan | Feb | Mar | Apr |

May | Jun | Jul | Aug |

Sep | Oct | Nov | Dec |

The more precise calculation given below shows that the probability of all quintuplet births for the year localized to just one month is:

Probability(all quintuplet births in one
month of the year)

= 12(e-1)(1/e)^{12} = 20.62(1/e)^{12}

= 20.62 x Probability(exactly one quintuplet birth in each month of the year)

That is, having all the year's quintuplet births concentrated in just one month is over 20 times more probable than having one quintuplet birth per month, spread out over the full year. This is a striking result and, I expect, remote from what we would expect.

To be precise, the probability of 12 births, all
concentrated in one month, is e^{(-12)}/11! = 0.0000000000001539

Now I will admit to a
misdirection in this striking result. The result does not say
what it may first seem to say. The figure shows all 12 of the year's
births concentrated in one month, June. That is highly misleading. That
there are 12 quintuplet births in one year, concentrated in just one month
is very unlikely. The bulk of the probability computed above of 20.62(1/e)^{12}
comes from a few cases of very few quintuplet births in the year.

For example:

Probability (one quintuplet birth in January, none
in the remaining 11 months)

= (1/e) x (1/e)^{11} = (1/e)^{12}

If we are interested in the outcome that there is exactly one quintuplet birth in the year, then it could happen in any of the 12 months of the year. So we have:

Probability (one quintuplet birth in some month,
none in the remaining 11 months)

= 12 (1/e)^{12}

This one case, then, is
responsible for more than half--12/20.62--of the probability
20.62(1/e)^{12}. The next case, of two births in a year concentrated
in one month, supplies over a quarter of the probability 20.62(1/e)^{12}.
That is, it supplies a fraction 6/20.62.
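The breakdown by number of births can be tabulated with a few lines of Python. This is only a sketch: it takes the formula 12 e^{-12}/n! from the note for experts below as given, and the function name `p_all_in_one_month` is an illustrative choice.

```python
import math

def p_all_in_one_month(n):
    """Probability that the year has exactly n births, all in one month (any of the 12)."""
    return 12 * math.exp(-12) / math.factorial(n)

# Total probability that all of the year's births fall in a single month.
total = 12 * (math.e - 1) * math.exp(-12)

# Share of the total contributed by years with n = 1, 2, 3, ... births.
for n in range(1, 6):
    share = p_all_in_one_month(n) / total
    print(f"n = {n}: share of total = {share:.3f}")
```

The shares fall off factorially, so years with only one or two births account for most of the total.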

*For experts.* Here
is the probability theory behind the numbers reported above. The arrival
of quintuplet births is a Poisson process in which the probability of a
quintuplet birth in some small time interval dt is λdt; and the
probabilities of quintuplet births at different times are independent.
These conditions lead to the probability of n births over time t being a
Poisson distribution, given by

P(n,t) = e^{-λt}(λt)^{n}/n!

The mean of this distribution is λt.

For the case at hand, we
choose the unit of time to be the month and λ=1, since on average we have
one quintuplet birth a month. Setting t=1, we compute for one month:

Probability (no births in one month) = P(0,1) = e^{-1}(1)^{0}/0!
= e^{-1}

Probability (one birth in one month) = P(1,1) =
e^{-1}(1)^{1}/1! = e^{-1}

Probability (n births in one month) = P(n,1) = e^{-1}(1)^{n}/n!
= e^{-1}/n!

It follows that the probability of no quintuplet births in the full
year and the probability of exactly one quintuplet birth per month are both e^{-12}.

The probability of n births
in January and none in the remaining months is:

Probability of n births in January x Probability no births in February to
December

= P(n,1) x P(0,1)^{11} = e^{-1}/n! x e^{-11} = e^{-12}/n!

Therefore the probability of all n births in just one month of the year,
where that may be any month from January to December, is

= 12 e^{-12}/n!

To find the probability that all the births are localized to just one
month of the year, we need to sum over all possible numbers of
births, n = 1, 2, 3, 4, ... The resulting probability is:

12 e^{-12}(1/1! + 1/2! + 1/3! + 1/4! +
...) = 12(e-1) e^{-12}

which is the result reported in the text above. The summation relies on
the expression for e:

e = 1 + 1/1! + 1/2! + 1/3! + 1/4! + ...

There is a simple
derivation of the Poisson distribution. It is based on dividing the large
time interval t of the distribution into N equal time intervals, t/N,
where N >> n. The probability of a quintuplet birth in one time
interval t/N is λt/N. Hence, using independence, the probability that
there is one quintuplet birth in each of the first n of these intervals
and none in the rest is:

Probability of one quintuplet birth in each of the first n
intervals

x Probability of no quintuplet births in the remaining N-n intervals

= (λt/N)^{n} x (1-λt/N)^{N-n}

Since there are C(N,n) = N!/((N-n)!n!) ways of distributing n births over
the N time intervals, the probability of n quintuplet births over the N
time intervals is

C(N,n) (λt/N)^{n} x (1-λt/N)^{N-n}
= N!/((N-n)!n!) (λt/N)^{n} x (1-λt/N)^{N-n}

Now recall that N>>n, so we have for very large N

N!/((N-n)!n!) = N x (N-1) x (N-2) x ... x (N-n+1)
/n! ≈ N^{n}/n!

We also have for very large N

(1-λt/N)^{N-n} = [(1-λt/N)^{(N-n)/λt}]^{λt}
≈ [e^{-1}]^{λt} = [e^{-λt}]

The approximations become exact in the limit of infinitely large N.
Combining we have:

P(n,t) = N^{n}/n! x (λt/N)^{n} x
[e^{-λt}] = e^{-λt}(λt)^{n}/n!
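The limit in this derivation can also be checked numerically. The sketch below compares the binomial probability for N subintervals with the Poisson value as N grows. The function names are illustrative, and the choice of n = 2 births with λt = 1 is an arbitrary test case.

```python
import math

def binomial_pmf(n, N, p):
    """Probability of n births over N intervals, each with birth probability p."""
    return math.comb(N, n) * p**n * (1 - p)**(N - n)

def poisson_pmf(n, mean):
    """Limiting Poisson probability e^{-mean} mean^n / n!."""
    return math.exp(-mean) * mean**n / math.factorial(n)

rate, t, n = 1.0, 1.0, 2   # λ = 1 per month, t = 1 month, n = 2 births
limit = poisson_pmf(n, rate * t)

for N in (10, 100, 10_000):
    approx = binomial_pmf(n, N, rate * t / N)
    print(f"N = {N:>6}: binomial = {approx:.6f}, Poisson = {limit:.6f}")
```

The gap between the two columns shrinks as N increases, in keeping with the approximations becoming exact for infinitely large N.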

In the case of quintuplet births, the paradox arises from a failure to recognize just how independence manifests. In Simpson's paradox, we see something of a reverse of this problem. A spurious impression arises precisely because we assume independence, when it is not present. That is, we assume from aggregated data that a statistical trend acts in one direction. Yet the real trend is in the opposite direction; and we have failed to see it since we assume tacitly and erroneously that the aggregation of the data was carried out in a manner that respected independence over subcategories.

The example of Berkeley admissions is given in P. J.
Bickel, E. A. Hammel, J. W. O'Connell, "Sex Bias in Graduate Admissions:
Data from Berkeley," *Science*. **187** (1975), pp.
398-404.

The difficulty known as "Simpson's paradox" has long been recognized in the statistics literature, prior to Simpson's 1951 paper. There are many examples of it. One of the most cited concerns gender bias in admissions decisions to the University of California, Berkeley, in 1973. Other familiar cases involve the efficacy of various different medications.

To give a simple account of it, I will review here
an entirely invented example that uses simple numbers. We compare
two treatments for some illness, one is *strong* and the
other *weak*. We find a puzzling result in our aggregated data.
The weaker treatment, *weak*, has a higher probability of
successful treatment, *cure*, than the stronger treatment. These
data, reported as probabilities P are

P (*cure | strong*) = 0.54

P (*cure | weak*) = 0.66

This poor performance of *strong* against *weak*
is puzzling since *strong* is known always
to outperform *weak*. Patients are divided according to whether
their illness is *mild* or *severe*. Among the severely
ill patients, *strong* performs better than *weak.*

P (*cure | strong & severe*) = 0.5 > P (*cure | weak & severe*) = 0.3

*Strong* also outperforms *weak* in
the cases of patients who are mildly ill:

P (*cure | strong & mild*) = 0.9 > P (*cure | weak & mild*) = 0.7

This presents a puzzle.
How is it possible that *strong* performs better than *weak*
in each of the two cases individually, but *strong* performs worse
than *weak* in the aggregated data?

The answer is that our puzzlement derives from a hidden
assumption of independence. We are tacitly assuming that both
treatments are given in equal proportion to both *mild* and *severe*
patients. That is, the probability that a patient given *strong* is mildly
ill is the same as the probability that a patient given *weak* is mildly ill.
The corresponding assumption is made for the severely ill patients:

P (*mild | strong*) = P (*mild | weak*)

P (*severe | strong*) = P (*severe | weak*)

With this independence
assumption, the better performance of *strong* in each of
the two subcategories of patients is reflected in the aggregated
performance. To see how this works, take the simple case in which patients
receiving either treatment are equally likely to be mildly or severely ill:

P (*mild | strong*) = P (*mild | weak*) = 0.5

P (*severe | strong*) = P (*severe | weak*) = 0.5

These 0.5 probabilities weight the performance of the two treatments in the aggregation of the data. That is, using the law of total probability:

P( *cure* | *strong* ) = P( *cure* | *strong & severe* ) x P( *severe* | *strong* )
+ P( *cure* | *strong & mild* ) x P( *mild* | *strong* )

= 0.5 x 0.5 + 0.9 x 0.5

= 0.7

P( *cure* | *weak* ) = P( *cure* | *weak & severe* ) x P( *severe* | *weak* )
+ P( *cure* | *weak & mild* ) x P( *mild* | *weak* )

= 0.3 x 0.5 + 0.7 x 0.5

= 0.5

That is, we find that *strong*
performs better than *weak*, P( *cure* | *strong*
) > P( *cure* | *weak* ), as expected.

This assumption of independence is unrealistic. We would expect that the stronger treatment would be given to the more severely ill patients, for they need its strength. Plausibly, the stronger treatment may carry other risks of side effects, so that it is only given when needed. In such circumstances, independence in the assignments of the treatments will fail. Rather we would tend to assign the more severely ill the stronger treatment and the mildly ill the weaker treatment. A regime of treatment assignment that distributes the treatments this way might be:

P (*mild | strong*) = 0.1     P (*mild | weak*) = 0.9

P (*severe | strong*) = 0.9     P (*severe | weak*) = 0.1

As above, these probabilities are then used to weight the performance of the two treatments to give an aggregated assessment of their relative effectiveness:

P( *cure* | *strong* ) = P( *cure* | *strong & severe* ) x P( *severe* | *strong* )
+ P( *cure* | *strong & mild* ) x P( *mild* | *strong* )

= 0.5 x 0.9 + 0.9 x 0.1

= 0.54

P( *cure* | *weak* ) = P( *cure* | *weak & severe* ) x P( *severe* | *weak* )
+ P( *cure* | *weak & mild* ) x P( *mild* | *weak* )

= 0.3 x 0.1 + 0.7 x 0.9

= 0.66

That is, in aggregate, *strong*
performs worse than *weak*, P( *cure* | *strong*
) < P( *cure* | *weak* ). These are the probabilities
reported above at the outset of the account of Simpson's paradox.
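Both aggregations can be reproduced with one small Python function that applies the law of total probability. This is a sketch using the invented numbers of the example; the names `CURE_RATE`, `aggregate_cure_rate`, `independent` and `dependent` are illustrative labels, not terms from the account above.

```python
# Cure probabilities within each subcategory, taken from the invented example.
CURE_RATE = {
    "strong": {"severe": 0.5, "mild": 0.9},
    "weak": {"severe": 0.3, "mild": 0.7},
}

def aggregate_cure_rate(treatment, p_severe):
    """Law of total probability: weight each subcategory by its share of patients."""
    rates = CURE_RATE[treatment]
    return rates["severe"] * p_severe + rates["mild"] * (1 - p_severe)

# P(severe | treatment) under the two assignment regimes discussed in the text.
independent = {"strong": 0.5, "weak": 0.5}
dependent = {"strong": 0.9, "weak": 0.1}

for label, regime in (("independent", independent), ("dependent", dependent)):
    strong = aggregate_cure_rate("strong", regime["strong"])
    weak = aggregate_cure_rate("weak", regime["weak"])
    print(f"{label}: strong {strong:.2f}, weak {weak:.2f}")
```

The subcategory cure rates are identical in both runs. Only the weights change, and that alone is enough to reverse the ordering of the aggregated figures.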

Once we recognize the failure of independence in
assigning treatments to mildly and severely ill patients, this aggregated
performance is no longer puzzling. The treatment *strong* is disproportionately
given to the more severely ill patients where any treatment is
likely to be less effective. If this failure of independence is
overlooked, we arrive at the mistaken impression that the strong treatment
is less effective.

Once you have seen how Simpson's paradox arises, it is not too hard to find many cases that operate in the same way. That is, we have some process that operates as expected in all specific cases. However, if we aggregate the statistics over all the cases, the process appears to run in the opposite, unexpected direction.

Here are a few examples. In each, a hidden correlation accounts for the misleading aggregate.

The hidden correlation lies in the prudent practice of smaller hospitals. If a patient is gravely ill, the patient will be sent to a larger hospital where better treatment can be provided. Thus larger hospitals tend to have more gravely ill patients and thus more mortality.

The hidden correlation is that people who do not have antecedent problems sleeping tend not to take sleep medication, so they automatically have less trouble sleeping.

The hidden correlation is that cars that have recently had repair work done are likely to be older and thus more prone to breakdowns.

The hidden correlation is that snow tires tend to be fitted to cars only in snowy winter months and in states with colder weather. Cars in warmer months and in states with warmer weather are much less likely to have accidents in snowy and icy conditions.

In the illustration of Simpson's paradox, there
were two sorts of probabilities employed: P( *cure* | *strong
& severe* ) and P( *severe | strong *). How should we
interpret each? The first pertains to the efficacy of a treatment and the
second to the distribution of treatment over patients. Are they
sufficiently similar in meaning so that we can combine them in one
calculation?

The presumption is that, whenever our intuitions and the probability calculus disagree, then our intuitions are wrong. Is this presumption correct? What justifies it?

August 10, November 17, 25, 2021. May 5, 2022. April 7, 2023.

Copyright, John D. Norton