Paradoxes From Probability Theory
John D. Norton
Department of History and Philosophy of Science
University of Pittsburgh
For a compact reminder of probability theory, see Probability Theory Refresher.
Probability theory provides a quantitative calculus for dealing with chance and uncertainty. It is one of the most successful analytic tools available to us and is often called upon to correct misconceptions about chance. These corrections have an air of paradox to them, since they give results that are, at least initially, quite unexpected, and they have a strong presence in the paradox literature. They are not paradoxes of the type that reveals a contradiction in our presuppositions. Rather, they are paradoxes in the sense that they present us with unexpected results. This chapter presents a brief sampling of them.
Since the paradoxes arise through corrections FROM probability theory TO common misconceptions, they belong in the "from" chapter. Subsequent chapters will investigate a reversed problem. While the probability calculus can be applied profitably to a very large array of problems, there are some circumstances in which it fails. These are presented as paradoxes FOR probability theory.
The paradoxes from probability theory can be categorized loosely into those that arise from improper assessments of mutual exclusivity and those that arise from improper assessments of probabilistic independence. Examples of each are given in this chapter.
The commonality in these paradoxes is that they involve a failure to consider all the mutually exclusive outcomes that comprise the outcome space, in a space in which all the most specific, mutually exclusive outcomes have the same probability. Rather, two or more mutually exclusive cases are treated as one case and thus their probabilities are underestimated.
The following paradox, already described in the Budget of Paradoxes, involves just this mistake concerning mutually exclusive outcomes. Here is the paradox again:
At a carnival sideshow, you are invited to play the following game. There are three cards. One is blue on both sides. One is red on both sides. One is blue on one side and red on the other.
To play the game, you are allowed to shuffle and flip the cards without looking until they could be in any order with any side up.
A card is drawn and it is red. You are offered an even odds bet that the other side of the card is blue.
It seems a fair bet. The color of the other side of the card is either red or blue, so each has (it would seem) an equal chance. That is:
(??) The probability that the color of the other side of the card is blue is 1/2. (??)
This is incorrect and overestimates the chance that the other side is blue, which then inclines us to an unfavorable wager. The ease with which we fall into this error makes it easy for the swindle to occur.
The error in the analysis is that the set of mutually exclusive outcomes has not been assessed correctly. When the card is drawn, there are three mutually exclusive outcomes possible:
The card is red-blue and the red side is uppermost.
The card is red-red and side-1 is uppermost.
The card is red-red and side-2 is uppermost.
Here the two sides of the red-red card are imagined to be explicitly numbered, side-1 and side-2, so that the three possible, mutually exclusive outcomes can be told apart.
The shuffling and flipping described above ensures that each of these outcomes has equal probability. In two of the three outcomes, the other side of the card is red; in only one is the other side blue. Since these three outcomes are equally probable, it follows that
The probability that the color of the other side of the card is blue is 1/3.
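The three-outcome count can be checked mechanically. Below is a minimal Python sketch, offered as an illustration rather than as part of the original argument: it enumerates the six equally probable card-and-side combinations, keeps those that show red, and counts how often the hidden side is blue; it then cross-checks the count with a seeded random simulation of the shuffle. The names `CARDS`, `prob_other_side_blue` and `simulate_other_side_blue` are made up for this example.

```python
import random
from fractions import Fraction

# Each card is a pair of side colours. The red-red card has two physically
# distinct sides (side-1 and side-2), even though both are red.
CARDS = [("blue", "blue"), ("red", "red"), ("red", "blue")]

def prob_other_side_blue():
    """P(hidden side blue | visible side red), counting the six equally
    probable (card, side-up) outcomes."""
    red_up = red_up_blue_under = 0
    for card in CARDS:
        for up in (0, 1):                 # which side faces up
            visible, hidden = card[up], card[1 - up]
            if visible == "red":
                red_up += 1
                red_up_blue_under += hidden == "blue"
    return Fraction(red_up_blue_under, red_up)

def simulate_other_side_blue(trials=100_000, seed=0):
    """Estimate the same probability by random draws (seeded for repeatability)."""
    rng = random.Random(seed)
    shown_red = blue_under = 0
    for _ in range(trials):
        card = rng.choice(CARDS)
        up = rng.randrange(2)             # random flip
        if card[up] == "red":
            shown_red += 1
            blue_under += card[1 - up] == "blue"
    return blue_under / shown_red

print(prob_other_side_blue())             # 1/3
print(simulate_other_side_blue())
```

The simulated estimate lands close to 1/3, not 1/2, so an even-odds bet on blue loses in the long run.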
Another common formulation of what is essentially the same paradox concerns a two-child family. The essential background assumptions are that:
probability of boy P(B) = probability of girl P(G) = 1/2
and that the gender of each child is probabilistically independent of the genders of the other children in the family.
Until we have more information that restricts the possibilities, the outcome space comprises four mutually exclusive outcomes:
GG, GB, BG, BB
where "GG" = "first child is a girl; second child is a girl" etc.
This idealized set of assumptions could equally be realized with independent coin tosses.
We are asked two questions:
"one": One of the children is a girl, G. What is the probability that the other is a girl, G?
"first": The first child is a girl, G. What is the probability that the second is a girl, G?
It is easy to assume that both questions have the same answer. For in both cases, the other child might be a boy, B, or a girl, G. These two cases then suggest a probability of 1/2 for each. That is:
(??) In both "one" and "first," the probability of a girl, G, is 1/2. (??)
Once again, this conclusion is mistaken. The two questions each lead to different sets of possible, mutually exclusive outcomes and thus different reduced outcome spaces.
In the case of "one," we know that one of the children is a girl. We are not told which is the girl. It could be either the first or the second child. Hence there are three remaining mutually exclusive outcomes possible:
GG, GB, BG
Each has equal probability. In two of these outcomes, the other child is a boy; and in only one is the other child a girl. So we arrive at the correct result:
In "one," the probability of a girl, G, is 1/3.
The case of "first" leads to a different set of mutually exclusive outcomes that are still possible. They are:
GG, GB
Each has equal probability and in only one of them is the other, second child a girl. So we have a different result from "one":
In "first," the probability of a girl, G, is 1/2.
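A quick simulation makes the difference between "one" and "first" concrete. This Python sketch works under the idealized assumptions above (each child independently a girl or a boy with probability 1/2): it generates random two-child families and tallies the two conditional frequencies. It reads "one of the children is a girl" as "at least one child is a girl", which is how the three-outcome space GG, GB, BG treats it. The function name is hypothetical.

```python
import random

def simulate_two_child(trials=100_000, seed=1):
    """Estimate the answers to "one" and "first" from random two-child
    families with independent, equiprobable genders."""
    rng = random.Random(seed)
    one_total = one_girl = first_total = first_girl = 0
    for _ in range(trials):
        a, b = rng.choice("GB"), rng.choice("GB")   # first child, second child
        if "G" in (a, b):                 # "one": at least one child is a girl
            one_total += 1
            one_girl += (a == "G" and b == "G")     # the other is also a girl
        if a == "G":                      # "first": the first child is a girl
            first_total += 1
            first_girl += (b == "G")
    return one_girl / one_total, first_girl / first_total

print(simulate_two_child())
```

The two estimates settle near 1/3 and 1/2 respectively, although every simulated family is generated in exactly the same way; only the conditioning differs.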
A more striking difference between questions like "one" and "first" appears if we increase the size of the family. The revised questions for a ten-child family are:
"nine": Nine of the children are girls, G. What is the probability that the remaining child is a girl, G?
"first nine": The first nine children are girls, G. What is the probability that the tenth is a girl, G?
As before, we should resist the temptation to say that in each case the probability of girl, G, is the same: 1/2. The two cases are seen to be very different if we enumerate the mutually exclusive possibilities each allows.
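In the case of "nine," there are eleven mutually exclusive possibilities that form the reduced outcome space. Using the notation above, they are:
BGGGGGGGGG, GBGGGGGGGG, GGBGGGGGGG, GGGBGGGGGG, GGGGBGGGGG, GGGGGBGGGG,
GGGGGGBGGG, GGGGGGGBGG, GGGGGGGGBG, GGGGGGGGGB, GGGGGGGGGG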
In only one of these eleven mutually exclusive outcomes is the remaining child a girl. Since these are equally probable, we have:
In "nine," the probability of a girl, G, is 1/11.
The case of "first nine" is different. There are only two remaining, mutually exclusive outcomes possible:
GGGGGGGGGB, GGGGGGGGGG
In one of these two cases, the tenth child is a girl. Since these mutually exclusive outcomes are equally probable, we have:
In "first nine," the probability of a girl, G, is 1/2.
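The counting for the ten-child family can be reproduced by listing all 2^10 = 1024 equally probable families and restricting to the reduced outcome spaces. The following Python sketch does this for any family size; as in the text, "nine of the children are girls" is read as "at least nine are girls" (which is what yields eleven outcomes), and "first nine" as "the first n - 1 children are girls". The function name and its return layout are illustrative choices.

```python
from fractions import Fraction
from itertools import product

def remaining_child_girl(n_children):
    """For a family of n_children (each child independently G or B with
    probability 1/2), return the two reduced outcome spaces and, for each,
    the probability that the remaining child is a girl."""
    families = ["".join(kids) for kids in product("GB", repeat=n_children)]
    # "nine"-style question: at least n-1 girls, positions unspecified.
    any_space = [f for f in families if f.count("G") >= n_children - 1]
    p_any = Fraction(sum(f.count("G") == n_children for f in any_space), len(any_space))
    # "first nine"-style question: the first n-1 children are girls.
    first_space = [f for f in families if f[:-1] == "G" * (n_children - 1)]
    p_first = Fraction(sum(f[-1] == "G" for f in first_space), len(first_space))
    return any_space, p_any, first_space, p_first

any_space, p_any, first_space, p_first = remaining_child_girl(10)
print(len(any_space), p_any)      # 11 1/11
print(len(first_space), p_first)  # 2 1/2
```

Calling it with `n_children=2` recovers the two-child answers, 1/3 for "one" and 1/2 for "first".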
We might try to conceive these last paradoxes as arising from errors in judgments of independence. It may seem, for example, that the outcome of nine children being girls and the outcome of the remaining child being a girl are two dependent outcomes, and that the error is to treat them as independent. This is a weaker diagnosis. The difficulty is that the second outcome in this analysis (the other child is a girl) is not an outcome well-defined prior to specification of the first outcome (that nine children are girls). That imprecision is the deeper cause of the problem, and it is better remedied by a more careful description of the mutually exclusive outcomes of the reduced outcome space.
While we likely believe that we each have a clear grasp of how independent outcomes behave, it is common for us to be mistaken. We know that the successive throws of a six-sided die and the successive tosses of a fair coin are independent. Yet if we are asked to simulate what a sequence of these independent outcomes would look like, we do poorly at the task. We tend to include too many alternations of, say, heads and tails; and not enough successive heads or successive tails.
Here is a sequence of random heads and tails, resulting from a good simulation of independent coin tosses:
It likely looks suspect. It starts with several long runs of heads, HHH, HHH, and HHHH; and then a longer run of tails, TTTTT. Five tails in a row happens with a probability of $1/2^5 = 1/32$, which is a small probability. Has this sequence really been produced by a proper randomization with independent outcomes, we may wonder.
Yet this sequence is a good simulation of random coin tosses. It arises from taking the fractional part of the decimal expansion of π = 3.1415926535897932384626433... and replacing odd digits with "H" and even digits with "T."
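The odd/even rule can be applied mechanically, and a few more lines show why long runs should not surprise us. This is a Python sketch under stated assumptions: `PI_DIGITS` holds only the 25 fractional digits of π quoted above, and `prob_run_at_least` is an illustrative helper (not from the original text) that computes, by a small recursion over the current run length, the exact chance that n independent fair tosses contain a run of at least k identical outcomes, heads or tails. A brute-force enumeration cross-checks it for small n.

```python
from fractions import Fraction
from itertools import product

PI_DIGITS = "1415926535897932384626433"  # the fractional digits of pi quoted above

def digits_to_tosses(digits):
    """Odd digit -> "H", even digit -> "T", the rule described in the text."""
    return "".join("H" if int(d) % 2 else "T" for d in digits)

def longest_run(seq, face=None):
    """Length of the longest block of identical consecutive symbols,
    counting only blocks of `face` if it is given."""
    best = cur = 0
    prev = None
    for c in seq:
        cur = cur + 1 if c == prev else 1
        prev = c
        if face is None or c == face:
            best = max(best, cur)
    return best

def prob_run_at_least(n, k):
    """Exact probability that n >= 1 fair, independent tosses contain a run
    of at least k >= 2 identical outcomes (heads or tails)."""
    dist = {1: Fraction(1)}  # current run length -> probability (no k-run yet)
    hit = Fraction(0)        # probability that a k-run has already occurred
    for _ in range(n - 1):
        new = {}
        for r, p in dist.items():
            if r + 1 >= k:
                hit += p / 2                          # same face again: run reaches k
            else:
                new[r + 1] = new.get(r + 1, 0) + p / 2
            new[1] = new.get(1, 0) + p / 2            # other face: a new run starts
        dist = new
    return hit

def prob_run_brute(n, k):
    """The same probability by listing all 2**n sequences (small n only)."""
    hits = sum(longest_run(s) >= k for s in product("HT", repeat=n))
    return Fraction(hits, 2 ** n)

print(digits_to_tosses(PI_DIGITS))
print(float(prob_run_at_least(25, 5)))
```

For n = 25 and k = 5 the exact probability comes out at about 0.55. So although any particular five tosses are all tails with probability only 1/32, a run of five or more identical outcomes somewhere in 25 tosses is more likely than not.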
These sorts of ordinary difficulties in grasping independence lead to some celebrated paradoxes.
The gambler's fallacy is perhaps the best known example of how ordinary ideas about independence can lead us to mistaken results. In any game of chance, if the successive outcomes are independent, we expect that the frequencies of outcomes will come to match their probabilities. This is an idea made more precise in the laws of large numbers.
We expect that a six will appear in roughly one sixth of repeated tosses of a fair die. We expect a red will appear roughly as often as black on fair roulette spins, where the numbers are coded as red and black.
Figure: a roulette wheel (outer rim from https://commons.wikimedia.org/wiki/File:Basic_roulette_wheel.svg)
Since 0 and 00 are neither red nor black, the probability of red is the same as the probability of black, 18/38 = 0.474. We expect both red and black to appear with roughly this frequency in repeated spins.
What if we see a long run of red, so that the frequency of reds has risen well above that of blacks? If we think in accord with the gambler's fallacy, we take it that the balance between the frequencies of red and black must be restored. We would then conclude that there will be relatively more blacks in the spins that follow. For otherwise, we would fallaciously infer, the averages could not be maintained.
Of course this is a fallacy. A roulette wheel is a simple mechanical device that has no memory of what has happened in the past. The probability of a black on the next spin is completely unaffected by whatever may have happened in past spins, for the outcome of each spin is independent of those before and after it.
Indeed, if one pays attention to the gambling devices in a casino, the prevalence of the gambler's fallacy becomes mysterious. For casinos go to great trouble to reassure gamblers that the reds and blacks upon which they bet are generated by a system whose operation is completely transparent to the gambler. It is simply a well-balanced wheel onto which a ball is thrown while the wheel spins. It is a mechanism in which past and future spins can have no influence on present spins.
How is it that the averages do come to behave well in the long run? This long run behavior is the result of the weight of all the later spins suppressing any deviation in the earlier spins. To see how this works, consider a greatly simplified example in which red and black each have probability 1/2. We might have an improbable run of 10 reds in a row. It has a probability of $1/2^{10} = 1/1024$.
Then we imagine that it is followed by a well-behaved alternation of red-black-red-...
These subsequent outcomes will eventually outweigh the improbable run of red.
After 10 added red-black pairs, the frequency of red drops to (10+10)/(10 + 20) = 2/3 = 0.667.
After 100 added red-black pairs, the frequency of red drops to (10+100)/(10 + 200) = 110/210 = 0.524.
After 1000 added red-black pairs, the frequency of red drops to (10+1000)/(10 + 2000) = 1010/2010 = 0.502.
A roulette wheel needs no memory to return frequencies to the averages expected. It just needs patience to allow future spins to outweigh the outliers.
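The arithmetic of this dilution, and the fact that it needs no memory, can be sketched in a few lines of Python. `red_frequency` reproduces the idealized red-black alternation above; `simulate_after_run` (a hypothetical helper) instead follows the 10 opening reds with independent spins, treating red as having probability 1/2 as the simplified example does, that is, ignoring 0 and 00. Nothing in the simulation looks back at earlier spins.

```python
import random
from fractions import Fraction

def red_frequency(initial_reds, pairs):
    """Frequency of red when a run of `initial_reds` reds is followed by
    `pairs` red-black pairs (the idealized alternation in the text)."""
    return Fraction(initial_reds + pairs, initial_reds + 2 * pairs)

def simulate_after_run(initial_reds=10, spins=100_000, p_red=0.5, seed=2):
    """Frequency of red when the opening run is followed by independent,
    memoryless spins, each red with probability p_red."""
    rng = random.Random(seed)
    later_reds = sum(rng.random() < p_red for _ in range(spins))
    return (initial_reds + later_reds) / (initial_reds + spins)

for pairs in (10, 100, 1000):
    f = red_frequency(10, pairs)
    print(pairs, f, round(float(f), 3))
print(simulate_after_run())
```

The simulated frequency ends up close to 1/2 without any compensating excess of blacks: the ten early reds are simply outweighed by the mass of later spins.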
The gambler's fallacy has long been recognized in the literature on probability. Laplace's 1814 "Essai philosophique sur les probabilités" has a chapter devoted to the common mistakes concerning probabilities (Ch. XVI, "Concerning Illusions in the Estimation of Probabilities"). His versions of the paradox include the following.
Among versions of the paradox is the supposition that, after nine heads are thrown in succession, the tenth will be tails:
"It is, for example, very improbable that at the play of heads and tails one will throw heads ten times in succession. This improbability which strikes us indeed when it has happened nine times, leads us to believe that at the tenth throw tails will be thrown." (p. 162)
Another concerns the drawing of lottery numbers on successive lotteries. The mistaken belief is that a number that hasn't been drawn in the past becomes more probable in the future:
"A similar illusion persuades many people that one can certainly win in a lottery by placing each time upon the same number, until it is drawn, a stake whose product surpasses the sum of all the stakes." (p. 162)
Another version concerns the gender of children born in a month. Assuming that the genders of different births are independent, akin to different coin tosses, we still expect the ratio of boys to girls to be roughly even. Someone eager for a son will be alarmed to learn that the births in some month are initially skewed toward boys. For surely, the reasoning goes, the later births will be skewed towards girls to even out the averages:
"I have seen men, ardently desirous of having a son, who could learn only with anxiety of the births of boys in the month when they expected to become fathers. Imagining that the ratio of these births to those of girls ought to be the same at the end of each month, they judged that the boys already born would render more probable the births next of girls." (p. 162)
Laplace also reports the reverse of the gambler's fallacy. If one finds an unexpected run of repeated heads in coin tosses or of red in roulette, one might imagine that somehow this run will persist, so that one should expect more heads or more reds. Laplace reports the error in the context of lotteries:
"By an illusion contrary to the preceding ones one seeks in the past drawings of the lottery of France the numbers most often drawn, in order to form combinations upon which one thinks to place the stake to advantage. But when the manner in which the mixing of the numbers in this lottery is considered, the past ought to have no influence upon the future. The very frequent drawings of a number are only the anomalies of chance; I have submitted several of them to calculation and have constantly found that they are included within the limits which the supposition of an equal possibility of the drawing of all the numbers allows us to admit without improbability." (pp. 163-64)
The inference might not be mistaken if there were some possibility that the coin tossed is weighted towards heads, the roulette wheel imbalanced towards red, or the lottery drawing biased. Then a gambler could use the history of outcomes to estimate the bias in the results, and this would give the gambler an advantage. However, the mistake persists, as Laplace points out, even when the mechanism of the drawing is made transparent to the gambler so that the absence of bias is obvious.
The gambler's paradox can come in many varieties which may not be so obviously the same paradox. Consider a factory with machines that make widgets. These widgets must be made to quite exacting tolerances and these standards strain what the machines can do. They produce in-specification and out-of-specification widgets, each with a probability of 1/2; and the probabilities of each on subsequent runs of the machine are independent.
The factory owner is disturbed by the large number of out-of-specification widgets being produced. He decides on a radical policy to increase the number of in-specification widgets:
Each machine will keep producing widgets until it has produced an in-specification widget. It will then halt production for the day.
The owner reasons that with this policy each machine must produce one in-specification widget each day, while it may produce no out-of-specification widgets at all if its first run yields an in-specification widget.
The policy may initially seem to be promising until we compute the expected number of each type of widget that each machine will produce in one day.
The expected number of in-specification widgets is fixed at one by the policy's requirement that the machine halt production with the first in-specification widget.
Expected (in-specification) = 1
The out-of-specification widgets require a more complicated computation. Writing "in" and "out" with obvious meanings, the possible outcomes for a machine on one day are:
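| outcome | probability | number of out x probability |
|---|---|---|
| in | 1/2 | 0 x 1/2 = 0 |
| out, in | 1/4 | 1 x 1/4 = 1/4 |
| out, out, in | 1/8 | 2 x 1/8 = 2/8 |
| out, out, out, in | 1/16 | 3 x 1/16 = 3/16 |
| out, out, out, out, in | 1/32 | 4 x 1/32 = 4/32 |
| ... | ... | ... |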
The expected number of out-of-specification widgets is found by summing the entries in the last column. The resulting infinite series is
0 + 1/4 + 2/8 + 3/16 + 4/32 + 5/64 + ... = 1 = Expectation (out-of-specification)
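To see this sum, break up terms like 2/8 = 1/8 + 1/8, 3/16 = 1/16 + 1/16 + 1/16, etc. Then rearrange the terms into many series, in each of which every term has 1 as its numerator. Each of these is a geometric series, and their sums form a geometric series in turn:
$$
\begin{aligned}
0 + \tfrac{1}{4} + \tfrac{2}{8} + \tfrac{3}{16} + \tfrac{4}{32} + \tfrac{5}{64} + \cdots
&= \left(\tfrac{1}{4} + \tfrac{1}{8} + \tfrac{1}{16} + \tfrac{1}{32} + \tfrac{1}{64} + \cdots\right) \\
&\quad + \left(\tfrac{1}{8} + \tfrac{1}{16} + \tfrac{1}{32} + \tfrac{1}{64} + \cdots\right) \\
&\quad + \left(\tfrac{1}{16} + \tfrac{1}{32} + \tfrac{1}{64} + \cdots\right) \\
&\quad + \left(\tfrac{1}{32} + \tfrac{1}{64} + \cdots\right) + \cdots \\
&= \tfrac{1}{2} + \tfrac{1}{4} + \tfrac{1}{8} + \tfrac{1}{16} + \cdots = 1
\end{aligned}
$$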
The result is that, as far as the expectations are concerned, there will be no advantage. The expectation of in-specification and out-of-specification widgets is the same, one. That is, on average, the production of the factory will produce just as many in-specification widgets as out-of-specification widgets.
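Both halves of the argument can be checked numerically. In this Python sketch (illustrative names; it assumes, as above, that each run is independently in-specification with probability 1/2), `expected_out_partial` adds up the first terms of the series 0 + 1/4 + 2/8 + 3/16 + ... exactly, and `simulate_policy` runs the owner's stop-at-the-first-good-widget rule over many machine-days.

```python
import random
from fractions import Fraction

def expected_out_partial(terms):
    """Exact partial sum of 0/2 + 1/4 + 2/8 + 3/16 + ...: the k-th term,
    k / 2**(k+1), is k out-of-spec widgets times the probability of
    k "out" runs followed by one "in" run."""
    return sum(Fraction(k, 2 ** (k + 1)) for k in range(terms))

def simulate_policy(machine_days=200_000, p_in=0.5, seed=3):
    """Apply the stop-at-first-in-spec policy each machine-day and return the
    average numbers of in-spec and out-of-spec widgets per machine-day."""
    rng = random.Random(seed)
    total_in = total_out = 0
    for _ in range(machine_days):
        while rng.random() >= p_in:   # out-of-spec widget: keep producing
            total_out += 1
        total_in += 1                 # the in-spec widget that ends the day
    return total_in / machine_days, total_out / machine_days

print(float(expected_out_partial(30)))
print(simulate_policy())
```

The partial sums approach 1 (after n terms the shortfall is exactly (n+1)/2^n), and the simulation gives exactly one in-specification widget per machine-day and close to one out-of-specification widget.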
This result did not need the big computation just made. If we recall that the results of the runs of a machine are independent of each other, then the result is a foregone conclusion. The error is the same one made in the gambler's fallacy: the past history of independent coin tosses, or of independent widget productions, has no effect on future outcomes.
The average numbers of in-specification and out-of-specification widgets produced by each machine on a single run are the same, since each has a probability of 1/2. These averages are unaffected by repeating the runs, even under the policy described, because the runs are independent. So on average the numbers of in-specification and out-of-specification widgets in the whole factory remain the same.
However there is one difference, not in the averages, but in the rates that give the average. Each machine now produces exactly one in-specification widget per day.
Each machine also produces a variable number of out-of-specification widgets: sometimes none, sometimes one, sometimes two or more. These variable numbers average out to one. The effect is that on some days there will be more in-specification widgets than out-of-specification widgets, and on other days fewer. But on average the two numbers will match.
In his Essay, Laplace gives a version of the same problem. His version has an infinite urn with equal numbers of white and black balls. People draw balls at random and use the factory owner's strategy to try to maximize the number of white balls drawn.
"Thus by imagining an urn filled with an infinity of white and black balls in equal number, and supposing a great number of persons each of whom draws a ball from this urn and continues with the intention of stopping when he shall have extracted a white ball, one has believed that this intention ought to render the number of white balls extracted superior to that of the black ones.
Indeed this intention gives necessarily after all the drawings a number of white balls equal to that of persons, and it is possible that these drawings would never lead a black ball.
But it is easy to see that this first notion is only an illusion; for if one conceives that in the first drawing all the persons draw at once a ball from the urn, it is evident that their intention can have no influence upon the color of the balls which ought to appear at this drawing. Its unique effect will be to exclude from the second drawing the persons who shall have drawn a white one at the first. It is likewise apparent that the intention of the persons who shall take part in the new drawing will have no influence upon the color of the balls which shall be drawn, and that it will be the same at the following drawings.
This intention will have no influence then upon the color of the balls extracted in the totality of drawings; it will, however, cause more or fewer to participate at each drawing. The ratio of the white balls extracted to the black ones will differ thus very little from unity. It follows that the number of persons being supposed very large, if observation gives between the colors extracted a ratio which differs sensibly from unity, it is very probable that the same difference is found between unity and the ratio of the white balls to the black contained in the urn." (pp. 168-69)
The paradoxes concerning independence reviewed so far are not so surprising once one reflects on the system considered. Here is one whose result is surprising, I believe, even after all the calculations are reviewed. It concerns our natural tendency to misinterpret independence. We tend to think of the outcomes of multiple independent trials as so free of one another that they spread out evenly in their values. The following paradox shows this tendency in our thinking.
In the US, quintuplet and higher order births are rare. In 2014, there were 47 quintuplet and higher births among 3,988,076 births. In 2016 there were 31 quintuplet and higher births among 3,945,875 births. In 2019, there were 36 quintuplet and higher births among 3,747,540 births.
To keep the numbers simple, let us imagine some country roughly a third the population of the US so that the rate of quintuplet births is 12 per year on average. We will model this probabilistically with two conditions:
• Each quintuplet birth is a probabilistically random event such that there will be 12 per year on average.
• Quintuplet births are probabilistically independent.
To get a sense of how our intuitions guide us, try to answer the following questions without reading ahead.
1. What is the probability that some year will have 12 quintuplet births, with exactly one in each of the 12 months of the year?
2. How does the probability of 1. compare with the probability that there will be no quintuplet births at all in some year?
3. How do the probabilities of 1. and 2. compare with the probability that all quintuplet births in one year (be they 1, 2, ..., 11, 12, 13, ...) occur in just one month, while the remaining eleven months of the year have no quintuplet births?
1. Let us take the first question. Since there are on average 12 quintuplet births a year, we expect one on average each month. Since the numbers can fluctuate, we would not expect exactly one per month in every year. However, it is plausible that exactly one per month is something that happens often enough not to be remarkable.
Now let us calculate. In any month, there may be no, one, two or more quintuplet births. It is then not so surprising that the probability of exactly one quintuplet birth in a single month is roughly 1/3. More precisely:
Probability(exactly one quintuplet birth in a month) = 0.3679 = 1/e
where e is the base of natural logarithms. To have a full year with one quintuplet birth in each month, this outcome must occur 12 times. Since the outcomes are independent, the probability is computed by multiplying together 12 of these probabilities:
$$\text{Probability(exactly one quintuplet birth in each month for a year)} = 0.3679^{12} = (1/e)^{12} \approx 0.000006144$$
That is, needless to say, a very small probability. It corresponds to the perfectly uniform year of quintuplet births arising about once in every $1/(1/e)^{12} = e^{12}$ ≈ 162,755 years. That is, the occurrence is very, very rare!
What this shows is that our natural, mistaken inclination is to think of independence as a kind of repellent. If the quintuplet births are independent, we might imagine that a quintuplet birth in one month somehow precludes or makes less likely another quintuplet birth in the same month. That repellent effect would then spread out the quintuplet births over the year fairly uniformly. This last calculation, however, shows just how wrong this repellent picture of independence is. That there has been a quintuplet birth in some month has no repellent effect on the occurrence of another quintuplet birth in the same month, nor, for that matter, any attractive effect either. Independence tells us that these births occur without any connection to others that may or may not occur.
2. Now consider the probability of no quintuplet births in a year.
Just as a month can have one quintuplet birth in it, there may be a month with one fewer (i.e. none), with a similar probability. A more precise calculation shows that the probability of none is the same as that of one quintuplet birth in a month, that is, roughly 1/3, or more precisely:
Probability(no quintuplet births in a month) = 0.3679 = 1/e
The calculation then proceeds as before and we find
$$\text{Probability(no quintuplet births for a year)} = 0.3679^{12} = (1/e)^{12} \approx 0.000006144$$
That is, while both are very unlikely, the probability of no quintuplet births in a full year is equal to that of exactly one quintuplet birth per month.
3. Now consider the probability that all the quintuplet births for a full year occur in just one month, while the remaining 11 months have none. "All" here allows that there might be one, two, three, four or more quintuplet births in a single month, although the probabilities of higher numbers rapidly become quite small.
The more precise calculation given below shows that the probability of all quintuplet births for the year being localized to just one month is:
$$12(e-1)\,(1/e)^{12} \approx 20.62\,(1/e)^{12} \approx 0.0001267$$
That is, having all the year's quintuplet births concentrated in just one month is over 20 times more probable than having one quintuplet birth per month, spread out over the full year. This is a striking result and, I expect, remote from what we would anticipate.
Now I will admit to a misdirection in this striking result. The result does not say what it may first seem to say. The figure shows all 12 of the year's births concentrated in one month, June. That is highly misleading. That there are 12 quintuplet births in one year, all concentrated in just one month, is very unlikely. The bulk of the probability $20.62\,(1/e)^{12}$ computed above comes from a few cases with very few quintuplet births in the year. The simplest is the case of a single quintuplet birth in the year. The probability that it falls in January, with none in the remaining 11 months, is:
$$\text{Probability(one quintuplet birth in January, none in the remaining 11 months)} = (1/e) \times (1/e)^{11} = (1/e)^{12}$$
If we are interested in the outcome that there is exactly one quintuplet birth in the year, then it could happen in any of the 12 months of the year. So we have:
$$\text{Probability(one quintuplet birth in some month, none in the remaining 11 months)} = 12\,(1/e)^{12}$$
This one case, then, is responsible for more than half, 12/20.62, of the probability $20.62\,(1/e)^{12}$. The next case, two births in the year concentrated in one month, supplies over a quarter of the probability $20.62\,(1/e)^{12}$; that is, it supplies a fraction 6/20.62.
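The three answers, and the breakdown by number of births, can be recomputed directly. This Python sketch assumes the Poisson model developed in the section for experts, with a mean of one quintuplet birth per month; the function names are illustrative, and the sum over the number of births is truncated at 30, where the remaining terms are far too small to matter.

```python
import math

def p_births_in_month(n, lam=1.0):
    """Poisson probability of n quintuplet births in one month (mean lam)."""
    return math.exp(-lam) * lam ** n / math.factorial(n)

def p_one_each_month():
    return p_births_in_month(1) ** 12           # question 1

def p_empty_year():
    return p_births_in_month(0) ** 12           # question 2

def p_all_in_one_month(n):
    """All n >= 1 of the year's births fall in a single month (any of 12)."""
    return 12 * p_births_in_month(n) * p_births_in_month(0) ** 11

# Question 3: sum over n; terms beyond n = 30 are negligible (1/n! is tiny).
concentrated = sum(p_all_in_one_month(n) for n in range(1, 31))

print(p_one_each_month(), p_empty_year())       # both equal e**-12
print(concentrated / p_one_each_month())        # the ratio 12(e - 1), about 20.62
for n in (1, 2, 3):
    print(n, p_all_in_one_month(n) / concentrated)  # share of question 3's probability
```

The shares printed for n = 1 and n = 2 are 12/20.62 and 6/20.62 (about 0.58 and 0.29), matching the more-than-half and over-a-quarter figures above.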
For experts. Here is the probability theory behind the numbers reported above. The arrival of quintuplet births is a Poisson process in which the probability of a quintuplet birth in some small time interval dt is λdt, and the probabilities of quintuplet births at different times are independent. These conditions lead to the probability of n births over time t being given by the Poisson distribution
$$P(n,t) = \frac{e^{-\lambda t}(\lambda t)^n}{n!}$$
The mean of this distribution is λt.
For the case at hand, we choose the unit of time to be the month and λ = 1, since on average we have one quintuplet birth a month. Setting t = 1, we compute for one month:
$$\text{Probability(no births in one month)} = P(0,1) = \frac{e^{-1}(1)^0}{0!} = e^{-1}$$
$$\text{Probability(one birth in one month)} = P(1,1) = \frac{e^{-1}(1)^1}{1!} = e^{-1}$$
$$\text{Probability}(n \text{ births in one month}) = P(n,1) = \frac{e^{-1}(1)^n}{n!} = \frac{e^{-1}}{n!}$$
It follows that the probability of no quintuplet births in the full year and the probability of exactly one quintuplet birth in each month are both $(e^{-1})^{12} = e^{-12}$.
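The probability of n births in January and none in the remaining months is:
$$
\begin{aligned}
&\text{Probability}(n \text{ births in January}) \times \text{Probability(no births in February to December)} \\
&= P(n,1) \times P(0,1)^{11} = \frac{e^{-1}}{n!} \times e^{-11} = \frac{e^{-12}}{n!}
\end{aligned}
$$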
Therefore the probability of all n births in just one month of the year, where that may be any month from January to December, is
$$\frac{12\,e^{-12}}{n!}$$
To find the probability that all the births are localized to just one month of the year, we need to sum over all possible numbers of births, n = 1, 2, 3, 4, ... The resulting probability is:
$$12\,e^{-12}\left(\frac{1}{1!} + \frac{1}{2!} + \frac{1}{3!} + \frac{1}{4!} + \cdots\right) = 12(e-1)\,e^{-12}$$
which is the result reported in the text above. The summation relies on the expression for e:
$$e = 1 + \frac{1}{1!} + \frac{1}{2!} + \frac{1}{3!} + \frac{1}{4!} + \cdots$$
There is a simple derivation of the Poisson distribution. It is based on dividing the large time interval t of the distribution into N equal time intervals, t/N, where N >> n. The probability of a quintuplet birth in one time interval t/N is λt/N. Hence, using independence, the probability that there is one quintuplet birth in each of the first n of these intervals and none in the rest is:
$$
\begin{aligned}
&\text{Probability(one quintuplet birth in each of the first } n \text{ intervals)} \\
&\quad \times \text{Probability(no quintuplet births in the remaining } N-n \text{ intervals)} \\
&= (\lambda t/N)^n \times (1-\lambda t/N)^{N-n}
\end{aligned}
$$
Since there are C(N,n) = N!/((N-n)!n!) ways of distributing n births over the N time intervals, the probability of n quintuplet births over the N time intervals is
$$C(N,n)\,(\lambda t/N)^n \times (1-\lambda t/N)^{N-n} = \frac{N!}{(N-n)!\,n!}\,(\lambda t/N)^n \times (1-\lambda t/N)^{N-n}$$
Now recall that N>>n, so we have for very large N
$$\frac{N!}{(N-n)!\,n!} = \frac{N \times (N-1) \times (N-2) \times \cdots \times (N-n+1)}{n!} \approx \frac{N^n}{n!}$$
We also have for very large N
$$(1-\lambda t/N)^{N-n} = \left[(1-\lambda t/N)^{(N-n)/\lambda t}\right]^{\lambda t} \approx \left[e^{-1}\right]^{\lambda t} = e^{-\lambda t}$$
The approximations become exact in the limit of infinitely large N. Combining we have:
$$P(n,t) = \frac{N^n}{n!} \times (\lambda t/N)^n \times e^{-\lambda t} = \frac{e^{-\lambda t}(\lambda t)^n}{n!}$$
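The limit taken in this derivation can be watched numerically. The Python sketch below (helper names are illustrative) compares the probability of n births when the interval is cut into N slots, each independently holding a birth with probability λt/N, against the Poisson formula P(n,t), for growing N. It uses `math.comb`, available from Python 3.8.

```python
import math

def slots_probability(n, N, lam_t):
    """C(N, n) (lam_t/N)^n (1 - lam_t/N)^(N - n): n births among N independent slots."""
    p = lam_t / N
    return math.comb(N, n) * p ** n * (1 - p) ** (N - n)

def poisson(n, lam_t):
    """P(n, t) = e^(-lam_t) (lam_t)^n / n!"""
    return math.exp(-lam_t) * lam_t ** n / math.factorial(n)

# One month with lam = 1 and t = 1: two births, with finer and finer slots.
for N in (12, 120, 1200, 12000):
    approx = slots_probability(2, N, 1.0)
    print(N, approx, abs(approx - poisson(2, 1.0)))
```

The gap shrinks roughly in proportion to 1/N, as the approximations N!/((N-n)! n!) ≈ N^n/n! and (1-λt/N)^(N-n) ≈ e^(-λt) become exact.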
In the case of quintuplet births, the paradox arises from a failure to recognize just how independence manifests itself. In Simpson's paradox, we see something of a reverse of this problem. A spurious impression arises precisely because we assume independence when it is not present. That is, we infer from aggregated data that a statistical trend acts in one direction. Yet the real trend is in the opposite direction; and we have failed to see it since we assume, tacitly and erroneously, that the aggregation of the data was carried out in a manner that respected independence over subcategories.
The difficulty known as "Simpson's paradox" was recognized in the statistics literature well before Simpson's 1951 paper. There are many examples of it. One of the most cited concerns gender bias in admissions decisions at the University of California, Berkeley, in 1973. Other familiar cases involve the efficacy of various different medications.
To give a simple account of it, I will review here an entirely invented example that uses simple numbers. We compare two treatments for some illness, one strong and the other weak. We find a puzzling result in our aggregated data: the weaker treatment, weak, has a higher probability of successful treatment, cure, than the stronger treatment, strong. These data, reported as probabilities P, are
P (cure | strong) = 0.54
P (cure | weak) = 0.66
This poor performance of strong against weak is puzzling since strong is known always to outperform weak. Patients are divided according to whether their illness is mild or severe. Among the severely ill patients, strong performs better than weak.
P (cure | strong & severe) = 0.5
P (cure | weak & severe) = 0.3
Strong also outperforms weak in the cases of patients who are mildly ill:
P (cure | strong & mild) = 0.9
P (cure | weak & mild) = 0.7
This presents a puzzle. How is it possible that strong performs better than weak in each of the two cases individually, but worse than weak in the aggregated data?
The answer is that our puzzlement derives from a hidden assumption of independence. We are tacitly assuming that the treatment a patient receives is independent of the severity of the illness. That is, the proportion of mildly ill patients among those given strong is the same as the proportion among those given weak. The corresponding assumption is made for the severely ill patients:
P (mild | strong) = P (mild | weak)
P (severe | strong) = P (severe | weak)
With this independence assumption, the better performance of strong in each of the two subcategories of patients is reflected in the aggregated performance. To see how this works, take the simple case in which each treatment is given to mildly and severely ill patients in equal proportions:
P (mild | strong) = P (mild | weak) = 0.5
P (severe | strong) = P (severe | weak) = 0.5
These 0.5 probabilities weight the performance of the two treatments in the aggregation of the data. That is, using the law of total probability:
P( cure | strong ) = P( cure | strong & severe ) x P( severe | strong) + P( cure | strong & mild ) x P( mild | strong)
= 0.5 x 0.5 + 0.9 x 0.5 = 0.7
P( cure | weak ) = P( cure | weak & severe ) x P( severe | weak ) + P( cure | weak & mild ) x P( mild | weak )
= 0.3 x 0.5 + 0.7 x 0.5 = 0.5
That is, we find that strong performs better than weak, P( cure | strong ) > P( cure | weak ), as expected.
This assumption of independence is unrealistic. We would expect that the stronger treatment would be given to the more severely ill patients, for they need its strength. Plausibly, the stronger treatment may carry other risks of side effects, so that it is only given when needed. In such circumstances, independence in the assignments of the treatments will fail. Rather we would tend to assign the more severely ill the stronger treatment and the mildly ill the weaker treatment. A regime of treatment assignment that distributes the treatments this way might be:
P (mild | strong) = 0.1     P (mild | weak) = 0.9
P (severe | strong) = 0.9     P (severe | weak) = 0.1
As above, these probabilities are then used to weight the performance of the two treatments to give an aggregated assessment of their relative effectiveness:
P( cure | strong ) = P( cure | strong & severe ) x P( severe | strong) + P( cure | strong & mild ) x P( mild | strong)
= 0.5 x 0.9 + 0.9 x 0.1 = 0.54
P( cure | weak ) = P( cure | weak & severe ) x P( severe | weak) + P( cure | weak & mild ) x P( mild | weak)
= 0.3 x 0.1 + 0.7 x 0.9 = 0.66
That is, in aggregate, strong performs worse than weak, P( cure | strong ) < P( cure | weak ). These are the probabilities reported above at the outset of the account of Simpson's paradox.
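The two aggregations can be reproduced with the law of total probability in a few lines. This Python sketch uses the invented cure rates above; `aggregate` is an illustrative helper, and the only input that changes between the two scenarios is how each treatment is distributed over severely and mildly ill patients.

```python
CURE = {  # P(cure | treatment & severity), from the invented example
    ("strong", "severe"): 0.5, ("strong", "mild"): 0.9,
    ("weak", "severe"): 0.3, ("weak", "mild"): 0.7,
}

def aggregate(treatment, p_severe):
    """Law of total probability:
    P(cure | T) = P(cure | T & severe) P(severe | T) + P(cure | T & mild) P(mild | T),
    with P(mild | T) = 1 - P(severe | T)."""
    return (CURE[(treatment, "severe")] * p_severe
            + CURE[(treatment, "mild")] * (1 - p_severe))

# Independent assignment: each treatment group is half severe, half mild.
independent = {t: aggregate(t, 0.5) for t in ("strong", "weak")}
# Dependent assignment: strong goes mostly to the severely ill, weak to the mildly ill.
dependent = {"strong": aggregate("strong", 0.9), "weak": aggregate("weak", 0.1)}

for name, result in (("independent", independent), ("dependent", dependent)):
    print(name, {t: round(p, 2) for t, p in result.items()})
```

Within each severity group strong has the higher cure rate, yet only the independent assignment carries that advantage through to the aggregate; under the dependent assignment the aggregate ordering reverses.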
Once we recognize the failure of independence in assigning treatments to mildly and severely ill patients, this aggregated performance is no longer puzzling. The treatment strong is disproportionately given to the more severely ill patients where any treatment is likely to be less effective. If this failure of independence is overlooked, we arrive at the mistaken impression that the strong treatment is less effective.
How is it possible that we humans have survived so long in a chance-filled world, given our evident difficulty with assessing chances correctly?
In the illustration of Simpson's paradox, there were two sorts of probabilities employed: P( cure | strong & severe ) and P( severe | strong ). How should we interpret each? The first pertains to the efficacy of a treatment and the second to the distribution of treatment over patients. Are they sufficiently similar in meaning so that we can combine them in one calculation?
The presumption is that, whenever our intuitions and the probability calculus disagree, then our intuitions are wrong. Is this presumption correct? What justifies it?
August 10, 2021
Copyright, John D. Norton