PROBABILITIES ALWAYS SUM TO 1….RIGHT?

Well – it depends what you mean by that.

I get this question every time I teach probability, so I figured I’d put it and its answer out there for everyone once and for all.

“Why do we use probability density functions (pdfs) for continuous distributions and probability mass functions (pmfs) for discrete distributions?”

Put another way,

“Why can’t we use pmfs for both discrete and continuous distributions?”

For those of you who are with me up to here, the answer is that an uncountable sum of positive numbers is infinite, which rules out probabilities summing to 1.  Skip to the end for the proof.

For those of you who aren’t, let me explain what all this means, and then you too can skip to the end when you’re comfortable with it.

DISCRETE DISTRIBUTIONS

THE FINITE CASE

When we talk about probabilities, we assign numbers to events that describe their relative frequency of occurring.

For example, if we toss a coin, there are only 2 outcomes, heads or tails (H or T). 

  • If it’s a fair coin, there’s a 50/50 chance for each occurring, and we’d say the probability of an H is 0.5, similarly for the probability of a T. 
  • If it’s a biased coin (say 30/70 H and T, respectively), then the probability of an H is 0.3, and the probability of a T is 0.7.

These numbers can be used to describe the relative frequency of each event occurring.  So, for example, if we flip that biased coin 100 times, we’d expect about 30 = 0.3 x 100 flips to come up H and about 70 = 0.7 x 100 to come up T.
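If you want to see that relative-frequency business in action, here’s a minimal Python sketch (the 0.3/0.7 bias and the 100 flips are just the numbers from the example; the seed is arbitrary):

    # Simulate 100 flips of a coin biased 30/70 toward H/T and count outcomes.
    import random

    random.seed(0)
    flips = ["H" if random.random() < 0.3 else "T" for _ in range(100)]
    print(flips.count("H"), flips.count("T"))  # roughly 30 and 70

Run it with a few different seeds and the counts hover around 30 and 70, exactly the relative frequencies the probabilities promise.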

In general, we refer to a Sample Space as the space of all possible outcomes, and it’s equipped with probabilities for each outcome occurring.  In the 50/50 coin-flipping example, the sample space and probabilities are,

$$\Omega = \{H, T\}, \qquad P(H) = P(T) = \tfrac{1}{2}.$$

Probabilities have to be non-negative (what’s a negative count mean?) and sum to 1 (it’s the space of all possible outcomes, so something in there’s gotta happen).

From here, most classes introduce the concept of conditional probability, written P(A | B) and pronounced “the probability of A given B”, to quantify the impact knowing B has on the likelihood of A occurring,

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}.$$

The two events A and B are called independent if knowing B has no effect on knowing A, that is, 

$$P(A \mid B) = P(A),$$

which is the same as

$$P(A \cap B) = P(A)\,P(B),$$

which is the usual “product rule” introduced in probability classes.

A quick and useful example:  suppose a family has 3 children, and you know at least two of them are boys.  What’s the probability the third is also a boy?  The answer is not 1/2.

The sample space is,

$$\Omega = \{BBB,\ BBG,\ BGB,\ GBB,\ BGG,\ GBG,\ GGB,\ GGG\},$$

each outcome occurring with probability 1/8 (each child’s birth is independent of the others).  So, if A is the event of 3 boys and B is the event of having at least 2 boys,

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{1/8}{4/8} = \frac{1}{4}.$$

Notice the 8’s cancel.  What’s happening here is that the probability of A is weighed against the probability of B, which we assume happened, rather than against 1 as in the unconditional case.

Fits with what you’d expect knowing B to do to the likelihood of A, dontcha think?
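If you’d rather make a computer do the counting, here’s a quick brute-force check of the example (a Python sketch; A and B are the events defined above):

    # Enumerate the 8 equally likely birth sequences and count events.
    from itertools import product

    omega = list(product("BG", repeat=3))          # sample space: BBB, BBG, ...
    B = [w for w in omega if w.count("B") >= 2]    # at least 2 boys (4 outcomes)
    A_and_B = [w for w in B if w.count("B") == 3]  # all 3 boys (1 outcome)
    print(len(A_and_B) / len(B))                   # 0.25, i.e. 1/4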

THE INFINITE CASE

With me so far? 

The previous examples were ones where the sample spaces were finite.  Most classes then proceed to introduce examples where the sample spaces are infinite.

Consider the number of times my cat has to beg me to play with her before I do.  Each beg fails with probability q = 2/3 and succeeds with probability p = 1/3.  We assume she eventually succeeds (I do love my cat, after all).

If S is a success and F is a failure, the sample space consists of sequences,

$$S,\quad FS,\quad FFS,\quad FFFS,\quad \dots$$

If X is the number of times she has to beg me before I play with her, the probabilities are,

$$P(X = k) = q^{k-1}\,p, \qquad k = 1, 2, 3, \dots$$

If you tried to sum these using a geometric series, you’d find,

$$\sum_{k=1}^{\infty} q^{k-1}\,p = \frac{p}{1-q} = \frac{1/3}{1 - 2/3} = 1,$$

so we’re good with the whole “probabilities sum to 1” thing.  This is an example of a geometric distribution.

In case you’re wondering, the average of a geometric distribution is 1/p, so my cat has to beg me an average of 3 times before I play with her.  I love my cat.
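Don’t take my word for it; here’s a numerical sanity check (a Python sketch, truncating the infinite sum at 200 terms, which is plenty since q^k decays geometrically):

    # Verify the geometric distribution sums to 1 and has mean 1/p = 3.
    p, q = 1/3, 2/3

    total = sum(q**(k - 1) * p for k in range(1, 200))     # ~1.0
    mean = sum(k * q**(k - 1) * p for k in range(1, 200))  # ~3.0
    print(total, mean)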

CONTINUOUS DISTRIBUTIONS

Still with me?

The previous examples are called discrete distributions because their sample spaces are either finite, or infinite but with “gaps” between outcomes.  Probabilities are then assigned to each “atom” in the space and sum to 1; the functions doing that assigning are called probability mass functions.

What about sample spaces that are infinite but have no “gaps” between them, like the amount of time I wait for my morning espresso?

Here’s where the psychological jump happens.  Instead of using a (reasonably suggestive) probability mass function for each value that can occur, most classes deal with this by introducing probability density functions. 

In my morning espresso example, the sample space is,

$$\Omega = [0, \infty),$$

If X is the number of minutes I wait, the probability I wait between a and b minutes is,

$$P(a \le X \le b) = \int_a^b \lambda e^{-\lambda x}\, dx,$$

where lambda is the so-called rate parameter, measuring the average number of beverages served per minute.  The integrand there is the probability density function, because…well, it’s a density.  This is an example of an exponential distribution.

My boss once joked that when Kash gets his coffee, everyone wins.  You should, in fact, check that I do eventually get my coffee and my colleagues are happy, i.e. “probabilities sum to 1”,

$$P(0 \le X < \infty) = \int_0^\infty \lambda e^{-\lambda x}\, dx = 1.$$

One can show more generally that,

$$P(X \ge t) = \int_t^\infty \lambda e^{-\lambda x}\, dx = e^{-\lambda t},$$

so that if they serve 3 beverages every 5 minutes, I’m waiting at least 1 minute with 54% chance, at least 2 minutes with 30% chance, and at least 5 minutes with 5% chance.
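You can check those percentages yourself; here’s a minimal sketch (lambda = 3/5, straight from the example):

    # Tail probabilities P(X >= t) = exp(-lambda * t) for the espresso wait.
    import math

    lam = 3 / 5
    for t in (1, 2, 5):
        print(t, math.exp(-lam * t))  # ~0.549, ~0.301, ~0.050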

But what’s wrong with the reasonable suggestion that we use a probability mass function here?  That is, why can’t we ask if there exists a function h(x) such that

$$P(a \le X \le b) = \sum_{a \,\le\, x \,\le\, b} h(x),$$
$$\sum_{0 \,\le\, x \,<\, \infty} h(x) = 1\;?$$

THE PROOF

Now you’re all caught up.  That uncountable sum of h(x) up there is infinite.  If it sums to infinity over [a,b], then it definitely sums to infinity everywhere, so it prohibits the whole “probabilities sum to 1” thing.

The essence of the proof is in the picture below [those more mathematically inclined can check out my video below, where I go through it in full rigor].

Suppose this is our h(x).  Notice something really important: if h(x) exceeds 1 (orange, y-axis), then these terms must contribute at least 1 to our sum, and if there are infinitely many such terms (orange, x-axis, A_0) then this blows up our sum.

Same thing happens if our h(x) is between 1/2 and 1 (green, y-axis): these terms must contribute at least 1/2 to our sum, and if there are infinitely many such terms (green, x-axis, A_1) then this also blows up our sum.

Same thing happens if our h(x) is between 1/3 and 1/2 (purple, y-axis, x-axis, A_2)….

See where this is going?

If this keeps happening, our graph of h(x) can’t really look like this.  In fact, the above argument forces each of those sets A_0, A_1, A_2, … to be finite.  But a countable union of finite sets is countable, and [a,b] is uncountable, so those sets can never really “fill up” [a,b].  It’s hopeless to hope for an h(x) that’s positive everywhere.
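For those who’d like the picture’s argument in symbols before watching (a sketch; I’m writing A_n for the set of x in [a,b] with h(x) > 1/(n+1), a slight repackaging of the colored bands above): for each n,

$$\sum_{x \in [a,b]} h(x) \;\ge\; \sum_{x \in A_n} h(x) \;\ge\; \frac{|A_n|}{n+1},$$

so if any A_n were infinite, the sum would blow up, forcing each A_n to be finite.  But every x with h(x) > 0 lies in some A_n, so

$$\{\, x \in [a,b] : h(x) > 0 \,\} \;=\; \bigcup_{n=0}^{\infty} A_n$$

is a countable union of finite sets, hence countable, while [a,b] is uncountable.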


Ok, time for the video for the proof.  Please note the pre-req’s listed below it.

View the Video 

Prerequisites/concurrent learning:  set builder notation, supremum, series and sequences
