Framework for describing systems of imperfect knowledge

Examples: the future.

A bit sequence that we know little about.

A bit sequence some of which has not been created yet.

Sample space

A probability is assigned to each point of a sample space or, in the continuous case, to intervals or neighborhoods.

So the sample space might be all bit streams of length N. It has 2^N points.

Uniform distribution: all points have the same probability, of 2^(-N).

The probability of point A represents the proportion of the time your bit stream will be bit stream A.

Probabilities are always non-negative and sum to 1, which means your bit stream will always be some bit stream in the sample space.
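As a small illustration, the sample space of bit streams and the uniform distribution on it can be written out explicitly. This is only a sketch: the length N = 4 and the names `sample_space` and `prob` are choices made here, not part of the notes.

```python
from fractions import Fraction
from itertools import product

N = 4  # stream length, chosen only to keep the example small

# Sample space: every bit stream of length N, as a tuple of 0s and 1s.
sample_space = list(product((0, 1), repeat=N))

# Uniform distribution: each point gets probability 2^(-N).
prob = {point: Fraction(1, 2**N) for point in sample_space}

print(len(sample_space))   # number of points, 2^N
print(sum(prob.values()))  # total probability over the whole space
```

Exact fractions are used instead of floats so that "sums to 1" can be checked without rounding.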

When you know nothing you generally assume the uniform distribution.

Event: a property of some points in the sample space; if you prefer, it is a subset of the sample space.

Examples: The third bit in the bit stream is 1; the number of 1's in the stream is 7; the number of 1's is odd.

The probability of an event E is the sum of the probabilities of the points in the subset that is E, that is, the points on which E holds.
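The three example events can be evaluated by summing point probabilities directly. The sketch below assumes the uniform distribution on bit streams of length 8; the helper name `pr` and the choice of representing an event as a predicate on points are illustrative.

```python
from fractions import Fraction
from itertools import product

N = 8
sample_space = list(product((0, 1), repeat=N))
prob = {pt: Fraction(1, 2**N) for pt in sample_space}

def pr(event):
    """Probability of an event: sum the probabilities of the points where it holds."""
    return sum(prob[pt] for pt in sample_space if event(pt))

third_bit_is_one = lambda pt: pt[2] == 1   # third bit in the stream is 1
seven_ones = lambda pt: sum(pt) == 7       # the number of 1's is 7
odd_ones = lambda pt: sum(pt) % 2 == 1     # the number of 1's is odd

print(pr(third_bit_is_one))  # 1/2
print(pr(seven_ones))        # 1/32
print(pr(odd_ones))          # 1/2
```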

Random Variable: a (usually numerical) function defined on the points of the sample space.

Expectation of a random variable: the sum, over all points A of the sample space, of its value at A multiplied by the probability of A.

The expectation of the random variable f is usually denoted by <f> or E(f) or sometimes f with a hat on it or f with a straight bar over it.

Notice that the expectation of a variable is linear in that variable; if you multiply the variable by 7 its expectation gets multiplied by 7. In other words, <7X> = 7<X>.

Indicator random variable of event E: function that is 1 where the event E happens and 0 elsewhere.

Sometimes Useful Fact 1: The probability of an event is the expectation of its indicator variable.

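Useful Fact 2: A random variable f is the sum over its values v of v times the indicator variable for the event that f is v. Why is this so? Well, that sum is equal to v exactly when f is v. Taking expectations and using Fact 1, the expectation of f is the sum over its values v of v times the probability that f is v.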
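Both facts can be checked by brute force on a small sample space. In this sketch Fact 2 is read as E(f) = sum over values v of v*Pr(f = v); the uniform distribution on streams of length 4, the variable f (the number of 1's), and all function names are assumptions made for the example.

```python
from fractions import Fraction
from itertools import product

N = 4
sample_space = list(product((0, 1), repeat=N))
prob = {pt: Fraction(1, 2**N) for pt in sample_space}

def expectation(f):
    """Sum of f(point) * Pr(point) over the whole sample space."""
    return sum(f(pt) * prob[pt] for pt in sample_space)

def pr(event):
    return sum(prob[pt] for pt in sample_space if event(pt))

def indicator(event):
    """Random variable that is 1 where the event holds and 0 elsewhere."""
    return lambda pt: 1 if event(pt) else 0

f = sum                    # random variable: the number of 1's in the stream
E = lambda pt: pt[2] == 1  # event: the third bit is 1

# Fact 1: Pr(E) equals the expectation of the indicator of E.
fact1_lhs, fact1_rhs = pr(E), expectation(indicator(E))

# Fact 2: E(f) equals the sum over values v of v * Pr(f = v).
values = sorted({f(pt) for pt in sample_space})
fact2_rhs = sum(v * pr(lambda pt, v=v: f(pt) == v) for v in values)

print(fact1_lhs, fact1_rhs)        # 1/2 1/2
print(expectation(f), fact2_rhs)   # 2 2
```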

Expectation is sometimes called “average” or “mean.”

The variance of a random variable is the expectation of the square of the difference between it and its expectation. If you prefer, it is the mean of the square of its deviation from the mean, or its “mean square deviation.”

We can denote it by <(f - <f>)^2> or <f^2 - 2f<f> + <f>^2>.

Now suppose we have several random variables, and/or several events. We ask what we can say about expectations and probabilities of combinations of them.

Two events are mutually exclusive if they do not both happen at any point of the sample space.

Events E and F are independent, if the probability of E occurring in the whole sample space is the same as its probability in the reduced sample space consisting of those points for which F occurs (and vice versa).

The probability of E on that reduced space is called the probability of E given F, and is usually written as Pr(E|F).

It is the probability that both E and F happen on the same point, divided by the probability that F occurs. Dividing by Pr(F) is required so that the probabilities on the reduced space sum to 1.

In symbols this reads Pr(E|F) = Pr(E and F) / Pr(F)

Useful Fact 3: If E and F are independent, Pr(E|F) must be Pr(E), so we get Pr(E and F) = Pr(E)*Pr(F).
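Conditional probability and independence can be tested on concrete events. The sketch below assumes uniform bit streams of length 4; the events E, F and G and the helper `pr_given` are hypothetical names chosen for the illustration.

```python
from fractions import Fraction
from itertools import product

N = 4
sample_space = list(product((0, 1), repeat=N))
prob = {pt: Fraction(1, 2**N) for pt in sample_space}

def pr(event):
    return sum(prob[pt] for pt in sample_space if event(pt))

def pr_given(e, f):
    """Probability of e on the reduced space where f occurs."""
    return pr(lambda pt: e(pt) and f(pt)) / pr(f)

E = lambda pt: pt[2] == 1        # the third bit is 1
F = lambda pt: sum(pt) % 2 == 1  # the number of 1's is odd
G = lambda pt: sum(pt) >= 3      # at least three 1's

print(pr(E), pr_given(E, F))  # 1/2 1/2  -> E and F are independent
print(pr(E), pr_given(E, G))  # 1/2 4/5  -> E and G are not
```

For the independent pair, Pr(E and F) = Pr(E)*Pr(F) holds as well; for E and G it fails.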

Random variables are independent if knowing the value of one does not change the probability that the other takes any given value.

If we use fact 2 above we can deduce that the expectation of the product of independent random variables f and g is the product of their expectations (by applying Fact 3 for each pair of values of f and g).

This last fact is extremely useful: E(fg) = E(f)E(g) for independent f and g.

Still another extremely useful fact is that the expectation of the sum of two (or more) random variables is the sum of their expectations. This is another consequence of the linearity of expectations and is true for ALL pairs of variables without exception.
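A quick numerical check of the two rules, as a sketch under assumptions: uniform bit streams of length 3, with f and g reading different bits (independent) and h a copy of f (as dependent as possible). The names are illustrative only.

```python
from fractions import Fraction
from itertools import product

N = 3
sample_space = list(product((0, 1), repeat=N))
prob = {pt: Fraction(1, 2**N) for pt in sample_space}

def expectation(f):
    return sum(f(pt) * prob[pt] for pt in sample_space)

f = lambda pt: pt[0]  # first bit
g = lambda pt: pt[1]  # second bit, independent of f
h = lambda pt: pt[0]  # same as f, so not independent of f

prod_fg = expectation(lambda pt: f(pt) * g(pt))
prod_fh = expectation(lambda pt: f(pt) * h(pt))
sum_fh = expectation(lambda pt: f(pt) + h(pt))

print(prod_fg, expectation(f) * expectation(g))  # 1/4 1/4: product rule holds
print(prod_fh, expectation(f) * expectation(h))  # 1/2 1/4: product rule fails
print(sum_fh, expectation(f) + expectation(h))   # 1 1: sum rule holds anyway
```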

If two (or more) random events are mutually exclusive, then the probability of either occurring is the sum of their individual probabilities.

If A and B are not mutually exclusive, the sum of their probabilities counts points in the sample space in which both occur twice. So we get

Pr(A or B) = Pr(A) + Pr(B) - Pr(A and B)

We can generalize this statement when there are more events, A, B, C and D for example. The result is called the Principle of Inclusion and Exclusion and is also useful. For three events we get

Pr(A or B or C) = Pr(A) + Pr(B) + Pr(C) - Pr(A and B) - Pr(B and C) - Pr(A and C) + Pr(all three)

To see why, take a point of the sample space at which all three events occur. The first three terms count it 3 times and the next three terms remove it 3 times, so the last term has to add it back in order to count it once.

So with many events, the probability that at least one of them happens is the sum of the probabilities of all intersections of an odd number of them, less the sum of the probabilities of all intersections of an even number of them.
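The three-event formula can be verified by enumeration. This sketch assumes uniform bit streams of length 4 and three events chosen arbitrarily for the example; none of the names come from the notes.

```python
from fractions import Fraction
from itertools import product

N = 4
sample_space = list(product((0, 1), repeat=N))
prob = {pt: Fraction(1, 2**N) for pt in sample_space}

def pr(event):
    return sum(prob[pt] for pt in sample_space if event(pt))

A = lambda pt: pt[0] == 1        # first bit is 1
B = lambda pt: pt[1] == 1        # second bit is 1
C = lambda pt: sum(pt) % 2 == 1  # odd number of 1's

def both(x, y):
    return lambda pt: x(pt) and y(pt)

# Left side: probability that at least one of A, B, C happens.
lhs = pr(lambda pt: A(pt) or B(pt) or C(pt))

# Right side: singles, minus pairs, plus the triple.
rhs = (pr(A) + pr(B) + pr(C)
       - pr(both(A, B)) - pr(both(B, C)) - pr(both(A, C))
       + pr(lambda pt: A(pt) and B(pt) and C(pt)))

print(lhs, rhs)  # 7/8 7/8
```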

The most important property concerns the variance of the sum of independent variables.

The variance of f, as we have seen, obeys

Var(f) = <(f - <f>)^2> = <f^2 - 2f<f> + <f>^2>.

If we use the fact that the expectation of a sum is the sum of expectations and the fact the expectation of a constant multiple of a variable is the same multiple of its expectation to simplify, we get

Var(f) = <f^2> - <f>^2
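Written out step by step, the simplification goes as follows. The only thing to keep in mind is that <f> is a constant, so it can be pulled out of an expectation like any other multiplier.

```latex
\begin{align*}
\mathrm{Var}(f) &= \langle (f - \langle f \rangle)^2 \rangle \\
  &= \langle f^2 \rangle - 2\langle f \rangle \langle f \rangle + \langle f \rangle^2
     && \text{(sum rule, and constants factor out)} \\
  &= \langle f^2 \rangle - \langle f \rangle^2 .
\end{align*}
```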

We can apply this to compute Var(f+g). This has terms quadratic in f which form Var(f), terms quadratic in g which form Var(g), and cross terms, which are

2(<f g> - <f><g>)

This quantity is twice the covariance of f and g, so we have

Var(f+g) = Var(f) + Var(g) + 2 Cov(f,g)

And furthermore, since <fg> = <f><g> for independent variables, the covariance of independent variables is 0, and the variance of their sum is the sum of their variances.
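The identity for Var(f+g) can be checked on both an independent and a dependent pair. A sketch, assuming uniform bit streams of length 3; f, g, h and the helper functions are names invented for the example.

```python
from fractions import Fraction
from itertools import product

N = 3
sample_space = list(product((0, 1), repeat=N))
prob = {pt: Fraction(1, 2**N) for pt in sample_space}

def expectation(f):
    return sum(f(pt) * prob[pt] for pt in sample_space)

def var(f):
    m = expectation(f)
    return expectation(lambda pt: (f(pt) - m) ** 2)

def cov(f, g):
    return expectation(lambda pt: f(pt) * g(pt)) - expectation(f) * expectation(g)

def add(f, g):
    return lambda pt: f(pt) + g(pt)

f = lambda pt: pt[0]          # first bit
g = lambda pt: pt[1]          # second bit, independent of f
h = lambda pt: pt[0] + pt[1]  # shares the first bit with f, so dependent

print(cov(f, g), var(add(f, g)), var(f) + var(g))  # 0 1/2 1/2
print(cov(f, h), var(add(f, h)), var(f) + var(h))  # 1/4 5/4 3/4
```

In the dependent case the missing 1/2 is exactly 2 Cov(f,h).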

This is an important and surprising statement because the variance is quadratic, and the behavior shown here by independent variables is what you would expect from linear operations, not quadratic ones.

What does all this mean?

So far not much. However, suppose we repeat an experiment over and over again, say N times.

So we have sequences of events and of independent and identically distributed random variables associated with the sequence.

Suppose the probability of some event E in one experiment is p.

Then the expected value of its indicator is p.

The variance of its indicator is p^2(1-p) + p(1-p)^2, or p(1-p).

The number of occurrences of E in the sequence is the sum of the indicators for each experiment.

Its expected value is therefore Np.

And its variance is Np(1-p).

The expected value of the proportion of the trials in which E happens is therefore p,

and the variance of this proportion (remember variance is quadratic) is p(1-p)/N.

This obviously goes to 0 as N increases.
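These four numbers can be confirmed by enumerating every outcome of N repeated experiments. The sketch assumes N = 10 independent trials with p = 3/10 (arbitrary values) and uses hypothetical helper names; each outcome sequence gets probability p^k (1-p)^(N-k), where k is its number of occurrences.

```python
from fractions import Fraction
from itertools import product

N, p = 10, Fraction(3, 10)  # illustrative values only

# A point is a sequence of N indicators: 1 where E happened, 0 where it did not.
points = list(product((0, 1), repeat=N))

def point_prob(pt):
    """Probability of one outcome sequence under independent trials."""
    k = sum(pt)
    return p**k * (1 - p) ** (N - k)

def expectation(f):
    return sum(f(pt) * point_prob(pt) for pt in points)

def var(f):
    m = expectation(f)
    return expectation(lambda pt: (f(pt) - m) ** 2)

count = sum                                # number of occurrences of E
proportion = lambda pt: Fraction(sum(pt), N)

print(expectation(count), var(count))            # Np = 3, Np(1-p) = 21/10
print(expectation(proportion), var(proportion))  # p = 3/10, p(1-p)/N = 21/1000
```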

So what does this tell us? We need one more tool: look at the definition of the variance V.

V = Sum (dev)^2 Pr(dev), where dev is the deviation from the mean and the sum is over its possible values.

For any x > 0, divide the sum into the contributions with |dev| > x and the others. Ignore the others, whose total has to be non-negative. Notice that (dev)^2 is always at least x^2 in the remaining sum.

We then have V >= x^2 Pr(|dev| > x), or equivalently

Pr(|dev| > x) <= V/x^2 (Tchebyshev Inequality)

Thus, if V goes to 0, then for any fixed x > 0 the probability that the deviation exceeds x in absolute value goes to 0.

Conclusion: as N increases, the probability that the proportion of occurrences of E differs from p by more than any fixed amount goes to zero. This is the weak law of large numbers.
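To see the inequality and the weak law at work, one can compare the exact tail probability of the proportion with the bound V/x^2 = p(1-p)/(N x^2). This is a sketch under assumptions: p = 1/2, x = 1/10 and the three values of N are arbitrary, and the exact tail is computed from the binomial formula C(N,k) p^k (1-p)^(N-k) for the number of occurrences.

```python
from fractions import Fraction
from math import comb

def tail_probability(N, p, x):
    """Exact Pr(|proportion - p| > x) for N independent trials."""
    return sum(comb(N, k) * p**k * (1 - p) ** (N - k)
               for k in range(N + 1)
               if abs(Fraction(k, N) - p) > x)

def chebyshev_bound(N, p, x):
    """V / x^2 with V = p(1-p)/N, the variance of the proportion."""
    return p * (1 - p) / (N * x**2)

p, x = Fraction(1, 2), Fraction(1, 10)
results = [(N, tail_probability(N, p, x), chebyshev_bound(N, p, x))
           for N in (10, 100, 1000)]

for N, tail, bound in results:
    print(N, float(tail), float(bound))
```

The bound is crude (for N = 10 it exceeds 1 and says nothing), but both columns shrink as N grows, which is all the argument needs.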

Interpretation of the probability of an event. Suppose each bit is 1 with probability p, independently of the others, and the event E is that a bit stream of length 8 has four ones in it.

The probability of any one sequence where the event E happens is

p^4(1-p)^4

The probability of exactly 4 occurrences of 1 in a bit stream of length 8 is C(8,4)*p^4(1-p)^4, where C(n,k) is n factorial divided by k factorial divided by (n-k) factorial (also called "n choose k"; here, "8 choose 4").
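As a worked number, the formula can be evaluated directly. The sketch assumes p = 1/2 for concreteness; the function name `pr_exactly` is made up here, while `math.comb` is the standard-library binomial coefficient.

```python
from fractions import Fraction
from math import comb

def pr_exactly(k, n, p):
    """Probability of exactly k ones in n independent bits, each 1 with probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

p = Fraction(1, 2)
print(comb(8, 4))            # 70 sequences have exactly four ones
print(pr_exactly(4, 8, p))   # 70/256 = 35/128

# Sanity check: the probabilities over k = 0..8 should add up to 1.
total = sum(pr_exactly(k, 8, p) for k in range(9))
print(total)
```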

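General statement: the probability of exactly k occurrences of 1 in a bit stream of length N is C(N,k)*p^k(1-p)^(N-k).

As N grows, the limit of that distribution, suitably centered and scaled, is a Normal distribution, which is also known as the Gaussian distribution.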

Whenever you form the sum of many independent random variables whose variances are more or less comparable (no small number of them dominates), it will have close to a normal distribution. This is called the Central Limit Theorem.

Caution: if your variable is the product of lots of independent positive variables, it is its logarithm (a sum of independent variables) that will have close to a normal distribution, not the variable itself.
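One way to get a feel for this is to compare the distribution of a sum of 100 independent bits with the normal density having the same mean and variance. This is a numerical sketch, not a proof; n = 100, p = 0.5 and the function names are choices made for the example.

```python
from math import comb, exp, pi, sqrt

def binomial_pmf(k, n, p):
    """Probability that the sum of n independent 0/1 variables equals k."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def normal_density(x, mean, var):
    """Density of the normal distribution with the given mean and variance."""
    return exp(-((x - mean) ** 2) / (2 * var)) / sqrt(2 * pi * var)

n, p = 100, 0.5
mean, var = n * p, n * p * (1 - p)  # Np and Np(1-p)

# Largest pointwise gap between the exact probabilities and the normal curve.
max_gap = max(abs(binomial_pmf(k, n, p) - normal_density(k, mean, var))
              for k in range(n + 1))

print(binomial_pmf(50, n, p), normal_density(50, mean, var))
print(max_gap)
```

The gap is small compared with the peak height of about 0.08, and it shrinks further as n increases.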