This short note gives a geometric perspective on conditional expectation, namely that it is a projection, and a nice application of this fact to regression.
1. Conditional Expectation as Projection
Theorem 1. Let $(\Omega, \mathcal{F}, P)$ be a probability space, let $\mathcal{G}$ be a sub-$\sigma$-algebra of $\mathcal{F}$, and let $X$ be an $L^2$ random variable. Then $E[X \mid \mathcal{G}]$ is the projection of $X$ onto $L^2(\Omega, \mathcal{G}, P)$.
Before we prove this, let’s set the measure theory aside for a second and think intuitively: a common way to introduce conditional expectation without measure theory is to call it “the best guess you would make for $X$ given some information.” With this in mind, it makes sense that the conditional expectation would be the point in the information you have that is “closest” to $X$. But the nice thing is that this isn’t just a cute analogy: it’s a precise fact.
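Proof: The squared norm in $L^2$ is the integral of the square, $\|W\|_2^2 = E[W^2]$, so it’s sufficient to show that
$$E\big[(X - E[X \mid \mathcal{G}])^2\big] = \min_{Z \in L^2(\Omega, \mathcal{G}, P)} E\big[(X - Z)^2\big].$$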
To do this, we’ll need a lemma.
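Lemma 2. If $Z \in L^2(\Omega, \mathcal{G}, P)$, then
$$E\big[Z\,(X - E[X \mid \mathcal{G}])\big] = 0.$$
Proof: By Cauchy-Schwarz, $ZX \in L^1$, so we have
$$E[ZX] = E\big[E[ZX \mid \mathcal{G}]\big] = E\big[Z\,E[X \mid \mathcal{G}]\big],$$
and then subtracting gives the result.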
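To make the lemma concrete, here is a minimal sketch on a finite probability space, where the sub-$\sigma$-algebra is generated by a partition and conditioning just means averaging over each block. The six-point space, the partition, and the values of `X` and `Z` are made-up illustrations, not part of the argument. The code checks that $E[Z\,(X - E[X \mid \mathcal{G}])] = 0$ for one $\mathcal{G}$-measurable $Z$, using exact rational arithmetic so the zero is not a rounding artifact.

```python
from fractions import Fraction

# Hypothetical finite probability space: six equally likely outcomes 0..5.
prob = [Fraction(1, 6)] * 6

# G is generated by this partition; "G-measurable" means constant on each block.
partition = [[0, 1], [2, 3, 4], [5]]


def expectation(f):
    """E[f] for a random variable stored as a list of values, one per outcome."""
    return sum(p * v for p, v in zip(prob, f))


def cond_expectation(f):
    """E[f | G]: replace f on each block by its probability-weighted average."""
    out = [Fraction(0)] * len(f)
    for block in partition:
        mass = sum(prob[w] for w in block)
        avg = sum(prob[w] * f[w] for w in block) / mass
        for w in block:
            out[w] = avg
    return out


X = [Fraction(v) for v in (3, -1, 4, 1, -5, 9)]
Z = [Fraction(v) for v in (2, 2, -7, -7, -7, 5)]  # constant on blocks

EX_G = cond_expectation(X)

# Inner product of Z with the residual X - E[X | G].
inner = expectation([z * (x - e) for z, x, e in zip(Z, X, EX_G)])
print(inner)  # prints 0
```

Changing `Z` to any other list that is constant on each block should leave the result at zero, while a `Z` that varies within a block generally will not.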
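With this lemma in hand, take some $Z \in L^2(\Omega, \mathcal{G}, P)$. Then, we have
$$E\big[(X - Z)^2\big] = E\big[\big((X - E[X \mid \mathcal{G}]) + (E[X \mid \mathcal{G}] - Z)\big)^2\big],$$
and expanding this gives
$$E\big[(X - E[X \mid \mathcal{G}])^2\big] + E\big[(E[X \mid \mathcal{G}] - Z)^2\big] + 2\,E\big[(E[X \mid \mathcal{G}] - Z)(X - E[X \mid \mathcal{G}])\big].$$
The cross-term vanishes because of our lemma, applied to $E[X \mid \mathcal{G}] - Z$, which is $\mathcal{G}$-measurable and square-integrable (conditional Jensen gives $E[X \mid \mathcal{G}] \in L^2$). The first term does not depend on $Z$. The second is always nonnegative, and is zero when $Z = E[X \mid \mathcal{G}]$, so this must be the minimum. The result follows.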
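The decomposition in the proof can also be checked numerically. The sketch below reuses a made-up six-point probability space with a partition-generated sub-$\sigma$-algebra (all values are illustrative assumptions) and verifies, for a grid of $\mathcal{G}$-measurable candidates $Z$, that $E[(X - Z)^2] = E[(X - E[X \mid \mathcal{G}])^2] + E[(E[X \mid \mathcal{G}] - Z)^2]$, so no candidate beats the conditional expectation.

```python
from fractions import Fraction
from itertools import product

# Hypothetical finite probability space: six equally likely outcomes 0..5.
prob = [Fraction(1, 6)] * 6
partition = [[0, 1], [2, 3, 4], [5]]  # blocks generating G
X = [Fraction(v) for v in (3, -1, 4, 1, -5, 9)]


def expectation(f):
    return sum(p * v for p, v in zip(prob, f))


def cond_expectation(f):
    """E[f | G] as the probability-weighted average of f over each block."""
    out = [Fraction(0)] * len(f)
    for block in partition:
        mass = sum(prob[w] for w in block)
        avg = sum(prob[w] * f[w] for w in block) / mass
        for w in block:
            out[w] = avg
    return out


def sq_dist(f, g):
    """Squared L^2 distance E[(f - g)^2]."""
    return expectation([(a - b) ** 2 for a, b in zip(f, g)])


EX_G = cond_expectation(X)
best = sq_dist(X, EX_G)

checked = 0
for consts in product([-2, 0, 1, 9], repeat=len(partition)):
    # Build a G-measurable Z: one constant per block.
    Z = [Fraction(0)] * len(X)
    for block, c in zip(partition, consts):
        for w in block:
            Z[w] = Fraction(c)
    # Pythagorean identity from the proof (the cross-term is zero).
    assert sq_dist(X, Z) == best + sq_dist(EX_G, Z)
    assert sq_dist(X, Z) >= best
    checked += 1

print(best)  # prints 25/3
```

The grid includes the block averages themselves (1, 0, 9), which is the one candidate where the second term is zero and the minimum is attained.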
This new paradigm has at least one nice application: understanding linear regression better.
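Suppose you have some data points that are generated from jointly distributed random variables $X$ and $Y$, which satisfy $E[Y \mid X] = \alpha + \beta X$, where $\alpha, \beta$ are constants. We observe a bunch of data from this distribution, $(x_1, y_1), \dots, (x_n, y_n)$, where $y_i = \alpha + \beta x_i + \epsilon_i$ and $\epsilon_i$ is mean-zero Gaussian noise.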
We call $r(x) = E[Y \mid X = x] = \alpha + \beta x$ the true regression function, both because it is the function that actually generates the data (up to noise) and because a direct computation shows that it is the function that minimizes the expected squared error of a future prediction. In practice, we don’t know the distributions of $X$ and $Y$, so we have to approximate the true regression function with an empirical regression function, $\hat{r}(x) = \hat{\alpha} + \hat{\beta} x$.
To approximate the true regression function with the data, we want to approximate the projection of $Y$ onto the space of square-integrable random variables measurable with respect to the $\sigma$-algebra generated by $X$. But we have no idea what this $\sigma$-algebra looks like, except that it contains the data points we saw, so we just consider the space consisting of those data points instead. Then, the approximate projection will be found by choosing $\hat{\alpha}, \hat{\beta}$ to minimize the distance between our empirical regression function and the observed $y$-values (we use these noisy observations because we don’t know the exact values). This means choosing $\hat{\alpha}, \hat{\beta}$ to minimize
$$\sum_{i=1}^{n} \big(y_i - \hat{\alpha} - \hat{\beta} x_i\big)^2.$$
But this is exactly the same as linear least squares regression. The fact that taking this statistical perspective about regression gives the same result as just deciding to minimize the sum of the squares (which can be derived in other ways as well) is not a coincidence, but rather a consequence of the fact that conditional expectations are also projections.
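The projection picture shows up directly in the fitted line. The sketch below fits an intercept and slope (called `alpha_hat` and `beta_hat` here) by the standard closed-form least squares formulas on a tiny made-up data set; the five data points are arbitrary illustrations, not draws from a Gaussian model. It then checks the empirical analogue of the orthogonality lemma: the residuals are orthogonal to both the constant vector and the $x$-values, which is what it means for the fitted values to be a projection.

```python
from fractions import Fraction

# Hypothetical data, chosen only so the arithmetic is easy to follow.
xs = [Fraction(v) for v in (0, 1, 2, 3, 4)]
ys = [Fraction(v) for v in (1, 3, 2, 5, 4)]
n = len(xs)

x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Closed-form simple linear regression: slope = S_xy / S_xx.
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
beta_hat = s_xy / s_xx
alpha_hat = y_bar - beta_hat * x_bar


def sse(a, b):
    """Sum of squared errors for the line y = a + b * x."""
    return sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))


residuals = [y - alpha_hat - beta_hat * x for x, y in zip(xs, ys)]

# Orthogonality to the span of {1, x}: both inner products vanish exactly.
dot_const = sum(residuals)
dot_x = sum(r * x for r, x in zip(residuals, xs))

print(alpha_hat, beta_hat)  # prints 7/5 4/5
print(dot_const, dot_x)     # prints 0 0

# Nudging either coefficient can only increase the squared error.
eps = Fraction(1, 10)
assert sse(alpha_hat, beta_hat) <= sse(alpha_hat + eps, beta_hat)
assert sse(alpha_hat, beta_hat) <= sse(alpha_hat, beta_hat - eps)
```

Exact fractions are used instead of floats so that the two inner products come out as exactly zero rather than something on the order of machine precision.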