# Conditional Expectation as Projection

This short note gives a geometric perspective on conditional expectation, and a nice application of this perspective to regression.

### 1. Conditional Expectation as Projection

Theorem 1. Let ${(\Omega, \mathcal{F}, \mathbb{P})}$ be a probability space, let ${\tilde{\mathcal{F}}\subset \mathcal{F}}$ be a sub-${\sigma}$-algebra of ${\mathcal{F}}$, and let ${X}$ be an ${L^2}$ random variable. Then ${\mathbb{E}(X\mid \tilde{\mathcal{F}})}$ is the orthogonal projection of ${X\in L^2(\mathcal{F})}$ onto ${L^2(\tilde{\mathcal{F}})}$.

Before we prove this, let’s set the measure theory aside for a second and think intuitively: a common way to introduce conditional expectation without measure theory is to call it “the best guess you would make for ${X}$ given some information.” With this in mind, it makes sense that the conditional expectation would be the point in the space of quantities determined by your information that is “closest” to ${X}$. The nice thing is that this isn’t just a cute analogy – it’s a precise fact.

Proof: The norm on ${L^2(\Omega)}$ satisfies ${\|X\|^2=\mathbb{E}(X^2)}$, so it suffices to show that

$\displaystyle \mathbb{E}(X\mid \tilde{\mathcal{F}})=\arg\min_{Y\in L^2(\tilde{\mathcal{F}})} \mathbb{E}((X-Y)^2).$

To do this, we’ll need a lemma.

Lemma 2. If ${Z\in L^2(\tilde{\mathcal{F}})}$, then ${\mathbb{E}(Z(X-\mathbb{E}(X\mid \tilde{\mathcal{F}})))=0.}$

Proof: By Cauchy–Schwarz, ${ZX\in L^1(\Omega)}$, and since ${Z}$ is ${\tilde{\mathcal{F}}}$-measurable it can be pulled out of the conditional expectation, so we have

$\displaystyle Z\,\mathbb{E}(X\mid\tilde{\mathcal{F}})=\mathbb{E}(ZX\mid \tilde{\mathcal{F}})\implies \mathbb{E}(Z\,\mathbb{E}(X\mid \tilde{\mathcal{F}}))=\mathbb{E}(\mathbb{E}(ZX\mid\tilde{\mathcal{F}}))=\mathbb{E}(ZX),$

and then subtracting gives the result. $\Box$

With this lemma in hand, take some ${Y\in L^2(\tilde{\mathcal{F}})}$. Then, we have

$\displaystyle \mathbb{E}((X-Y)^2)=\mathbb{E}((X-\mathbb{E}(X\mid\tilde{\mathcal{F}})+\mathbb{E}(X\mid\tilde{\mathcal{F}})-Y)^2),$

and expanding this gives

$\displaystyle \mathbb{E}((X-\mathbb{E}(X\mid \tilde{\mathcal{F}}))^2)+ 2\mathbb{E}([X-\mathbb{E}(X\mid \tilde{\mathcal{F}})][\mathbb{E}(X\mid\tilde{\mathcal{F}})-Y])+\mathbb{E}((\mathbb{E}(X\mid\tilde{\mathcal{F}})-Y)^2).$

The cross-term vanishes by our lemma (take ${Z=\mathbb{E}(X\mid\tilde{\mathcal{F}})-Y}$). The first term does not depend on ${Y}$. The third term is always nonnegative, and is zero when ${Y=\mathbb{E}(X\mid \tilde{\mathcal{F}})}$, so this choice must be the minimizer. The result follows. $\Box$
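The proof can be checked numerically on a finite probability space, where random variables are just vectors and a sub-${\sigma}$-algebra is generated by a partition. The following sketch (all variable names are illustrative, not from the note) computes the conditional expectation as the block-wise average, then verifies Lemma 2 and the projection property:

```python
import numpy as np

# Uniform probability on a 6-point sample space; the sub-sigma-algebra is
# generated by the partition {0,1,2} | {3,4,5}.
rng = np.random.default_rng(0)
X = rng.normal(size=6)                      # an L^2 random variable (a vector)
blocks = [np.array([0, 1, 2]), np.array([3, 4, 5])]

# Conditional expectation: average X over each block of the partition.
cond_exp = np.empty_like(X)
for b in blocks:
    cond_exp[b] = X[b].mean()

# Lemma 2: for any measurable Z (i.e. constant on blocks),
# E[Z (X - E[X | partition])] = 0, up to floating point.
Z = np.empty_like(X)
Z[blocks[0]], Z[blocks[1]] = 2.0, -5.0
orthogonality = np.mean(Z * (X - cond_exp))

# Projection property: any other block-constant Y is farther from X in L^2.
Y = cond_exp + np.where(np.arange(6) < 3, 0.3, -0.7)  # still constant on blocks
assert abs(orthogonality) < 1e-12
assert np.mean((X - cond_exp) ** 2) < np.mean((X - Y) ** 2)
```

Any other block-constant perturbation of `cond_exp` gives the same strict inequality, which is exactly the Pythagorean decomposition from the proof.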

### 2. Regression

This new paradigm has at least one nice application: understanding linear regression better.

Suppose you have some data points that are generated from jointly distributed random variables ${X}$ and ${Y}$, which satisfy ${Y=\alpha X+\beta+\epsilon}$, where ${\alpha, \beta}$ are constants and ${\epsilon\sim N(0,1)}$ is Gaussian noise independent of ${X}$. We observe a bunch of data from this distribution, ${(x_1, y_1),\dots, (x_n, y_n)}$, where ${y_i=\alpha x_i+\beta+\epsilon_i}$.

We call ${\mathbb{E}(Y\mid X)=\alpha X+\beta}$ the true regression function, both because it is the true function and because a direct computation shows that it is the function that minimizes the expected squared error of a future prediction. In practice, we don’t know the joint distribution of ${X, Y}$, so we have to approximate the true regression function with an empirical regression function, ${\widehat{\mathbb{E}(Y\mid X)}=aX+b}$.

To approximate the true regression function with the data, we want to approximate the projection of ${Y}$ onto the ${\sigma}$-algebra generated by ${X}$. But we have no idea what this ${\sigma}$-algebra looks like; all we see is the sample, so we work with the data points instead. Replacing the expectation defining the ${L^2}$ distance with an empirical average over the sample, the approximate projection is found by choosing ${a, b}$ to minimize the empirical ${L^2}$ distance between the predictions ${ax_i+b}$ and the observed values ${y_i}$. This means choosing ${a,b}$ to minimize

$\displaystyle \sum_{i=1}^n (a x_i+b-y_i)^2.$

But this is exactly linear least squares regression. The fact that this statistical perspective on regression gives the same result as simply deciding to minimize the sum of the squares (which can be motivated in other ways as well) is not a coincidence, but rather a consequence of the fact that conditional expectation is a projection.
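To see the empirical projection in action, here is a short sketch (the setup and variable names are mine, chosen to match the notation above): it draws a sample from a linear model with Gaussian noise, fits ${a, b}$ by least squares, and checks that the answer agrees with solving the normal equations directly, i.e. with projecting the vector of ${y_i}$ onto the span of ${(x_1,\dots,x_n)}$ and the constant vector.

```python
import numpy as np

# Sample from the model y_i = alpha * x_i + beta + eps_i, eps_i ~ N(0, 1).
rng = np.random.default_rng(1)
alpha, beta = 2.0, -1.0                       # true (unknown in practice) parameters
n = 200
x = rng.normal(size=n)
y = alpha * x + beta + rng.normal(size=n)

# Least squares: project y onto span{x, 1} in R^n.
A = np.column_stack([x, np.ones_like(x)])
(a, b), *_ = np.linalg.lstsq(A, y, rcond=None)

# The same coefficients come from the normal equations (A^T A) [a, b]^T = A^T y,
# which are exactly the first-order conditions for minimizing the sum of squares.
a2, b2 = np.linalg.solve(A.T @ A, A.T @ y)
assert np.allclose([a, b], [a2, b2])
print(a, b)   # close to alpha, beta for a sample this large
```

The residual vector ${y - (a x + b)}$ is orthogonal to both regressors, which is the empirical counterpart of Lemma 2.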