Patterns of Ideas

"A mathematician, like a painter or poet, is a maker of patterns. If his patterns are more permanent than theirs, it is because they are made with ideas." – G.H. Hardy

Multiple Testing Procedures

Here’s something problematic: let’s say you run a hypothesis test at a significance level of {\alpha=0.05}. Then, assuming you ran the test correctly and met assumptions, the probability of a type I error is only 5%. But if you instead run 100 independent tests, each at significance level {\alpha=0.05}, with all of the null hypotheses true, the probability of making at least one type I error soars to

\displaystyle 1-(1-0.05)^{100}\approx 99\%,

and this only gets worse as you run increasingly many tests. This is the multiple testing problem, and this blog post discusses a couple of solutions to it as well as why they work. Most of this is taken from my notes for UChicago’s STAT 278, which I’d highly recommend to anyone interested in these sorts of things.

The set up for the rest of this post is as follows. We have p-values {p_1,\cdots, p_n} that we’ve acquired from some hypothesis tests. Some of these come from true null hypotheses, and are distributed uniformly on {[0,1]}. The rest come from false null hypotheses, and we know nothing about their distribution.

We want to run a test that will have a significance level of {\alpha}, not for each test individually, but for all of them together. This will turn out to mean different things for different methods.

1. Bonferroni Correction

This is a method for testing the global null hypothesis, which is the hypothesis that all of the null hypotheses that generated our p-values are true (so we take the intersection of the relevant events). We do this in the most naive way possible, and it turns out to work.

We just test each of the {p_i} at a significance level of {\alpha/n}, and reject the global null if any of the {p_i} are rejected. That is, we reject the global null if {p_i\leq \alpha/n} for some {i} and fail to reject otherwise.
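
Concretely, this is a one-line threshold check. Here’s a minimal Python sketch (using NumPy; the function name is made up and nothing here is from the original notes):

```python
import numpy as np

def bonferroni_global_test(p_values, alpha=0.05):
    """Reject the global null if any p-value is at most alpha / n."""
    p = np.asarray(p_values)
    return bool(np.any(p <= alpha / len(p)))

# With 100 uniform p-values (all nulls true), the test rejects with probability at most alpha.
rng = np.random.default_rng(0)
print(bonferroni_global_test(rng.uniform(size=100)))
```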

In what sense does this work?

Proposition 1. When testing the global null hypothesis using Bonferroni correction, the probability of making a type I error is {\leq\alpha}.

Proof: This is direct. The probability that we reject the global null is

\displaystyle \mathop{\mathbb P}\left(\min_{1\leq i\leq n} p_i\leq \frac{\alpha}{n}\right)=\mathop{\mathbb P}\left(\bigcup_{i=1}^n \left\{p_i\leq \frac{\alpha}{n}\right\}\right),

which by the union bound is

\displaystyle \leq \sum_{i=1}^n\mathop{\mathbb P}\left(p_i\leq \frac{\alpha}{n}\right)=n\cdot \frac{\alpha}{n}=\alpha,

so the probability of falsely rejecting the global null is at most {\alpha}. \Box

So that’s nice. But it’s also extremely conservative, and we can afford to do things that are a bit more powerful.

2. Simes’ Test

Here we’re also testing the global null, but a bit more cleverly. Simes’ test rejects the global null if there is 1 p-value that is {\leq \frac{\alpha}{n}}, or 2 p-values that are {\leq \frac{2\alpha}{n}}, or in general, if there are {k} p-values that are {\leq \frac{k\alpha}{n}} for any {1\leq k\leq n}. This will reject whenever Bonferroni rejects, but also in some other cases, so it’s more powerful. This power costs something, though: the bound we can prove without any assumptions on the joint distribution of the p-values isn’t as strong as for Bonferroni.
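
Here’s the analogous Python sketch for Simes’ test (again NumPy, with a made-up function name):

```python
import numpy as np

def simes_global_test(p_values, alpha=0.05):
    """Reject the global null if the k-th smallest p-value is <= k * alpha / n for some k."""
    p = np.sort(np.asarray(p_values))
    n = len(p)
    return bool(np.any(p <= alpha * np.arange(1, n + 1) / n))
```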

Proposition 2. When testing the global null hypothesis using Simes’ test, the probability of making a type I error is {\leq\alpha H_n\approx \alpha \log n}, regardless of the dependence among the p-values.

Proof: We assign a score to each {p}-value that measures what percent of the way it gets us to rejecting the global null. Let

\displaystyle s_i=\left\{\begin{aligned} 1&\text{ if }&& p_i\leq \frac{\alpha}{n}\\ \frac{1}{2}&\text{ if }&& \frac{\alpha}{n}< p_i\leq \frac{2\alpha}{n}\\ &\vdots \\ \frac{1}{n}&\text{ if }&& \frac{\alpha(n-1)}{n}< p_i\leq {\alpha}\\ 0&\text{ if } &&p_i>\alpha\end{aligned}\right.

Then, if Simes rejects, there are {k} of the {p_i} that are {\leq \frac{k\alpha }{n}} for some {k}; each of these p-values has score at least {\frac{1}{k}}, so the sum of the {s_i} is {\geq 1}. So now we can bound the probability of that happening.

The expected value of any given score is

\displaystyle \mathop{\mathbb E}[s_i]=1\cdot \frac{\alpha}{n}+\cdots +\frac{1}{n}\cdot \frac{\alpha}{n}=\frac{\alpha}{n}H_n.

Then, by Markov’s inequality,

\displaystyle \mathop{\mathbb P}(\text{Simes rejects})\leq \mathop{\mathbb P}\left(\sum_{i=1}^n s_i\geq 1\right)\leq \mathop{\mathbb E}\left[\sum_{i=1}^n s_i\right] =n\cdot \frac{\alpha}{n}H_n =\alpha H_n.

\Box

Note that for large {n}, the {\log n} factor counts a lot, and the bound is sharp. The equality construction is a bit of a pain, but the idea is to choose the joint distribution of the {p_i} to make them negatively dependent.

3. Holm-Bonferroni

This test is different from the previous two in that it does not test the global null, but instead individually tests each of the {p_i} and then controls the family-wise error rate (FWER), which is the probability of making at least one type I error across all the p-values. As an added bonus, we can accept/reject individual p-values rather than accepting/rejecting them all at once, so we have a much better sense of how significant the results are, taken together.

The idea of Holm-Bonferroni is to run Bonferroni until acceptance. We first run Bonferroni on {p_1,\cdots, p_n}. If we do not reject the global null, we stop. If we do, then we reject the smallest p-value, and repeat this procedure with the other p-values, continuing until we do not reject the global null.

Equivalently, if we order the p-values as {p_{(1)},\cdots, p_{(n)}}, this is the same as checking if

\displaystyle p_{(k)}\leq \frac{\alpha}{n+1-k}

for each {k=1,\cdots, n} and rejecting all the p-values before the first {k} that fails this test.
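
A Python sketch of this step-down procedure (hypothetical helper name, not from the original notes) might look like:

```python
import numpy as np

def holm_bonferroni(p_values, alpha=0.05):
    """Return a boolean array marking which p-values are rejected."""
    p = np.asarray(p_values)
    n = len(p)
    reject = np.zeros(n, dtype=bool)
    for step, idx in enumerate(np.argsort(p)):
        if p[idx] <= alpha / (n - step):   # threshold alpha / (n - k + 1) at the k-th smallest
            reject[idx] = True
        else:
            break                          # stop at the first acceptance
    return reject
```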

Proposition 3. For Holm-Bonferroni, we have {\text{FWER}\leq \alpha}.

Proof: Let {n_0=\#\text{ of nulls}}. Then

\displaystyle \mathop{\mathbb P}\left(\min_{\text{null }i} p_i\leq \frac{\alpha}{n_0}\right)\leq \alpha,

from a union bound. On this event we might make false rejections, but it occurs with probability at most {\alpha}. So it suffices to show that outside this event, there are no false rejections.

Suppose {p_i>\frac{\alpha}{n_0}} for all nulls {i}. Then, take the ordered {p}-values {p_{(1)},\cdots, p_{(n)}}. Let {p_{(k)}} be the first {p}-value that comes from a null. We know that this {p}-value can’t be too small, and we know it can’t happen super late in the list, because there are at most {n-n_0} non-nulls. This means that {k\leq n-n_0+1}. There are two cases now: the procedure stops before step {k} (which means no false rejections, because there are no nulls before step {k}). Or, it gets to step {k}, and we check if {p_{(k)}\leq \frac{\alpha}{n+1-k}}. But we know that {p_{(k)}>\frac{\alpha}{n_0}} and {k\leq n-n_0+1}, so {\frac{\alpha}{n+1-k}\leq \frac{\alpha}{n_0}}. By our assumption on the null {p_i} then, we stop at step {k}, and there are no false rejections. \Box

As with the Bonferroni correction, we don’t need any assumptions of independence.

4. Benjamini-Hochberg

This is another multiple testing procedure, but instead of controlling the FWER, it controls

\displaystyle \text{FDR}=\mathop{\mathbb E}\left[\frac{\#\text{ null p-values rejected}}{\#\text{ p-values rejected}}\right],

the false discovery rate. The quantity we’re taking the expectation of is the false discovery proportion (FDP), with the convention that the FDP is 0 when no p-values are rejected.

Here, we build on Simes’ test: let {\hat{k}} be the largest {k} such that there are {k} p-values that are {\leq \alpha\frac{k}{n}}, and reject those {\hat{k}} p-values. Equivalently, we choose the threshold {\alpha\frac{\hat{k}}{n}}, and then reject all the p-values at or below this threshold.
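
And a Python sketch of the procedure itself (same caveats as the earlier snippets):

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Reject the k-hat smallest p-values, where k-hat is the largest k with p_(k) <= alpha * k / n."""
    p = np.asarray(p_values)
    n = len(p)
    order = np.argsort(p)
    below = p[order] <= alpha * np.arange(1, n + 1) / n
    reject = np.zeros(n, dtype=bool)
    if below.any():
        k_hat = np.nonzero(below)[0].max() + 1   # largest k satisfying the condition
        reject[order[:k_hat]] = True
    return reject
```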

Proposition 4. If the p-values are independent, then Benjamini-Hochberg gives {\text{FDR}=\alpha\frac{n_0}{n}\leq \alpha}.

Proof: Suppose we make {\hat{k}} rejections, so our threshold is {\alpha\frac{\hat{k}}{n}}, meaning that {p_i} is rejected iff {p_i\leq \alpha\frac{\hat{k}}{n}}. The trouble is that here, the RHS also depends on {p_i}, so we have to go through some gymnastics to get around this.

Let {\hat{k}_i} be the largest {k} such that there are {k-1} many p-values {p_j} with {j\neq i} that are {\leq \alpha\frac{k}{n}}. Call this statement the {\text{BH}_i} statement, and the statement that there are {k} many p-values {\leq \alpha\frac{k}{n}} the BH statement.

Claim. If {p_i} is rejected, then {\hat{k}=\hat{k}_i}.


Now, suppose {p_i} is rejected. Then {p_i \leq \alpha \frac{\hat{k}}{n}}, and BH holds for {k=\hat{k}}, so there are {\hat{k}-1} of the {p_j}, {j\neq i}, that are also {\leq \alpha \frac{\hat{k}}{n}}. Thus BH{_{\text{i}}} holds for {k=\hat{k}}, and so {\hat{k}_i\geq \hat{k}}.

On the other hand, BH{_{\text{i}}} holds at {k=\hat{k}_i} by definition, so there are {\hat{k}_i-1} of the {p_j}, {j\neq i}, that are {\leq \alpha \frac{\hat{k}_i}{n}}. Since {p_i\leq \alpha \frac{\hat{k}}{n}\leq \alpha \frac{\hat{k}_i}{n}} as well, there are {\hat{k}_i} p-values that are {\leq \alpha \frac{\hat{k}_i}{n}}, so BH is true at {k=\hat{k}_i}, and thus {\hat{k}\geq \hat{k}_i}. It follows that {\hat{k}=\hat{k}_i}.

With this claim, we can move onto the main result. We have

\displaystyle \text{FDR}=\mathop{\mathbb E}[\text{FDP}]=\mathop{\mathbb E}\left[\frac{\sum_{i\text{ null}} \mathbf{1}\{p_i\text{ is rejected}\}}{\max(1, \hat{k})}\right]=\sum_{i\text{ null}} \mathop{\mathbb E}\left[\frac{\mathbf{1}\{p_i\text{ is rejected}\}}{\max(1,\hat{k})}\right].

We can replace the denominator with {\hat{k}_i} without changing the sum, because when they differ, the numerator is 0, so it doesn’t matter. Then, this is

\displaystyle \sum_{i\text{ null}}\mathop{\mathbb E}\left[\frac{\mathbf{1}\{p_i\text{ rejected}\}}{\hat{k}_i}\right]\leq \sum_{i\text{ null}}\mathop{\mathbb E}\left[\frac{\mathbf{1}\{p_i\leq \alpha \frac{\hat{k}_i}{n}\}}{\hat{k}_i}\right].

This is an inequality because if the first numerator is 1, the second is definitely 1, but if the first is 0, the second might not be.

Now, we use the tower law to write this as

\displaystyle \sum_{i\text{ null}} \mathop{\mathbb E}\left[\mathop{\mathbb E}\left[\frac{\mathbf{1}\{p_i\leq \alpha\cdot \hat{k}_i/n\}}{\hat{k}_i}\Big| p_j, j\neq i\right]\right].

Since {\hat{k}_i} is a function of the {p_j}, {j\neq i}, alone, and {p_i} is a null p-value independent of them (hence uniform on {[0,1]}), that conditional expectation is just

\displaystyle \frac{\mathop{\mathbb P}\{p_i\leq \alpha\hat{k}_i/n\mid p_j, j\neq i\}}{\hat{k}_i}=\frac{\alpha \hat{k}_i/n}{\hat{k}_i}=\alpha/n.

This is a constant, and the expected value of that is just {\alpha/n}. When you add this over all the null terms, you get {\alpha\frac{n_0}{n}}, as desired. \Box

Here, we actually get a better bound than {\alpha}, which might seem like a good thing, but also means we’re not making as many discoveries as we could be. We can correct for this by estimating {n_0}, and then running BH with {\alpha\frac{n}{n_0}}, so that the FDR for this procedure will be {\leq \alpha}.

Although we used independence here, we can make do without it, albeit at the cost of a logarithmic factor.

Proposition 5. When running Benjamini-Hochberg, we have {\text{FDR}\leq \alpha\frac{n_0}{n}H_n}.

Proof: As before, we define a score {s_i} for each {p_i} that is 1 if {0\leq p_i\leq \alpha \frac{1}{n}}, {\frac{1}{2}} if {\alpha\frac{1}{n}<p_i\leq \alpha \frac{2}{n},} and so on, giving a score of 0 if {p_i>\alpha}.

We proved above that for a null {p}-value, the expected score is

\displaystyle \frac{\alpha}{n}\left(1+\cdots+\frac{1}{n}\right).

Now, we write

\displaystyle \text{FDR}=\sum_{i\text{ null}} \mathop{\mathbb E}\left[\frac{\mathbf{1}\{p_i\text{ rejected}\}}{\max(1, \hat{k})}\right].

If {p_i} is rejected, then {p_i\leq \alpha \hat{k}/n}, so its score is {\geq 1/\hat{k}}. Put another way,

\displaystyle \frac{\mathbf{1}\{p_i\text{ rejected}\}}{\max(1, \hat{k})}\leq s_i.

Thus,

\displaystyle \text{FDR}\leq \sum_{i\text{ null}} \mathop{\mathbb E}[s_i]=\alpha\frac{n_0}{n}\left(1+\cdots+\frac{1}{n}\right).

\Box

Thoughts on Live TeXing

I decided the summer before coming to college that I wanted to live-TeX notes for my math classes, and dutifully set out to seek advice from the Internet. The results were uncharacteristically disappointing. Only a handful of blog posts and articles discuss the subject, and that handful really only addresses the logistics involved. So this post is a more spiritual take on live-TeX’ing, based on my experiences.

(Caveat emptor: my experiences are probably not universal. Your mileage may vary.)

Before I say anything about TeX’ing, a quick summary of the alternatives. Taking notes on paper, although apparently effective for some people, doesn’t work for me because of my poor handwriting. Not taking notes, although certainly effective in some situations, doesn’t seem like a good idea in every situation. So for me, the choice is between TeX’ing, and sitting back and listening.

TeX’ing notes has the advantage of forcing you to pay attention to the superstructure of a lecture. There’s usually more to a lecture than the union of all the theorems, definitions, and proofs, and typing things like “We want this to be true in general, but…,” or “The punchline is that…,” or “With this example in mind, we define…” quickly draws your attention to them. You could certainly do this with a pen and paper too, but they’re usually too slow to give you time to jot down these longer ideas (unless that’s all you do). The speed of typing gives you extra time to spend on exposition, and crafting this exposition on the fly solidifies your ideas about the material.

This advantage is tempered, though, by the difficulties that can arise from TeX’ing long technical arguments. One flavor of such arguments, that involves repeated manipulations of a single expression, is very amenable to being TeX’ed – copy-pasting is significantly faster than rewriting the entire expression. But another flavor involves many different expressions that hardly resemble each other, and these are deeply problematic, especially if matrices rear their ugly head or if subscripts are floating around. I’ve made the mistake before of getting caught up in transcription, only to realize that I have no clue what the math I’m transcribing means.

Another issue is that superstructure isn’t always what you should be paying attention to in a lecture. There’s a certain kind of wild, edge-of-your-seat lecture, where it’s all you can do to hang on and follow the speaker line by line. In these talks, trying to suss out those big ideas not only fails, but actually detracts from your ability to understand the basic material. On the other end of the spectrum, some lectures are on material that you’ve seen or are completely familiar with, and have already internalized the yoga of, so this advantage completely evaporates.

A great deal also depends on the lecturer. Obviously good lecturers are easy to take notes on, and bad lecturers hard, but many otherwise insignificant quirks can make a particular lecturer a pain to live-TeX.

Some lecturers are fond of nesting things. They’ll start a theorem, realize they need a definition, make a few remarks about the definition, maybe prove a claim. This is all fine at the board, but if you’re working (as I often do) with boxes around theorems and definitions on the page, it doesn’t translate well at all.

Other lecturers are prone to making corrections to their board work that are easy at the board, but hard on a computer. Swapping entries in a large matrix, adding forgotten tildes, or erasing and rewriting part of an expression while verbally explaining the manipulation all fall into this category.

And finally, there are lecturers who draw pictures. Simpler pictures, or complicated pictures drawn once and left alone are usually fine, and can be recreated afterward if they’re essential, but sometimes you have lectures that basically consist of progressively adding to a single image over the course of an hour, and these are really a lost cause. I have no idea how to handle them.

Of course, it’s very hard to synthesize all of these ideas into a hard and fast rule for deciding whether or not to TeX a talk. The system I’ve developed is that unless I’m very confident that it won’t work out, I’ll err on the side of trying to TeX, but am also very forgiving of myself for giving up halfway through a talk. It’s not a perfect system, but it gives reasonable results, especially as I become more aware of the red flags that tell me to stop TeX’ing.

Martingale Convergence

Last quarter, I did a DRP  on martingales, and at the start of this quarter, gave a short talk on what I learned. Here are the notes from that talk.

1. Conditional Expectation

Definition 1. Suppose {(\Omega, \mathcal{F}_o, \mathbb{P})} is a probability space, {\mathcal{F}\subset \mathcal{F}_o} is a sub {\sigma}-algebra, and {X} is a random variable.

  1. The conditional expectation of {X} given {\mathcal{F}}, {E(X\mid \mathcal{F})}, is any {\mathcal{F}}-measurable random variable {Y} such that

    \displaystyle \int_A X\, d\mathbb{P}=\int_A Y\, d\mathbb{P}\text{ for all } A\in \mathcal{F}.

  2. If {Y} is another random variable, then {E(X\mid Y)} is defined as {E(X\mid \mathcal{F})} where {\mathcal{F}} is the {\sigma}-algebra generated by {Y}.

Fact. For integrable {X}, the conditional expectation exists, is unique up to almost-sure equality, and is integrable.

Intuitively, we can think of {E(X\mid \mathcal{F})} as the best guess of the value of {X} given the information available in {\mathcal{F}}.

Example 1.

  1. If {X} is {\mathcal{F}}-measurable, then {E(X\mid \mathcal{F})=X}.
  2. If {X} and {Y} are independent, then {E(X\mid Y)=E(X)}.
  3. Let {X_1,X_2,\cdots} be independent random variables with mean {\mu}, and let {S_n=X_1+\cdots+X_n} be the partial sums. Then, if {m>n},

    \displaystyle E(S_m\mid S_n)=E(X_m+\cdots+X_{n+1}+S_n\mid S_n)=E(X_m\mid S_n)+\cdots+E(X_{n+1}\mid S_n)+E(S_n\mid S_n)=\mu(m-n)+S_n.

    The idea is that you know your position at {S_n}, and you take {m-n} steps whose sizes are, on average, {\mu}, so your best guess for your position is {S_n+\mu(m-n)}.

2. Martingales

A martingale is a model of a fair game.

Definition 2. Consider a filtration (increasing sequence of {\sigma}-algebras) {\mathcal{F}_n} and a sequence of random variables {X_n}, each measurable with respect to {\mathcal{F}_n} and integrable. Then, if {E(X_{n+1}\mid \mathcal{F}_n)=X_n} for all {n}, we say {X_n} is a martingale.

Example 2. Let {X_1,X_2,\cdots} be independent random variables that take only the values {-1} and {1} with probability {1/2} each. Then {S_n=X_1+\cdots+ X_n} is a martingale, because

\displaystyle E(S_{n+1}\mid S_n)=E(X_{n+1}+S_n\mid S_n)=E(X_{n+1}\mid S_n)+E(S_n\mid S_n)=E(X_{n+1})+S_n=S_n.

Theorem 3. If {X_n} is a martingale with {\sup E(X_n^+)<\infty}, then {X_n} converges a.s. to a limit {X} with {E(|X|)<\infty}.

I’ll only sketch this proof, because even though the idea is nice, the details are a little annoying. The idea is to set up a way of betting on a martingale, show that you can’t make money off such a betting system, and then use this to draw conclusions about the martingale’s behavior.

Definition 4. Let {\mathcal{F}_n} be a filtration. A sequence {H_m} of random variables is said to be predictable if {H_m} is {\mathcal{F}_{m-1}} measurable for all {m}.

That is, the value of {H_m} is already known with certainty at time {m-1}. Then, if we “bet” an amount {H_m} on the martingale {X_m}, our total winnings will be

\displaystyle (H\cdot X)_n=\sum_{m=1}^n H_m(X_m-X_{m-1}).

If {X_n} is a supermartingale and {H_n} is bounded and non-negative, then {(H\cdot X)_n} is a supermartingale as well.

Now, fix an interval {(a,b)} and choose {H_n} in the following way: if {X_n<a}, then bet 1, and continue to do so until {X_n>b}. Then bet zero until you fall back below {a}, and repeat. Every time you go from below {a} to above {b}, you will make a profit of at least {(b-a)}. Let {U_n} be the number of these upcrossings that occur. Then, you can use the previous fact, together with the bound on {\sup E(X_n^+)}, to show that the number of these upcrossings is finite with probability 1. Since this is true for arbitrary {(a,b)}, the {X_n} must converge.

Finally, we give a brief application of the martingale convergence theorem.

Proposition 5. Let {X_n} be a sequence of independent random variables such that {P(X_n=1)=P(X_n=-1)=\frac{1}{2}.} Then {{\sum_{j=1}^{\infty} \frac{X_j}{j}}} converges with probability 1.

Proof: Let {{S_n=\sum_{j=1}^n\frac{X_j}{j}}.} It’s easy to check that {S_n} is a martingale, and {E(S_n^+)\leq E|S_n|\leq \sqrt{E(S_n^2)}=\left(\sum_{j=1}^n\frac{1}{j^2}\right)^{1/2}\leq \frac{\pi}{\sqrt 6}}, so {\sup_n E(S_n^+)<\infty} and the {S_n} converge almost surely. Thus the random harmonic series converges almost surely. \Box
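
As a quick illustration (a Python simulation I’m adding here, not part of the original talk), a few sample paths of the partial sums visibly settle down:

```python
import numpy as np

rng = np.random.default_rng(0)

# Watch a few sample paths of S_n = sum_{j<=n} X_j / j.
n = 100_000
for _ in range(3):
    signs = rng.choice([-1.0, 1.0], size=n)
    path = np.cumsum(signs / np.arange(1, n + 1))
    # The partial sums barely move between n/2 and n, consistent with a.s. convergence.
    print(path[-1], abs(path[-1] - path[n // 2]))
```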

Mercer’s Theorem and SVMs

In a funny coincidence, this post has the same basic structure as my previous one: proving some technical result, and then looking at an application to machine learning. This time it’s Mercer’s theorem from functional analysis, and the kernel trick for SVMs. The proof of Mercer’s theorem mostly follows Lax’s Functional Analysis.

1. Mercer’s Theorem

Consider a real-valued function {K(s,t)} on {[0,1]^2}, and the corresponding integral operator {\mathbf{K}: L^2[0,1]\rightarrow L^2[0,1]} given by

\displaystyle (\mathbf{K} u)(s)=\int_0^1 K(s,t) u(t)\, dt.

We begin with two facts connecting the properties of {K} to the properties of {\mathbf{K} }.

Proposition 1. If {K} is continuous, then {\mathbf{K} } is compact.

Proof: Consider a bounded sequence {\{f_n\}_{n=1}^{\infty} \subset L^2[0,1]}. We wish to show that the image of this sequence, {\{\mathbf{K} f_n\}_{n=1}^{\infty}}, has a convergent subsequence. We show that {\{\mathbf{K} f_n\}} is equicontinuous (it is also uniformly bounded, since {|\mathbf{K} f_n(x)|\leq \max|K|\cdot \|f_n\|_{L^2}} by Cauchy-Schwarz), and Arzela-Ascoli then gives a uniformly convergent subsequence, which in turn gives that {\mathbf{K} } is compact.

Since {K} is continuous on the compact set {[0,1]^2}, it is uniformly continuous, so for every {\epsilon>0}, there exists {\delta>0} such that

\displaystyle |x-\tilde{x}|<\delta \implies |K(x,y)-K(\tilde{x}, y)|\leq \epsilon.

Then, if {|x-\tilde{x}|<\delta}, we have that

\displaystyle |\mathbf{K} f_n(x)- \mathbf{K} f_n(\tilde{x})| = \left| \int_0^1 \left(K(x, t)-K(\tilde{x}, t)\right) f_n(t)\, dt \right|,

which is bounded by

\displaystyle \int_0^1 |K(x,t)-K(\tilde{x},t)| |f_n(t)| \, dt \leq \epsilon \int_0^1 |f_n(t)|\, dt\leq \epsilon M,

where {M} is an upper bound on the {L^2} norms of the {f_n} (note that {\int_0^1 |f_n|\leq \|f_n\|_{L^2}} by Cauchy-Schwarz). The equicontinuity of the {\mathbf{K} f_n}, and thus the compactness of {\mathbf{K} }, follows. \Box

Proposition 2. If {K} is symmetric, then {\mathbf{K} } is self-adjoint.

Proof: We directly compute that

\displaystyle \langle \mathbf{K} u, v\rangle = \int_0^1\int_0^1 K(s,t)\, u(t) v(s)\, dt\, ds

and

\displaystyle \langle u, \mathbf{K} v\rangle = \int_0^1\int_0^1 K(t,s)\, u(t) v(s)\, dt\, ds,

which will be equal if {K(s,t)=K(t,s)}, so {\mathbf{K} } is self-adjoint in this case. \Box

Thus, when {K} is continuous and symmetric, the operator {\mathbf{K} } is a compact self-adjoint operator, and so we can apply the spectral theorem to find a set of orthonormal eigenfunctions {e_j} with real eigenvalues {\kappa_j} that form a basis for {L^2[0,1]}. The following theorem connects {K} to these eigenfunctions and eigenvalues in the case where {\mathbf{K} } is positive.

Theorem 3 (Mercer). If {K} is symmetric and continuous, and the associated operator {\mathbf{K} } is positive i.e. {\langle \mathbf{K} u, u\rangle \geq 0} for all {u}, then we can write

\displaystyle K(s,t)=\sum_{j=1}^{\infty} \kappa_j e_j(s)e_j(t),

where the {\kappa_j} are the eigenvalues and the {e_j} the eigenfunctions of {\mathbf{K} }.

To prove this, we’ll need a lemma.

Lemma 4. If {\mathbf{K} } is positive, then {K} is non-negative on the diagonal.

Proof: Suppose for the sake of contradiction that {K(r,r)<0} for some {r}. By continuity, {K(s,t)<0} for all {(s,t)} in a small neighborhood of {(r,r)}, so choose {u} to be a non-negative bump function, not identically zero, supported in a small interval around {r}. Then,

\displaystyle \langle \mathbf{K} u, u\rangle = \int_0^1\int_0^1 K(s,t) u(t) u(s)\, ds \, dt

will be negative, contradicting the positivity of {\mathbf{K} }. \Box

Proof of Mercer’s Theorem: Define a sequence of functions

\displaystyle K_N(s,t)=\sum_{j=1}^N \kappa_j e_j(s) e_j(t),

and let {\mathbf{K} _N} be the associated sequence of integral operators. We first show that the {K_N} converge uniformly.

The operator {\mathbf{K} -\mathbf{K} _N} has eigenfunctions {e_j} and eigenvalues {\kappa_j} for {j>N}, so it is a positive operator. But then {K-K_N} must be non-negative on the diagonal by our lemma, and so

\displaystyle K(s,s)-\sum_{j=1}^N \kappa_j e_j^2(s)\geq 0\implies K(s,s)\geq \sum_{j=1}^N \kappa_j e_j^2(s)

for all {N}. Since the sum on the right is non-decreasing in {N} and bounded above by {K(s,s)}, it must converge for every {s}. This gives pointwise monotone convergence of the sequence of partial sums, which (by Dini’s theorem) can be upgraded to uniform convergence in {s}.

But now by Cauchy-Schwarz,

\displaystyle \left|\sum_{j=1}^N \kappa_j e_j(s) e_j(t)\right| \leq \left(\sum_{j=1}^N \kappa_j e_j^2(s)\right)^{1/2}\left(\sum_{j=1}^N \kappa_j e_j^2(t)\right)^{1/2},

and since both of the series on the right converge uniformly, the series on the left does as well.

Let {K_{\infty}} be the limit of the {K_N}. Then, by definition, {\mathbf{K} } and {\mathbf{K} _{\infty}} agree on all of the {e_j} and their linear combinations, and since the {e_j} span the space, this means {\mathbf{K} } and {\mathbf{K} _{\infty}} agree everywhere, and so they must be the same operator. But this means they have the same kernel, so {K= K_{\infty}}, completing the proof. \Box

2. The Kernel Trick

The basic idea of the kernel trick is this: let’s say you have some data points {\mathbf{x}_1, \mathbf{x}_2,\cdots, \mathbf{x}_N}. If you want to transform these data points by some map {\phi}, you have to apply {\phi} to all of them, and if you want to run an SVM, you have to compute all {N^2} dot products {\phi(\mathbf{x}_i)\cdot \phi(\mathbf{x}_j)}. For large {N}, this can be a huge pain, so the dream is to choose {\phi} so that you can compute these dot products without actually having to transform the data i.e. find some {K} such that

\displaystyle \phi(\mathbf{x}_i)\cdot \phi(\mathbf{x}_j)= K(\mathbf{x}_i, \mathbf{x}_j).

Mercer’s theorem allows us to do exactly this, but in the other direction – if we choose {K} so that the associated integral operator {\mathbf{K} } is positive definite, then we can write

\displaystyle K(\mathbf{x}_i, \mathbf{x}_j)=\sum_{k=1}^{\infty} \kappa_k e_k(\mathbf{x}_i) e_k(\mathbf{x}_j)= \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j)\rangle,

where

\displaystyle \phi: \mathbf{x}\mapsto \left[\begin{array}{ccc} \sqrt{\kappa_1}\, e_1(\mathbf{x})& \sqrt{\kappa_2}\, e_2(\mathbf{x})&\cdots \end{array}\right].

So instead of trying to choose {\phi} so that {K} exists, we just choose an appropriate {K}, and don’t even need to think about {\phi}. In particular, we can work with the data in another feature space without running directly into computational barriers.
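
As a toy illustration (a hypothetical quadratic kernel in Python with NumPy, not the Mercer expansion itself), the kernel value agrees with the dot product of explicitly transformed points, while never forming {\phi}:

```python
import numpy as np

def phi(x):
    """Explicit feature map for the kernel K(x, y) = (x . y)^2 on R^2."""
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2) * x1 * x2, x2 * x2])

def K(x, y):
    return float(np.dot(x, y)) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
print(np.dot(phi(x), phi(y)))   # 1.0, via the explicit map
print(K(x, y))                  # 1.0, without ever forming phi
```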

Conditional Expectation as Projection

This short note gives a geometric perspective on conditional expectation, and a nice application of this fact to regression.

1. Conditional Expectation as Projection

Theorem 1. Let {(\Omega, \mathcal{F}, \mathbb{P})} be a probability space, let {\tilde{\mathcal{F}}\subset \mathcal{F}} be a sub {\sigma}-algebra of {\mathcal{F}}, and let {X} be an {L^2} random variable. Then {\mathbb{E}(X\mid \tilde{\mathcal{F}})} is the projection of {X\in L^2(\mathcal{F})} onto {L^2(\tilde{\mathcal{F}})}.

Before we prove this, let’s set the measure theory aside for a second and think intuitively: a common way to introduce conditional expectation without measure theory is to call it “the best guess you would make for {X} given some information.” With this in mind, it makes sense that the conditional expectation would be the point in the information you have that is “closest” to {X}. But the nice thing is that this isn’t just a cute analogy – it’s a precise fact.

Proof: The {L^2} norm is {\|Z\|_2=(\mathbb{E}(Z^2))^{1/2}}, so it’s sufficient to show that

\displaystyle \mathbb{E}(X\mid \tilde{\mathcal{F}})=\arg\min_{Y\in L^2(\tilde{\mathcal{F}})} \mathbb{E}\left[(X-Y)^2\right].

To do this, we’ll need a lemma.

Lemma 2. If {Z\in L^2(\tilde{\mathcal{F}})}, then {\mathbb{E}(Z(X-\mathbb{E}(X\mid \tilde{\mathcal{F}})))=0.}

Proof: By Cauchy-Schwarz, {ZX\in L^1(\Omega)}, so we have

\displaystyle Z\mathbb{E}(X\mid\tilde{\mathcal{F}})=\mathbb{E}(ZX\mid \tilde{\mathcal{F}})\implies \mathbb{E}(Z\mathbb{E}(X\mid \tilde{\mathcal{F}}))=\mathbb{E}(\mathbb{E}(ZX\mid\tilde{\mathcal{F}}))=\mathbb{E}(ZX),

and then subtracting gives the result. \Box

With this lemma in hand, take some {Y\in L^2(\tilde{\mathcal{F}})}. Then, we have

\displaystyle \mathbb{E}((X-Y)^2)=\mathbb{E}((X-\mathbb{E}(X\mid\tilde{\mathcal{F}})+\mathbb{E}(X\mid\tilde{\mathcal{F}})-Y)^2),

and expanding this gives

\displaystyle \mathbb{E}((X-\mathbb{E}(X\mid \tilde{\mathcal{F}}))^2)+ 2\mathbb{E}([X-\mathbb{E}(X\mid \tilde{\mathcal{F}})][\mathbb{E}(X\mid\tilde{\mathcal{F}})-Y])+\mathbb{E}((\mathbb{E}(X\mid\tilde{\mathcal{F}})-Y)^2).

The cross term vanishes because of our lemma (applied with {Z=\mathbb{E}(X\mid\tilde{\mathcal{F}})-Y\in L^2(\tilde{\mathcal{F}})}). The first term does not depend on {Y}. The third term is non-negative, and is zero exactly when {Y=\mathbb{E}(X\mid \tilde{\mathcal{F}})}, so this choice of {Y} minimizes the whole expression. The result follows. \Box

2. Regression

This new paradigm has at least one nice application: understanding linear regression better.

Suppose you have some data points generated from jointly distributed random variables {X} and {Y} satisfying {Y=\alpha X+\beta+\epsilon}, where {\alpha, \beta} are constants and {\epsilon\sim N(0,1)} is Gaussian noise independent of {X}. We observe a bunch of data from this distribution, {(\mathbf{x}_1, y_1),\cdots, (\mathbf{x}_n, y_n)}, so that {y_i=\alpha \mathbf{x}_i+\beta+\epsilon_i}.

We call {\mathbb{E}(Y\mid X)=\alpha X+\beta} the true regression function, both because it is the true function and because a direct computation gives that it is the function that minimizes the expected squared error of a future prediction. In practice, we don’t know the distributions of {X, Y}, so we have to approximate the true regression function with an empirical regression function, {\widehat{\mathbb{E}(Y\mid X)}=aX+b}.

To approximate the true regression function with the data, we would like to project {Y} onto {L^2} of the {\sigma}-algebra generated by {X}. But we only see the distribution of {(X, Y)} through the sample, so we replace the population {L^2} distance with the empirical one: we restrict to functions of the form {aX+b} and choose {a, b} to minimize the empirical {L^2} distance between the fitted values {a\mathbf{x}_i+b} and the observed {y_i} (which are noisy versions of the values of the true regression function). This means choosing {a,b} to minimize

\displaystyle \sum_{i=1}^n (a\mathbf{x}_i+b-y_i)^2.

But this is exactly the same as linear least squares regression. The fact that taking this statistical perspective about regression gives the same result as just deciding to minimize the sum of the squares (which can be derived in other ways as well) is not a coincidence, but rather a consequence of the fact that conditional expectations are also projections.
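
As a small numerical illustration (a Python sketch with NumPy; the true coefficients below are made up), least squares recovers the coefficients of the true regression function from noisy data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical model: Y = 2 X + 1 + noise.
x = rng.uniform(-1, 1, size=200)
y = 2.0 * x + 1.0 + rng.normal(size=200)

# Minimize sum_i (a x_i + b - y_i)^2, i.e. project y onto span{x, 1}.
A = np.column_stack([x, np.ones_like(x)])
(a_hat, b_hat), *_ = np.linalg.lstsq(A, y, rcond=None)
print(a_hat, b_hat)   # close to the true (alpha, beta) = (2, 1)
```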

Sabermetrics, but for Pool

1. Introduction

I spend a lot of time playing pool during the school year. Unfortunately, I haven’t had access to pool tables for a while, so I haven’t been able to play. Since I can’t actually play pool, I decided instead to work on a problem involving it that occurred to me while playing a game earlier this year.

Some “team” sports are not really team sports in the sense that what matters is not the total strength of your team, but the strength of your best player. A good example of this is chess: when two people play together, the stronger player can just overrule the weaker player, and so a pair of chess players together is roughly as strong as the stronger player (this isn’t a perfect model, admittedly – for example, two grandmasters against a single grandmaster is probably in favor of the two, but this is definitely true when the abilities of the three players are well separated).

Pool is not like this at all though. When you play with another person, you alternate taking shots, and so playing with someone weaker than you can be a substantial handicap (and similarly for playing with someone much stronger than you). Playing with another person then, amounts to taking some mix of your two abilities. In this post, I try to quantify this mix, and ultimately answer the following question:

Question. Consider three players, {A, B, C}, with their skills quantified as {s_A, s_B, s_C}. What kind of condition can be given on {s_C} (in terms of {s_A, s_B}) to guarantee that {C} has an advantage over {A} and {B} combined?

2. Set-Up

Before we can do anything else, we need to develop a reasonable statistical model of how pool works. The key assumption underlying the entire discussion that follows is that there is a fixed probability that a player makes any given shot. This fixed probability will correspond to the quantified skill {s_A} from before. This probability should somehow be weighted by the frequency of shots e.g. never making a shot that only shows up {10\%} of the time should count less towards this probability than never making a shot that shows up {50\%} of the time. Put another way, this is meant to be an empirical probability: you could estimate it by playing some large number of games, and measuring the proportion of shots you attempted that you made.

Some comments on this assumption: it’s pretty unrealistic. In particular, the game becomes much harder as it progresses, because there are fewer balls available to pocket. Furthermore, it’s not always in your best interest to pocket a ball. These problems can maybe be addressed by introducing some kind of decay factor or something, but this analysis is involved enough with this assumption that I’m wary of the complications that arise by relaxing it.

With this set up in mind, let {X_i} be the random variable that is the number of balls a player pockets on turn {i} (we assume that each of the {X_i} is independent of the others). The number of balls a player pockets in the first {n} turns then, is

\displaystyle S_n=X_1+\cdots+X_n,

and the player will win the game after {N} turns, where

\displaystyle N=\min\{n\mid S_n\geq 8\}

i.e. once you pocket the seven object balls, and then the eight ball. The way to determine the winner of the match is to compare the values of {N} for the player and her opponent, and whoever has the smaller value wins (with equality favoring the player who goes first).

This can also be imagined in the following way: the two players are playing on separate tables, each with eight balls, and counting the number of turns it takes them to clear the table. Whoever does it in fewer turns wins. Underlying this set-up is the assumption of independence between the shots of the two players, which is another unrealistic but unavoidable assumption.

With this set up in mind, here’s the executive summary of how we’ll attempt to answer our question:

  1. Understand the distributions of {X_i, S_n, N} for a single player
  2. Understand the distribution of {X_i, S_n, N} for a two-player team
  3. Use the results of steps 1 and 2 to give an answer

3. One Player

Suppose we have a single player with a fixed probability {p} of making a shot. Then, each {X_i} is distributed geometrically, so

\displaystyle \mathop{\mathbb P}(X_i=k)=p^k(1-p),\quad k=0,1,\cdots.

The next thing to do is look at the distribution of {S_n}. Suppose we had {S_n=k}. This would mean that we had attempted {n+k} shots, of which {n} had failed and {k} had succeeded, and the last shot (the one ending turn {n}) must have been a failure. This argument gives

\displaystyle \mathop{\mathbb P}(S_n=k)=\dbinom{n+k-1}{k}(1-p)^np^k.

Finally, we look at {N}:

\displaystyle \begin{aligned} \mathop{\mathbb P}(N=n)&=\mathop{\mathbb P}(S_{n-1}<8, S_n\geq 8)\\ &=\mathop{\mathbb P}(S_{n-1}<8, X_n>7-S_{n-1})\\ &=\sum_{j=0}^7 \mathop{\mathbb P}(S_{n-1}=j, X_n>7-j)\\ &=\sum_{j=0}^7 \mathop{\mathbb P}(S_{n-1}=j)\mathop{\mathbb P}(X_n>7-j),\\ &=\sum_{j=0}^7\dbinom{n+j-2}{j}(1-p)^{n-1}p^j\sum_{i=8-j}^{\infty} p^i(1-p),\\ &=p^8(1-p)^{n-1}\sum_{j=0}^7\dbinom{n+j-2}{j}. \end{aligned}

This isn’t pretty, but it’s tractable. Shown below are the distributions for {p=0.25, 0.5, 0.75}, to give some sense of what we’re looking at.

[Figure: the distribution of {N} for {p=0.25, 0.5, 0.75}.]

As a sanity check, these have the correct qualitative behavior: higher values of {p} correspond to fewer shots needed to finish.
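
For concreteness, here is a short Python sketch (not in the original post; it only uses the standard library) that tabulates this distribution and its mean:

```python
from math import comb

def single_player_N_pmf(p, n_max=200):
    """P(N = n) for a single player with make-probability p, using the formula above."""
    pmf = {1: p ** 8}                     # clear the table on the very first turn
    for n in range(2, n_max + 1):
        # sum over j of P(S_{n-1} = j) * P(X_n >= 8 - j)
        pmf[n] = sum(comb(n - 2 + j, j) * (1 - p) ** (n - 1) * p ** j * p ** (8 - j)
                     for j in range(8))
    return pmf

for p in (0.25, 0.5, 0.75):
    pmf = single_player_N_pmf(p)
    print(p, sum(n * q for n, q in pmf.items()))   # approximate E[N]; smaller for larger p
```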

4. Two Players

Next, we look at a two-player team. Let their fixed probabilities be {p} and {q}, where the player with probability {p} is the first player, and the player with probability {q} is the second player. Then, on the first player’s {i^{\text{th}}} turn and the second player’s {i^{\text{th}}} turn, the numbers of balls they pocket are given by random variables {X_{2i-1}, Y_{2i}}, where

\displaystyle \mathop{\mathbb P}(X_{2i-1}=k)=p^k(1-p)\quad\text{and}\quad \mathop{\mathbb P}(Y_{2i}=k)=q^k(1-q),\quad k=0,1,\cdots.

(We use two different letters for convenience.)

For the {S_n}, we need to distinguish based on parity: we have

\displaystyle S_{2n-1}=X_1+Y_2+\cdots+Y_{2n-2}+X_{2n-1}\quad\text{and}\quad S_{2n}=X_1+Y_2+\cdots+X_{2n-1}+Y_{2n}.

Computing the distributions of these is a bit trickier this time, but we use a similar method.

Suppose we had {S_{2n-1}=k}. For this to happen, we must have attempted {2n+k-1} shots, of which {2n-1} are failures and {k} are successes. Of the failures, {n} came from the first player, and {n-1} came from the second player. Furthermore, suppose that {i} of the successes came from the first player, and {k-i} came from the second player. We construct a valid sequence of successes and failures by considering the two players separately, and then intercalating the outcomes of their turns.

From the first player, we have {n+i} shots. The last shot must be a failure, and there are {\binom{n+i-1}{n-1}} ways to arrange the remaining ones. Similarly for the second player’s shots, we find that there are {\binom{n+k-i-2}{n-2}} arrangements. Then, for each pair of valid arrangements, we go along the {X_i} till we have a failure; then go along the {Y_i} till we have a failure, and so on, to construct a valid sequence of shots that would give {S_{2n-1}=k}. The probability of any such sequence is {p^iq^{k-i}(1-p)^n(1-q)^{n-1}}.

Using this argument, and allowing {i} to vary, gives

\displaystyle \mathop{\mathbb P}(S_{2n-1}=k)=\sum_{i=0}^k \dbinom{n+i-1}{n-1}\dbinom{n+k-i-2}{n-2}p^iq^{k-i}(1-p)^{n}(1-q)^{n-1}.

Using an analogous argument for the even case, we find

\displaystyle \mathop{\mathbb P}(S_{2n}=k)=\sum_{i=0}^k \dbinom{n+i-1}{n-1}\dbinom{n+k-i-1}{n-1}p^iq^{k-i}(1-p)^{n}(1-q)^{n}.

As another sanity check, looking at {S_2} does in fact give the convolution of {X_1} and {Y_2}, so everything’s good.

Remark. When we plug {p=q} into our formula, we recover the results of the single player section. This gives the identity

\displaystyle \sum_{i=0}^k \dbinom{i+n-1}{n-1}\dbinom{n+k-i-2}{n-2}=\dbinom{2n+k-2}{2n-2}.

Final note before moving on: these formulas fail in the case of {n=1}, because our counting argument breaks down. In that case though, {S_1} is just {X_1}, whose distribution we already know.

Next, we turn to {N}. Distinguishing cases based on parity and repeating the arguments of the previous section gives

\displaystyle \mathop{\mathbb P}(N=2n-1)=\sum_{j=0}^7 \mathop{\mathbb P}(S_{2n-2}=j)p^{8-j}\quad\text{and}\quad \mathop{\mathbb P}(N=2n)=\sum_{j=0}^7 \mathop{\mathbb P}(S_{2n-1}=j)q^{8-j},

and we could plug in the distribution of {S_n} to get a single expression, but that would take too much space.

At any rate, these expressions are no problem for a computer, and some sample distributions are shown below.

[Figure: distributions of {N} for two-player teams with various {(p,q)}.]

These graphs also show the correct qualitative behavior: a team with one player much stronger than the other should be much more likely to finish on turns where the stronger player is up, and taking {p=q=0.5} recovers the distribution from the previous section.
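
To double-check these formulas, and to get at the expectations we’ll need in the next section, a Monte Carlo sketch is handy. This is Python with NumPy; the helper name and the particular probabilities are just for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def expected_turns(make_probs, trials=20_000):
    """Monte Carlo estimate of E[N] when players alternate with the given make-probabilities."""
    total = 0
    for _ in range(trials):
        balls, turn = 0, 0
        while balls < 8:
            p = make_probs[turn % len(make_probs)]
            balls += rng.geometric(1 - p) - 1   # geometric number of makes before the miss
            turn += 1
        total += turn
    return total / trials

print(expected_turns([0.5]))          # single player with p = 0.5
print(expected_turns([0.75, 0.25]))   # team: the stronger player shoots first
```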

5. Two on One Matches

We can now study a two on one game of pool, and see when the single player has the advantage. Consider three players with skills {p,q,r}, where the first two are playing together against the third. Let {N_{p\oplus q}} be the number of turns taken for the first two to clear their balls (we use {\oplus} to make it clear that we are not just adding the probabilities in the subscript) and let {N_r} be the number of turns taken for the third to clear her balls. For the game to be in the single player’s favor, we need {\mathop{\mathbb E}(N_{p\oplus q})\geq \mathop{\mathbb E}(N_r)}.

Unfortunately, those expectations have no closed form, so this is where we start fudging and approximating.

To get a feel for things, we plot the value of {\mathop{\mathbb E}(N_r)} for different values of {r}.

 

[Figure: {\mathop{\mathbb E}(N_r)} plotted against {r}.]

This looks vaguely exponential, so let’s plot {\log \mathop{\mathbb E}(N_r)} against {r}.

[Figure: {\log \mathop{\mathbb E}(N_r)} plotted against {r}.]

The logarithmic plot looks suitably linear, and running OLS gives

\displaystyle \widehat{\log \mathop{\mathbb E}(N_r)}=-4.68r+4.69,

with {R^2=0.95}. Of course, based on the graph, we know that this approximation is much worse for {r<0.1}, so if you’re terrible at pool, you should take this with a larger grain of salt (and maybe not play against two people at once).

It makes sense to try the same thing for {N_{p\oplus q}}. Of course, things are three-dimensional now, so patterns are a bit harder to spot. For that reason, we go straight to plotting {\log \mathop{\mathbb E}(N_{p\oplus q})} for varying {(p,q)}:

[Figure: {\log \mathop{\mathbb E}(N_{p\oplus q})} plotted against {(p,q)}.]

This has the same linear (or planar, I suppose) look, and running OLS again gives

\displaystyle \widehat{\log \mathop{\mathbb E}(N_{p\oplus q})}=-2.24p-1.9q+4.07,

with {R^2=0.88}. Not quite as good, but still a reasonable approximation (with the same caveat of failure at the edges).

Now, if we replace the condition of

\displaystyle \mathop{\mathbb E}(N_r)\leq \mathop{\mathbb E}(N_{p\oplus q})\quad\text{with}\quad \widehat{\log \mathop{\mathbb E}(N_r)}\leq \widehat{\log\mathop{\mathbb E}(N_{p\oplus q})},

we have

\displaystyle -4.68r+4.69\leq -2.24p-1.9q+4.07\implies r\geq 0.5p+0.4q+0.1,

where we’ve rounded stuff off in that final result.

So this computation gives the following answer to our question:

Answer: Take a weighted average of skills that is {50\%} of the first player, {40\%} of the second player, and a correction factor of {10\%} of a perfect player (i.e. one who never misses a shot). If you make a higher percentage of your shots than this, you should expect to beat them. \Box

To wrap up, let’s try an example.

Example. Suppose you are facing two players, one of whom makes {90\%} of her shots and another who makes {60\%} of her shots. Since

\displaystyle 0.5\cdot 0.9+0.4\cdot 0.6+0.1\approx 0.8,

you should expect to beat this pair if you can make {80\%} of your shots.

When Expectations Fail

I’ve seen three examples of the same strange thing happening over the past few months, and I wanted to compile them here. The basic set-up is this: in an optimal stopping problem, we usually have some kind of stochastic process, say on a state space {S}, and for each state {x\in S}, we take {f(x)} to be the payoff from stopping at {x} and {u(x)} to be the expected payoff from continuing from {x}. Then, {v(x)}, the “value” of state {x}, should be {v(x)=\max\{u(x), f(x)\}}. Between stopping and continuing, we pick whichever is better. If they’re the same, it doesn’t matter which one we choose.

This formalism is really just a manifestation of the way I think most people think about the kinds of optimal stopping problems they meet in the real world: am I better off stopping or not? But the weird thing I’ve seen is that this strategy can actually give completely counterintuitive and incorrect results. The first time I saw this, I brushed it off as a weird edge case, but then I saw two more such examples.

This is weirdly unsettling, because expected value is (maybe implicitly) held up as this great tool for thinking about how to play little games like this and make decisions, but it clearly has some serious drawbacks that I hadn’t been fully aware of until seeing these.

1. Three Examples

Example 1. Consider the secretary problem (this is actually what’s called the cardinal payoff variant of the problem), where we see some sequence of {n} applicants for a position, each applicant has an associated value measuring their quality, and our payoff is this quality value. We must decide whether to accept/reject an applicant immediately after seeing them, and cannot call them back.

If the qualities of the applicants are drawn from a uniform distribution, then it can be shown that the best strategy is to go through roughly {\sqrt{n}} candidates, and then pick the next candidate who is better than all of them. But if the qualities are drawn from the negative half of the Cauchy distribution, that is, with PDF

\displaystyle f_{\text{Quality}}(x)=\frac{2}{\pi}\cdot \frac{1}{1+x^2},\quad x\leq 0,

then strange things happen. In particular, we can check that

\displaystyle E(\text{Quality})=\frac{2}{\pi}\int_{-\infty}^0 \frac{x}{1+x^2}\, dx=-\infty,

so the expected quality of any applicant is negative infinity! What this means is that when we see the first applicant, the expected payoff from continuing is negative infinity, so we always accept the first candidate we see. But the payoff from using this strategy is the expected quality of the first candidate, which is still negative infinity!

Example 2. Consider a simple random walk on {{\mathbb Z}^+} starting at some {N>0} with an absorbing boundary i.e. once you reach zero, you stop walking. If we have the payoff function {f(n)=n^2} for stopping at {n}, then

\displaystyle E(f(X_{n+1})\mid X_n=k)=\frac{1}{2}(k+1)^2+\frac{1}{2}(k-1)^2=k^2+1>f(k),

so the expected payoff from continuing is always greater than the payoff from stopping. But the walk is recurrent, so if we continue walking for long enough, we are guaranteed to reach 0 and win nothing.

Example 3. Consider a series of double-or-nothing bets on the results of a flip of a biased coin that comes up heads with probability {p>\frac{1}{2}}. The expected value of taking the bet when you have {N} dollars is

\displaystyle p(2N)+(1-p)0=2pN>N,

so we should always take the bet. But the coin will eventually come up tails with probability 1, and we will lose everything.

(Credits: I saw the first example in a Facebook post by Kevin Dong, the second on a problem set for my stochastic processes class, and the third in a conversation with a friend.)

What’s going on? How did this seemingly foolproof plan of just doing the thing that maximized our expectation go so awry?

2. The Secretary Problem

For the secretary problem, the apparent issue is that having infinite expectations just gives meaningless results. It’s not too bad when you’re trying to compare an infinite and finite expectation, because then you can at least say something like “we should do anything to avoid something with infinitely negative expected value,” but the example here is trying to compare two infinite expectations (the expectation from stopping after one candidate and the expectation from stopping after two candidates), and it all falls apart. The only consolation is that this expected value will never be attained – the first candidate can only be finitely bad, after all.

3. The Random Walk

For the random walk, I think a small computation is helpful. Let’s say we adopt a strategy where we set two markers, say at {a<N<b}, and stop whenever we reach them. To compute the expected payoff of such a strategy, let {X_1,\cdots, X_n} be i.i.d. random variables taking the values {\pm 1} with probability {1/2} each. Then our simple random walk is given by the recurrence {S_n=S_{n-1}+X_n} with {S_0=N}.

We can check that if {\mathcal{F}_n} is the {\sigma}-algebra generated by {X_1,\cdots, X_n}, then {S_n} is measurable with respect to {\mathcal{F}_n}, and furthermore,

\displaystyle E(S_{n+1}\mid \mathcal{F}_n)=E(S_{n}+X_{n+1}\mid \mathcal{F}_n)=E(S_n\mid \mathcal{F}_n)+E(X_{n+1}\mid \mathcal{F}_n)=S_n+E(X_{n+1})=S_n,

so {S_n} is a martingale with respect to {\mathcal{F}_n}. Now, if we let

\displaystyle T=\min\{n\mid S_n=a\text{ or }S_n=b\},

the optional stopping theorem gives us {E(S_T)=E(S_0)=N}. Then,

\displaystyle P(S_T=a)a+(1-P(S_T=a))b=N\implies P(S_T=a)=\frac{N-b}{a-b}, \quad P(S_T=b)=\frac{a-N}{a-b}.

With this in mind, our expected payoff is

\displaystyle E(a,b)=a^2\frac{N-b}{a-b}+b^2\frac{a-N}{a-b}=\frac{(a^2-b^2)N+ab(b-a)}{a-b}=N(a+b)-ab.

The issue is a little clearer now: {E(a,b)} is linear in both {a} and {b}, so its extrema are at the endpoints – that is, we can always do better by making {a} smaller or {b} larger. This corresponds exactly to the “we’re always better off taking another step” mentality that the one step expectation calculation gave, except that now we see that this problem isn’t just an artifact of only looking one step ahead.

I think what’s going wrong here is that the expectation isn’t accounting for the opportunity cost involved. When we compute {E(f(X_{n+1})\mid X_n=k)}, the fact that state 0 has zero payoff doesn’t sufficiently capture the fact that we’ll also lose the opportunity to continue playing. This can also be thought of as a local/global type issue, where the thing that’s ruining everything is the recurrence of the walk, which is a global property, but our expected value computations are all local.

4. The Coin Game

It makes sense to do a computation similar to the previous one, but the computation is much easier here: if we decide that we’re going to stop after {k} flips, the expected payoff is

\displaystyle p^k\cdot 2^k+(1-p^k)\cdot 0=(2p)^k.

This has the same qualitative behavior as the payoff function from the random walk, where it increases as we play for longer, and we can explain it in the same way, by saying that the expected value doesn’t realize that the game ends when we flip tails. But this behavior is arising from a different kind of payoff function, so there’s an underlying structural difference between the two. I’d be curious to know if there are other examples of games that have such payoff functions, or if these are in some sense the only two ways this can happen.
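
To see this concretely, here is a small Monte Carlo sketch in Python (the parameters are arbitrary): the sample average tracks the exploding expectation {(2p)^k}, even though almost every individual run ends ruined.

```python
import numpy as np

rng = np.random.default_rng(0)

# Double-or-nothing with p = 0.9, never stopping for 30 flips.
p, flips, trials = 0.9, 30, 100_000
wealth = np.ones(trials)
for _ in range(flips):
    heads = rng.random(trials) < p
    wealth = np.where(heads, 2 * wealth, 0.0)

print((wealth == 0).mean())              # about 1 - 0.9**30, i.e. roughly 96% of runs end with nothing
print(wealth.mean(), (2 * p) ** flips)   # both near 4.5e7: a few huge survivors carry the mean
```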

Avoiding Contour Integration

One time I boasted, “I can do by other methods any integral anybody else needs contour integration to do.” So Paul puts up this tremendous damn integral he had obtained by starting out with a complex function that he knew the answer to, taking out the real part of it and leaving only the complex part. He had unwrapped it so it was only possible by contour integration! He was always deflating me like that. He was a very smart fellow.

— Richard Feynman

Partially inspired by this excerpt, I’ve had several conversations in the past few months about doing hard integrals without the residue theorem. I think there’s a lot of appeal to avoiding such a high-powered tool, so I both like finding solutions that don’t use it and believe that it’s usually possible to do so. I got stuck though, on

\displaystyle \int_0^{\infty} \frac{1}{1+z^n}\, dz,

and because this integral is basically equivalent to the reflection formula for the {\Gamma} function, I thought it would absolutely require some kind of complex analysis to tackle.

I was wrong. The following solution is mostly due to Sameer Kailasa, and evaluates the integral without any complex analysis. It does however, require some pretty technical manipulations, and so to make life easier, I’m going to ignore all issues of convergence/interchanging limits in this post. The fact that the answer works out correctly means that these issues are actually just technical ones, that can be resolved by appropriately applying dominated convergence/Fubini.

We start by evaluating a particular sum. I know of at least one other way to do this, which involves showing that the RHS and LHS share four properties, and then using these to argue that they agree everywhere, but I like this way better, because the trick used here is more widely applicable than the trick used in the other argument (in the interest of full disclosure, this means that I know of one other instance where this trick is useful, and no other instances where the other trick is useful).

Proposition 1. We have

\displaystyle \sum_{k=1}^{\infty} \frac{1}{x^2-k^2} =\frac{\pi x\cot \pi x -1}{2x^2}.

Proof: The Weierstrass factorization theorem allows us to write

\displaystyle \frac{\sin x}{x} =\left(1-\frac{x}{\pi}\right)\left(1+\frac{x}{\pi}\right)\left(1-\frac{x}{2\pi}\right)\left(1+\frac{x}{2\pi}\right)\cdots =\prod_{k=1}^{\infty} \left(1-\frac{x^2}{k^2\pi^2}\right).

Changing variables by {x\mapsto \pi x} gives

\displaystyle \frac{\sin \pi x}{\pi x}=\prod_{k=1}^{\infty} \left(1-\frac{x^2}{k^2}\right)\implies \ln(\sin \pi x)-\ln \pi x=\sum_{k=1}^{\infty} \ln\left(1-\frac{x^2}{k^2}\right),

and we differentiate to obtain

\displaystyle \pi \cot \pi x -\frac{1}{x} =\sum_{k=1}^{\infty} \frac{2x}{x^2-k^2}\implies \sum_{k=1}^{\infty} \frac{1}{x^2-k^2} =\frac{\pi x\cot \pi x -1}{2x^2}.

\Box

Remark. The factorization theorem is technically a complex analytic tool, but I’m willing to use it here because a. it’s really only a technical crutch, not the crux of the argument and b. there’s no sense (to the best of my knowledge) in which it’s equivalent to the residue theorem/Cauchy’s theorem/anything that does integrals. Also, it can be avoided entirely by using the other argument I mentioned above.

Corollary 2. We have

\displaystyle \sum_{k=1}^{\infty} \frac{(-1)^k}{x^2-k^2}=\frac{\pi x \csc \pi x-1}{2x^2}.

Proof: We write

\displaystyle \sum_{k=1}^{\infty} \frac{(-1)^k}{x^2-k^2}=\sum_{k\text{ even}}\frac{1}{x^2-k^2}-\sum_{k\text{ odd}} \frac{1}{x^2-k^2}=2\sum_{k\text{ even}} \frac{1}{x^2-k^2}-\sum_{k=1}^{\infty}\frac{1}{x^2-k^2}.

The second sum was evaluated in Proposition 1; the first is

\displaystyle \sum_{k\text{ even}}\frac{1}{x^2-k^2}=\sum_{k=1}^{\infty}\frac{1}{x^2-4k^2}=\frac{1}{4}\sum_{k=1}^{\infty} \frac{1}{(x/2)^2-k^2},

and now we can apply Proposition 1 to this sum as well. The result is

\displaystyle 2\cdot \frac{1}{4}\cdot \frac{\pi\frac{x}{2}\cot\frac{\pi x}{2}-1}{2\cdot \frac{x^2}{4}}-\left(\frac{\pi x\cot \pi x-1}{2x^2}\right)=\frac{\pi x\left(\cot \frac{\pi x}{2}-\cot \pi x\right)-1}{2x^2}.

Noting that

\displaystyle \cot \frac{x}{2}-\cot x=\frac{1+\cos x}{\sin x}-\frac{\cos x}{\sin x}=\csc x,

allows us to finish. \Box

Now we’re ready to tackle the integral.

Proposition 3. We have

\displaystyle \int_0^{\infty} \frac{1}{1+z^n}\, dz=\frac{\pi}{n}\csc \frac{\pi}{n}.

Proof: We start by writing

\displaystyle \int_0^{\infty} \frac{1}{1+z^n}\, dz=\int_0^{1} \frac{1}{1+z^n}\, dz+\int_1^{\infty} \frac{1}{1+z^n}\, dz,

and substituting {u=1/z} in the second integral to give

\displaystyle \int_0^{\infty} \frac{1}{1+z^n}\, dz=\int_0^{1} \frac{1}{1+z^n}\, dz+\int_0^{1} \frac{z^{n-2}}{1+z^n}\, dz=\int_0^1\frac{1+z^{n-2}}{1+z^n}\, dz.

Since we’re only dealing with {0<z<1} now, we expand the denominator as a geometric series, so

\displaystyle \int_0^1\frac{1+z^{n-2}}{1+z^n}\, dz=\int_0^1 (1+z^{n-2})\sum_{k=0}^{\infty} (-1)^kz^{nk}\,dz =\sum_{k=0}^{\infty} (-1)^k\int_0^1 z^{nk}+z^{n(k+1)-2}\,dz,

and evaluating the integral gives

\displaystyle \sum_{k=0}^{\infty} (-1)^k\left(\frac{1}{nk+1}+\frac{1}{n(k+1)-1}\right)=\sum_{k=0}^{\infty} \frac{(-1)^k}{nk+1}+\sum_{k=0}^{\infty} \frac{(-1)^k}{n(k+1)-1}.

Now we tweak things a bit to get

\displaystyle 1+\sum_{k=1}^{\infty}\frac{(-1)^k}{nk+1}-\sum_{k=1}^{\infty}\frac{(-1)^k}{nk-1}=1-2\sum_{k=1}^{\infty}\frac{(-1)^k}{n^2k^2-1}=1+\frac{2}{n^2}\sum_{k=1}^{\infty} \frac{(-1)^k}{\frac{1}{n^2}-k^2}.

Plugging in the result of Corollary 2 with {x=1/n} now gives

\displaystyle 1+\frac{2}{n^2}\cdot \frac{\pi \cdot \frac{1}{n}\csc\frac{\pi}{n}-1}{2\cdot \frac{1}{n^2}}=\frac{\pi}{n}\csc \frac{\pi}{n},

as desired.

\Box
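
As a numerical sanity check (a Python sketch using SciPy’s quadrature, not part of the argument), the identity checks out for a few values of {n}:

```python
from math import pi, sin
from scipy.integrate import quad

# Compare the numerical integral against (pi/n) * csc(pi/n).
for n in (2, 3, 5, 10):
    value, _ = quad(lambda z, n=n: 1.0 / (1.0 + z ** n), 0, float("inf"))
    print(n, value, (pi / n) / sin(pi / n))
```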

Proving Fubini’s Theorem

We skipped this in my analysis class, so I’m going to prove Fubini’s theorem in this post. I’m following the proof from Stein and Shakarchi’s Real Analysis with some restructuring. The basic result is that you can compute integrals in {{\mathbb R}^{m+n}} by first integrating in {{\mathbb R}^m}, then in {{\mathbb R}^n}, and the order in which you do these two integrals doesn’t matter.

To be precise, we can think of points in {{\mathbb R}^{m+n}} as {(x,y)}, where {x\in {\mathbb R}^m} and {y\in {\mathbb R}^n}. Then, for each {y\in{\mathbb R}^n}, we can define the “slices” of a function {f} and a set {E} corresponding to {y} as

\displaystyle f^y: {\mathbb R}^m\rightarrow {\mathbb R},\quad f^y(x)=f(x,y)\quad\text{and}\quad E^y=\{x\in {\mathbb R}^m\mid (x,y)\in E\}.

(Of course, we could define slices corresponding to {x\in {\mathbb R}^m} as well, but it’s sufficient to do everything for {y}-slices, since the proofs for {x}-slices are completely analogous.)

Theorem 1 (Fubini’s Theorem). Let {\mathcal{F}} be the set of functions {f} on {{\mathbb R}^{m+n}} satisfying:

  1. The slice {f^y} is integrable on {{\mathbb R}^m} for almost every {y\in {\mathbb R}^n},
  2. The function {y\mapsto\displaystyle{\int_{{\mathbb R}^m} f^y(x)\, dx}} is integrable on {{\mathbb R}^n},
  3. {\displaystyle{\int_{{\mathbb R}^n}\left(\int_{{\mathbb R}^m} f(x,y)\, dx\right)\, dy=\int_{{\mathbb R}^{m+n}} f(x,y)}}.

Then {L^1({\mathbb R}^{m+n})\subset \mathcal{F}}, i.e. every integrable function is in {\mathcal{F}}.
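Before the proof, here’s what the theorem buys you in practice: for an integrable function, the double integral and the iterated integral agree. A minimal numerical illustration (assuming SciPy; the test function is an arbitrary choice and is not used anywhere in the proof):

```python
import math
from scipy.integrate import quad, dblquad  # assumes SciPy is available

def f(x, y):
    # an arbitrary integrable test function on R^2
    return math.exp(-x * x - y * y) * (1 + x * y)

# double integral over R^2 (dblquad expects the inner variable first)
double, _ = dblquad(lambda y, x: f(x, y),
                    -math.inf, math.inf,
                    lambda x: -math.inf, lambda x: math.inf)

# iterated integral: integrate in x for each fixed y, then integrate in y
def inner(y):
    value, _ = quad(lambda x: f(x, y), -math.inf, math.inf)
    return value

iterated, _ = quad(inner, -math.inf, math.inf)
print(double, iterated)  # both should be close to pi
```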

The idea of the proof is to show that {\mathcal{F}} contains the simple functions, and then approximate {L^1} functions by simple functions. This approach requires two facts: that {\mathcal{F}} is closed under linear combinations, and that {\mathcal{F}} is closed under monotone limits. We start with these two proofs.

Lemma 2. A finite linear combination of functions in {\mathcal{F}} is in {\mathcal{F}}.

Proof: Suppose we have {f_1,\cdots, f_N\in \mathcal{F}}. Then for each {k}, {f_k^y} is integrable on {{\mathbb R}^m} except for {y\in A_k}, where {\mu(A_k)=0}. If we let {A=\cup A_k}, then {\mu(A)=0}, and if {g=\sum a_k f_k}, then {g^y} is integrable except possibly when {y\in A}. It follows that {g} satisfies 1 and 2, and then the linearity of the integral gives 3, so {g\in \mathcal{F}}. \Box

Lemma 3. If functions in {\mathcal{F}} converge monotonically to an integrable function {f}, then {f\in\mathcal{F}}.

Proof: Suppose we have {\{f_n\} \subset \mathcal{F}} converging monotonically to {f}. We can make two simplifying assumptions: that the convergence is monotonically increasing, and that the {f_n} are non-negative. If either of these is not the case, then the following argument can be applied to {-f_n} (for a decreasing sequence) or to {f_n-f_1} (to make the sequence non-negative) to give the result.

The monotone convergence theorem tells us that

\displaystyle \lim_{n\rightarrow \infty} \int_{{\mathbb R}^{m+n}} f_n(x,y)\, dx\, dy=\int_{{\mathbb R}^{m+n}} f(x,y)\, dx\, dy.

We know that for every {k}, {f_k^y} is integrable on {{\mathbb R}^m} except for {y\in A_k}, where {\mu(A_k)=0}. As before, letting {A=\cup A_k} gives {\mu(A)=0}, so {f_k^y} is integrable on {{\mathbb R}^m} for all {k} whenever {y\notin A}. Using the monotone convergence theorem again gives

\displaystyle \underbrace{\int_{{\mathbb R}^m} f_k^y(x)\, dx}_{g_k(y)} \nearrow \underbrace{\int_{{\mathbb R}^{m}} f^y(x)\, dx}_{g(y)},

as {k\rightarrow \infty}.

A third application of monotone convergence gives that

\displaystyle \int_{{\mathbb R}^n} g_k(y)\, dy\xrightarrow{k\rightarrow\infty} \int_{{\mathbb R}^n} g(y)\, dy.

Now, because {f_k\in \mathcal{F}}, we know that

\displaystyle \int_{{\mathbb R}^n} g_k(y)\, dy=\int_{{\mathbb R}^{m+n}} f_k(x,y)\, dx\, dy,

and this together with our previous work gives that

\displaystyle \int_{{\mathbb R}^n} g(y)\, dy=\int_{{\mathbb R}^{m+n}} f(x,y)\, dx\, dy.

The RHS is finite because {f} is integrable, so {g} is integrable as well. Thus {g(y)<\infty} almost everywhere, and so {f^y} is integrable for almost every {y}, and

\displaystyle \int_{{\mathbb R}^n}\left(\int_{{\mathbb R}^m} f(x,y)\, dx\right)\, dy=\int_{{\mathbb R}^{m+n}} f(x,y).

It follows that {f\in \mathcal{F}}. \Box
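Since the monotone convergence theorem does all the work here, a quick numerical illustration of the pattern (truncations increasing to the function, integrals increasing to the integral) may help; again this assumes SciPy and is purely for intuition.

```python
import math
from scipy.integrate import quad  # assumes SciPy is available

# The truncations f_n = f * 1_{[-n, n]} increase pointwise to f(x) = 1/(1+x^2),
# and their integrals 2*arctan(n) increase to the full integral, which is pi.
def f(x):
    return 1.0 / (1.0 + x * x)

for n in (1, 5, 50, 500):
    value, _ = quad(f, -n, n)
    print(n, value, 2 * math.atan(n))  # climbs toward math.pi
```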

Next, we show that {\mathcal{F}} contains the simple functions. The idea here is to start with the characteristic functions of {G_{\delta}} sets and null sets, and then use our previous work to get to simple functions.

Lemma 4. If {f} is a simple function, then {f\in\mathcal{F}}.

Proof: We build this up in steps.

  1. If {E} is a {G_{\delta}} set of finite measure, then {\chi_{E}\in \mathcal{F}}.
    1. If {E} is a bounded open cube, then {\chi_E\in\mathcal{F}}.

      Let {E=Q_1\times Q_2} be such an open cube, where {Q_1, Q_2} are cubes in {{\mathbb R}^m} and {{\mathbb R}^n} respectively. It’s clear that {\chi_E} satisfies 1 and 2, so we need to do the computation to verify 3.

      For each {y}, the function {\chi_E(x,y)} is measurable in {x} and integrable with

      \displaystyle g(y)=\int_{{\mathbb R}^m} \chi_E(x,y)\, dx=\begin{cases} |Q_1|\quad \text{if }y\in Q_2,\\ 0\quad \text{otherwise}\end{cases}.

      Thus {g=|Q_1|\chi_{Q_2}} is also measurable and integrable, and

      \displaystyle \int_{{\mathbb R}^n} g(y)\, dy= |Q_1||Q_2|.

      But of course,

      \displaystyle \int_{{\mathbb R}^{m+n}} \chi_E(x,y)\, dx\, dy=|E|=|Q_1||Q_2|,

      so {\chi_E\in \mathcal{F}}.

    2. If {E} is a subset of the boundary of a closed cube, then {\chi_E\in\mathcal{F}}.

      It’s still easy to check 1 and 2, so we verify 3. Because the boundary of the cube has measure zero in {{\mathbb R}^{m+n}}, we always have

      \displaystyle \int_{{\mathbb R}^{m+n}} \chi_E(x,y)\, dx\, dy=0.

      On the other hand, for almost every {y}, the slice {E^y} has measure zero in {{\mathbb R}^m}, and so the iterated integral is also zero. It follows that {\chi_E\in \mathcal{F}}.

    3. If {E} is a finite union of closed cubes with disjoint interiors, then {\chi_E\in\mathcal{F}}.

      The idea is to write {\chi_E=\sum \chi_{A_k}}, where each {A_k} is either the interior of one of the cubes or a subset of the boundary of a closed cube. From 1.1 and 1.2, each {\chi_{A_k}\in\mathcal{F}}, and now Lemma 2 gives that {\chi_E\in\mathcal{F}}.

    4. If {E} is open and of finite measure, then {\chi_E\in\mathcal{F}}.

      Such an open set {E} can be written as a countable union of almost disjoint (i.e. their pairwise intersections are null) closed cubes {Q_j}, so we write

      \displaystyle E=\bigcup_{j=1}^{\infty} Q_j\quad\text{and}\quad f_k=\sum_{j=1}^k \chi_{Q_j}.

      Then, {f_k\nearrow \chi_E}, which is integrable because {E} has finite measure, and Lemma 3 allows us to conclude that {\chi_E \in \mathcal{F}}.

    5. If {E} is {G_{\delta}} and of finite measure, then {\chi_E\in\mathcal{F}}.

      Because {E} is {G_{\delta}}, there are open sets {G_n} such that

      \displaystyle E=\bigcap_{n=1}^{\infty} G_n

      and because {E} has finite measure, we can choose an open set {G\supset E} that also has finite measure (this is what makes the decreasing limit below legitimate). Then, let

      \displaystyle \tilde{G}_k=G\cap \bigcap_{n=1}^{k} G_n,

      and we have a decreasing sequence of open sets

      \displaystyle \tilde{G}_1\supset \tilde{G}_2\supset\cdots,\quad\text{and}\quad E=\bigcap_{k=1}^{\infty} \tilde{G}_k.

      Now, {\chi_{\tilde{G}_k}\searrow \chi_E,} and by 1.4, {\chi_{\tilde{G}_k}\in\mathcal{F}}, so Lemma 3 shows that {\chi_{E}\in \mathcal{F}}.

  2. If {\mu(E)=0}, then {\chi_E\in\mathcal{F}}.

    Because a null set {E} is measurable, we can find a {G_{\delta}} set {G\supset E} such that {\mu(G)=0}. From Step 1, we know that {\chi_G\in \mathcal{F}}, so

    \displaystyle \int_{{\mathbb R}^n}\left(\int_{{\mathbb R}^m} \chi_G(x,y)\, dx\right)\, dy=\int_{{\mathbb R}^{m+n}} \chi_G=0\implies \int_{{\mathbb R}^m} \chi_G(x,y)\, dx=0,

    for almost every {y}. This means that {\mu(G^y)=0} for almost every {y}, and since {E^y\subset G^y}, we have {\mu(E^y)=0} for almost every {y}, and so

    \displaystyle \int_{{\mathbb R}^n}\left(\int_{{\mathbb R}^m} \chi_E(x,y)\, dx\right)\, dy =0=\int_{{\mathbb R}^{m+n}}\chi_E\implies \chi_E\in\mathcal{F}.

  3. If {E} is measurable with finite measure, then {\chi_E\in\mathcal{F}}.

    Recall that we can find a {G_{\delta}} set {G\supset E} such that {\mu(G\setminus E)=0}. Writing {\chi_E=\chi_G-\chi_{G\setminus E}}, Steps 1 and 2 together with Lemma 2 give that {\chi_E\in\mathcal{F}}.

Now, any simple function {f} is a linear combination of characteristic functions of measurable sets with finite measure, so Lemma 2 shows that {f\in\mathcal{F}}. \Box

With these three lemmas in hand, we can prove Fubini’s theorem.

Proof of Fubini’s Theorem: Suppose {f} is an integrable function. We can write {f} as the difference of its positive and negative parts, so by Lemma 2 it is sufficient to consider the case where {f} is non-negative. Because {f} is non-negative and measurable, there are simple functions {f_k} that increase monotonically to {f}, and {f} itself is integrable. By Lemma 4, each {f_k\in\mathcal{F}}, and now Lemma 3 allows us to conclude that {f\in\mathcal{F}}. \Box
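For intuition about the last step, here’s the standard dyadic construction of simple functions increasing to a non-negative {f}, namely {f_k=\min(\lfloor 2^k f\rfloor/2^k,\,k)}, sketched numerically in Python (illustration only; not part of the proof).

```python
import numpy as np

# The k-th dyadic truncation f_k = min(floor(2^k * f) / 2^k, k).
# Each f_k takes only finitely many values and increases pointwise to f >= 0.
def dyadic_truncation(values, k):
    return np.minimum(np.floor((2.0 ** k) * values) / (2.0 ** k), k)

x = np.linspace(0.0, 1.0, 5)
values = np.exp(x)  # an arbitrary non-negative example function
for k in (1, 3, 6, 10):
    print(k, dyadic_truncation(values, k))  # creeps up toward exp(x)
```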