Q&A with the NYT Crossword Archive

I’ve been doing a lot of crosswords lately, so I thought I’d spend some time playing around with old crosswords to see if there was anything interesting to say about them. This whole post is structured as a series of questions I or other people asked, which I then answered using data from this repository. If there’s anything not covered here you’d like to know about, feel free to get in touch with me!

Just for reference, there are 1,229,405 clue/answer pairs, taken from 14,524 puzzles. After throwing out duplicates, there are 140,628 distinct answers and 696,050 distinct clues.

1. The Words

Question. What are the most common words? 
It turns out that these are almost exclusively short, vowel-rich words, with the top 5 being:

Word Occurrences
AREA 894
ERA 880
ERIE 818
ERE 750
ALOE 723

These aren’t especially interesting, so I then filtered by length, i.e., found the most common word of each length. They’re a bit unwieldy to format here, but you can see them all here. The smaller ones are still boring, but the longer ones are cute, and the sudden uptick in frequencies for the final block is especially helpful: there just aren’t that many very long words appearing in crosswords, it seems.

Question. What is the longest word? 
This turns out to be a bit tricky to answer: doing the obvious sorting gives the answer of:

“TENNINEEIGHTSEVENSIXFIVEFOURTHREETWOONEITSTWOOOTWO,”

which is from the 12/30/2001 puzzle, which I couldn’t find a copy of. The second longest is the theme answer from this puzzle, which makes it clear that the previous one was probably a rebus too.

If you don’t count a rebus answer (which is reasonable), then there’s a whole slew of 23-letter words coming from a time when the Sunday puzzle was 23 by 23. If you don’t want to count those either, then you’re left with the pile of 21-letter words that span modern Sunday grids.

Question. What word has the most distinct clues?
The answers here turn out to be, for the most part, the same as the most common words:

Word Distinct Clues
ERA 462
ERIE 439
ONE 432
ARIA 424
ORE 405

Just to take a peek under the hood, all of the distinct clues for ERA can be seen here, and they seem to fall into three categories:

  • ERA like the time period
  • ERA like the baseball statistic
  • ERA like the Equal Rights Amendment

So, on a spiritual level, there are only 3 clues for ERA. On the other hand, ONE actually has many meaningfully different clues, all of which are listed here. I wish there were some way to capture this sense of “spiritually different clues” and then find the answer with the most “spiritually diverse” clues, but I can’t think of a good algorithmic way to do it.

Question. What words have very few distinct clues?
I arbitrarily chose to look at words that are among the 10,000 most common words but have fewer than 5 distinct clues. The 5 most common of these (although there are only 9 in total) are:

Word Occurrences Clues
ORAD 60 4
AVISO 38 4
BRAC 32 2
HADON 31 4
OEN 29 4

The occurrences column basically shows that for a word to have fewer than 5 distinct clues, it really can’t appear all that often. If it does, it ends up having at least technically diverse clues, like those for ERA.

Question. How has the popularity of words changed over time?
I plotted this for 10 common words – there don’t seem to be any appreciable trends.

[Figure: frequency of 10 common words over time]

Question. How does the distribution of letters in crossword answers compare to in normal English?
One graph shows it all:
[Figure: difference in letter frequencies between crossword answers and normal English]

Some of these make sense: A and E are over-represented because it’s easy to fill in grids with them. Others make no sense at all: why is H so under-represented?
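
Out of curiosity about how a comparison like this comes together, here is a minimal sketch of the computation. It assumes (hypothetically) that the clue/answer pairs live in a pandas DataFrame with an answer column, and that english_freqs is a dictionary of standard English letter frequencies summing to 1; neither of those names comes from the repository itself.

    from collections import Counter
    import pandas as pd

    # Assumes a DataFrame with one row per clue/answer pair and an "answer" column,
    # plus a dict like {"E": 0.127, "T": 0.091, ...} of English letter frequencies.
    def letter_freq_diffs(df: pd.DataFrame, english_freqs: dict) -> dict:
        """Crossword letter frequency minus English letter frequency, per letter."""
        counts = Counter()
        for answer in df["answer"]:
            counts.update(str(answer).upper())
        total = sum(counts.values())
        return {letter: counts[letter] / total - english_freqs.get(letter, 0.0)
                for letter in sorted(counts)}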

Question. What are the hardest words?
Well, first you have to come up with some way to measure difficulty: I assigned values of 1, 2, 3, 4, 5, 6, 4 to the days of the week in order, keeping with the conventional wisdom that Sunday is about as hard as Thursday. Here are the 5 words from the 10,000 most common words that had the highest average value:

Word Average Difficulty
ONLEAVE 4.97
ALACARTE 4.93
OPALINE 4.85
ETAMINE 4.79
ONEONEONE 4.78
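
For concreteness, here is a rough sketch of how such an average-difficulty score could be computed, again assuming a hypothetical DataFrame with answer and weekday columns (weekday 0 = Monday); the column names are illustrative, not part of the actual data set.

    import pandas as pd

    # Monday through Saturday get difficulties 1-6; Sunday is scored 4, like a Thursday.
    DAY_DIFFICULTY = {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 4}

    def hardest_words(df: pd.DataFrame, n_common: int = 10000, top: int = 5) -> pd.Series:
        """Average difficulty of each answer, restricted to the most common answers."""
        scored = df.assign(difficulty=df["weekday"].map(DAY_DIFFICULTY))
        stats = scored.groupby("answer")["difficulty"].agg(["mean", "count"])
        common = stats.nlargest(n_common, "count")       # the 10,000 most common answers
        return common["mean"].sort_values(ascending=False).head(top)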

2. The Clues

Question. What are the most common clues?
No real pattern to these:

Clue Occurrences
Concerning 317
Jai ___ 256
Greek letter 250
Chemical suffix 225
Gaelic 216

It’s interesting to look at the answers these clues point to: “Concerning” has a handful, “Jai ___” has only one (ALAI), “Greek letter” and “Chemical suffix” both have a decent mix again, and “Gaelic” is only ever a clue for ERSE.

Question. What are clues that have many distinct answers? 
I’m not going to give a list of the top results here, because many of them are things like “See 17-Across” or “Theme of this puzzle,” which are not very interesting. Two results that are in the right vein, though, are “Split” and “Nonsense”, which have 47 and 40 possible answers respectively, listed out here and here.

Question. What are clues that have only one answer? 
There are tons of these. A small sample:

Clue Occurrences
___-majesté 143
___ avis 132
Dies ___ 132
Sri ___ 102
__-Magnon 98

Note that I’m omitting “Jai ___” and “Gaelic” for variety’s sake. As suggested by this sample, nearly all of the one-answer clues are fill-in-the-blanks, which makes sense.

3. The Days

Question. How does the length of words vary over the week?
Not much, as it turns out:

[Figure: average answer length by day of the week]

Question. How does the frequency of puns vary over the week?
A lot, as it turns out:

[Figure: frequency of puns by day of the week]

Question. How does the density of vowels vary over the week? 
And we’re back to not much:

[Figure: vowel density by day of the week]

So it seems like puns best capture the difficulty gradient of the week.

4. The Years

Question. How does the length of words vary over the years?
In strange and inconsistent ways:

[Figure: average answer length by year]

One fact that may be relevant: Will Shortz began editing the puzzle in 1993. I’m not sure if this explains the jump, but it’s certainly a compelling story.

Question. How does the frequency of puns vary over the years?
People love them more every year:

[Figure: frequency of puns by year]

It’s interesting to note that there’s still a spike in the mid-90s, but this spike comes 2-3 years after Shortz became editor.

Question. How does the density of vowels vary over the years? 
Basically the opposite of the puns:

[Figure: vowel density by year]

Again, the huge drop occurs when Shortz takes over.

Memory in Speech Processing

This very short post is just to jot down an epiphany I had about a connection between two different approaches used in speech processing.

One popular kind of feature extracted from raw speech data is the MFCC (mel-frequency cepstral coefficient). I don’t want to dwell too much on what exactly this is, but it basically involves taking two Fourier transforms (so it’s the spectrum of a spectrum), which works well, but is completely mystifying. The first Fourier transform makes sense, since sound waves are a periodic phenomenon, but the second one makes much less sense.

Another class of approaches growing in popularity is recurrent neural networks, which are basically networks that have some notion of “memory” built into them. This means that inputs seen at previous times can be used when making a prediction for the present input, although the details vary a lot depending on the particular implementation.

The punchline is that, on some level, these are doing similar things. Or rather, the MFCC is also relying on the idea that data from previous time points are influencing the current data point.

To see this, suppose our data are {x_1, x_2,\cdots} and that they follow a model of the form

\displaystyle x_t=s_t+As_{t-D},

for some other stationary series {s_t} and constants {A, D}. The recurrent neural network would try to learn that {x_{t-D}} carries information about {x_t} (since it is designed to have access to {x_{t-D}}). What about the MFCC?

It’s helpful to directly compute the spectrum, starting with the auto-covariance. If {\gamma_s(h)=\text{cov}(s_{t+h}, s_t)} is the auto-covariance of {s_t}, then the auto-covariance of {x_t} is

\displaystyle \gamma_x(h)=\text{cov}(x_{t+h}, x_t)=\text{cov}(s_{t+h}+As_{t-D+h}, s_t+As_{t-D})

which, because the covariance is bilinear, becomes

\displaystyle (1+A^2)\gamma_s(h)+A\gamma_s(h+D)+A\gamma_s(h-D).

Then the theoretical spectrum is

\displaystyle f_x(\omega)=\sum_{h=-\infty}^{\infty} \gamma_x(h)e^{-2\pi i \omega h},

which because of our earlier computation expands to

\displaystyle \sum_{h=-\infty}^{\infty}((1+A^2)\gamma_s(h)+A\gamma_s(h+D)+A\gamma_s(h-D)) e^{-2\pi i \omega h}.

This can be rewritten as

\displaystyle (1+A^2)f_s(\omega)+Ae^{2\pi i \omega D}\sum_{h=-\infty}^{\infty} \gamma_s(h+D)e^{-2\pi i \omega (h+D)}+Ae^{-2\pi i \omega D}\sum_{h=-\infty}^{\infty} \gamma_s(h-D)e^{-2\pi i \omega (h-D)},

and then simplified to

\displaystyle (1+A^2)f_s(\omega)+Af_s(\omega)(e^{2\pi i \omega D}+e^{-2\pi i \omega D})=(1+A^2+2A\cos (2\pi \omega D))f_s(\omega).

So the spectrum actually has a periodic component, and taking a second Fourier transform means sussing out this periodicity.
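
As a sanity check on this calculation, here is a small simulation sketch (with made-up values of {A} and {D}): it builds {x_t=s_t+As_{t-D}} from white noise and computes the periodogram, whose {1+A^2+2A\cos(2\pi \omega D)} ripple shows up as a peak near lag {D} after the second transform.

    import numpy as np

    rng = np.random.default_rng(0)
    A, D, n = 0.8, 40, 2**14            # arbitrary echo strength, delay, and sample count
    s = rng.standard_normal(n + D)      # the stationary "source" series: white noise here
    x = s[D:] + A * s[:-D]              # x_t = s_t + A * s_{t-D}

    # First transform: the periodogram. For white-noise s its expectation is proportional
    # to 1 + A^2 + 2*A*cos(2*pi*omega*D), i.e. it ripples in omega with period 1/D.
    periodogram = np.abs(np.fft.rfft(x)) ** 2 / n

    # Second transform: the (real) cepstrum. The ripple becomes a spike near "quefrency" D,
    # which is how the dependence on the point D steps back shows up.
    cepstrum = np.fft.irfft(np.log(periodogram + 1e-12))
    print("largest cepstral peak (excluding lag 0) at lag:",
          1 + int(np.argmax(np.abs(cepstrum[1 : n // 2]))))   # should land near D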

This means, then, that MFCCs and recurrent neural networks are actually making similar assumptions about the data, or at least trying to exploit the same kind of structure. I like this similarity, especially because it suggests that thinking about the effect of previously seen data points is the “right” way to think about speech data.

Multiple Testing Procedures

Here’s something problematic: let’s say you run a hypothesis test at a significance level of {\alpha=0.05}. Then, assuming you ran the test correctly and met its assumptions, the probability of a type I error is only 5%. But if you instead run 100 independent tests at a significance level of {\alpha=0.05}, all with true nulls, the probability of making at least one type I error soars to

\displaystyle 1-(1-0.05)^{100}\approx 99\%,

and this only gets worse as you run increasingly many tests. This is the multiple testing problem, and this blog post discusses a couple of solutions to it as well as why they work. Most of this is taken from my notes for UChicago’s STAT 278, which I’d highly recommend to anyone interested in these sorts of things.

The set up for the rest of this post is as follows. We have p-values {p_1,\cdots, p_n} that we’ve acquired from some hypothesis tests. Some of these come from true null hypotheses, and are distributed uniformly on {[0,1]}. The rest come from false null hypotheses, and we know nothing about their distribution.

We want to run a test that will have a significance level of {\alpha}, not for each test individually, but for all of them together. This will turn out to mean different things for different methods.

1. Bonferroni Correction

This is a method for testing the global null hypothesis, which is the hypothesis that all of the null hypotheses that generated our p-values are true (so we take the intersection of the relevant events). We do this in the most naive way possible, and it turns out to work.

We just test each of the {p_i} at a significance level of {\alpha/n}, and reject the global null if any of the {p_i} are rejected. That is, we reject the global null if {p_i\leq \alpha/n} for some {i} and fail to reject otherwise.
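
In code, the whole procedure is essentially a one-liner; a minimal sketch:

    import numpy as np

    def bonferroni_global_test(p_values: np.ndarray, alpha: float = 0.05) -> bool:
        """Reject the global null iff some p-value is <= alpha / n."""
        n = len(p_values)
        return bool(np.any(p_values <= alpha / n))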

In what sense does this work?

Proposition 1. When testing the global null hypothesis using Bonferroni correction, the probability of making a type I error is {\leq\alpha}.

Proof: This is direct. The probability that we reject the global null is

\displaystyle \mathop{\mathbb P}\left(\min_{1\leq i\leq n} p_i\leq \frac{\alpha}{n}\right)=\mathop{\mathbb P}\left(\bigcup_{i=1}^n \left\{p_i\leq \frac{\alpha}{n}\right\}\right),

which by the union bound is

\displaystyle \leq \sum_{i=1}^n\mathop{\mathbb P}\left(p_i\leq \frac{\alpha}{n}\right)=n\cdot \frac{\alpha}{n}=\alpha,

so we have FWER control. \Box

So that’s nice. But it’s also extremely conservative, and we can afford to do things that are a bit more powerful.

2. Simes’ Test

Here we’re also testing the global null, but a bit more cleverly. Simes’ test rejects the global null if there is 1 p-value that is {\leq \frac{\alpha}{n}}, or 2 p-values that are {\leq \frac{2\alpha}{n}}, or in general, if there are {k} p-values that are {\leq \frac{k\alpha}{n}} for any {1\leq k\leq n}. This will reject whenever Bonferroni rejects, but also in some other cases, so it’s more powerful. This power costs something, though: the bound we can prove on the type I error isn’t as strong as the one for Bonferroni.
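
A sketch of the procedure: sort the p-values and compare the {k}-th smallest against {\frac{k\alpha}{n}}.

    import numpy as np

    def simes_global_test(p_values: np.ndarray, alpha: float = 0.05) -> bool:
        """Reject the global null iff the k-th smallest p-value is <= k*alpha/n for some k."""
        n = len(p_values)
        sorted_p = np.sort(p_values)
        thresholds = alpha * np.arange(1, n + 1) / n
        return bool(np.any(sorted_p <= thresholds))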

Proposition 2. When testing the global null hypothesis using Simes’ test (with no assumptions on the dependence between the p-values), the probability of making a type I error is {\leq\alpha H_n\approx \alpha \log n}, where {H_n=1+\frac{1}{2}+\cdots+\frac{1}{n}} is the harmonic number.

Proof: We assign a score to each {p}-value that measures what percent of the way it gets us to rejecting the global null. Let

\displaystyle s_i=\left\{\begin{aligned} 1&\text{ if }&& p_i\leq \frac{\alpha}{n}\\ \frac{1}{2}&\text{ if }&& \frac{\alpha}{n}< p_i\leq \frac{2\alpha}{n}\\ &\vdots \\ \frac{1}{n}&\text{ if }&& \frac{\alpha(n-1)}{n}< p_i\leq {\alpha}\\ 0&\text{ if } &&p_i>\alpha\end{aligned}\right.

Then, Simes rejects if and only if, for some {k}, there are {k} {p_i}’s that are {\leq \frac{k\alpha }{n}}; when that happens, each of those {k} p-values has score at least {\frac{1}{k}}, so the sum of the {s_i} is {\geq 1}. So now we can bound the probability of that happening.

The expected value of any given score is

\displaystyle \mathop{\mathbb E}[s_i]=1\cdot \frac{\alpha}{n}+\cdots +\frac{1}{n}\cdot \frac{\alpha}{n}=\frac{\alpha}{n}H_n.

Then,

\displaystyle \mathop{\mathbb P}(\text{Simes rejects})\leq \mathop{\mathbb P}\left(\sum_{i=1}^n s_i\geq 1\right)\leq \mathop{\mathbb E}\left[\sum_{i=1}^n s_i\right] =n\cdot \frac{\alpha}{n}H_n =\alpha H_n,

where the second inequality is Markov’s.

\Box

Note that for large {n}, the {\log n} factor counts a lot, and the bound is sharp. The equality construction is a bit of a pain, but the idea is to choose the joint distribution of the {p_i} to make them negatively dependent.

3. Holm-Bonferroni

This test is different from the previous two in that it does not test the global null, but instead individually tests each of the {p_i} and then controls the family-wise error rate (FWER), which is the probability of making at least one type I error across all the p-values. As an added bonus, we can accept/reject individual p-values rather than accepting/rejecting them all at once, so we have a much better sense of how significant the results are, taken together.

The idea of Holm-Bonferroni is to run Bonferroni until acceptance. We first run Bonferroni on {p_1,\cdots, p_n}. If we do not reject the global null, we stop. If we do, then we reject the smallest p-value, and repeat this procedure with the other p-values, continuing until we do not reject the global null.

Equivalently, if we order the p-values as {p_{(1)},\cdots, p_{(n)}}, this is the same as checking if

\displaystyle p_{(k)}\leq \frac{\alpha}{n+1-k}

for each {k=1,\cdots, n} and rejecting all the p-values before the first {k} that fails this test.
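
Here is a sketch of the procedure in this sorted form; it returns a rejection decision for each p-value, in the original order.

    import numpy as np

    def holm_bonferroni(p_values: np.ndarray, alpha: float = 0.05) -> np.ndarray:
        """Step down: reject p_(k) while p_(k) <= alpha/(n+1-k); stop at the first failure."""
        n = len(p_values)
        order = np.argsort(p_values)
        thresholds = alpha / (n + 1 - np.arange(1, n + 1))    # alpha/n, alpha/(n-1), ..., alpha
        passed = p_values[order] <= thresholds
        num_rejected = n if passed.all() else int(np.argmin(passed))  # index of first failure
        rejected = np.zeros(n, dtype=bool)
        rejected[order[:num_rejected]] = True
        return rejected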

Proposition 3. For Holm-Bonferroni, we have {\text{FWER}\leq \alpha}.

Proof: Let {n_0=\#\text{ of nulls}}. Then

\displaystyle \mathop{\mathbb P}\left(\min_{\text{nulls }i} p_i\leq \frac{\alpha}{n_0}\right)\leq \alpha,

from a union bound. When this event occurs, we might have false rejections, but it occurs at most a fraction {\alpha} of the time. So we show that the rest of the time, there are no false rejections.

Suppose {p_i>\frac{\alpha}{n_0}} for all nulls {i}. Then, take the ordered {p}-values {p_{(1)},\cdots, p_{(n)}}, and let {p_{(k)}} be the first {p}-value that comes from a null. We know that this {p}-value can’t be too small, and we know it can’t appear too late in the list, because there are at most {n-n_0} non-nulls before it; this means that {k\leq n-n_0+1}. There are two cases now: either the procedure stops before step {k} (which means no false rejections, because there are no nulls before step {k}), or it gets to step {k}, and we check whether {p_{(k)}\leq \frac{\alpha}{n+1-k}}. But we know that {p_{(k)}>\frac{\alpha}{n_0}} and {k\leq n-n_0+1}, so {\frac{\alpha}{n+1-k}\leq \frac{\alpha}{n_0}}. By our assumption on the null {p_i}, then, we stop at step {k}, and there are no false rejections. \Box

As with the Bonferroni correction, we don’t need any assumptions of independence.

4. Benjamini-Hochberg

This is another multiple testing procedure, but instead of controlling the FWER, it controls

\displaystyle \text{FDR}=\mathop{\mathbb E}\left[\frac{\#\text{ null p-values rejected}}{\#\text{ p-values rejected}}\right],

the false discovery rate. The quantity we’re taking the expectation of is the false discovery proportion (FDP).

Here, we build on Simes’ test: we reject the {\hat{k}} smallest p-values, where {\hat{k}} is the largest {k} for which there are {k} p-values that are {\leq \alpha\frac{k}{n}}. Equivalently, we choose the threshold {\alpha\frac{\hat{k}}{n}}, and then reject all the p-values at or below this threshold.
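
A sketch, mirroring the Holm implementation above: find the largest {k} with {p_{(k)}\leq \alpha\frac{k}{n}} and reject the {k} smallest p-values.

    import numpy as np

    def benjamini_hochberg(p_values: np.ndarray, alpha: float = 0.05) -> np.ndarray:
        """Reject the k-hat smallest p-values, where k-hat is the largest k with p_(k) <= alpha*k/n."""
        n = len(p_values)
        order = np.argsort(p_values)
        thresholds = alpha * np.arange(1, n + 1) / n
        below = np.nonzero(p_values[order] <= thresholds)[0]
        k_hat = int(below[-1]) + 1 if below.size else 0
        rejected = np.zeros(n, dtype=bool)
        rejected[order[:k_hat]] = True
        return rejected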

Proposition 4. If the p-values are independent, then Benjamini-Hochberg gives {\text{FDR}=\alpha\frac{n_0}{n}\leq \alpha}.

Proof: Suppose we make {\hat{k}} rejections, so our threshold is {\alpha\frac{\hat{k}}{n}}, meaning that {p_i} is rejected iff {p_i\leq \alpha\frac{\hat{k}}{n}}. The trouble is that here, the RHS also depends on {p_i}, so we have to go through some gymnastics to get around this.

Let {\hat{k}_i} be the largest {k} such that there are {k-1} many p-values {p_j} with {j\neq i} that are {\leq \alpha\frac{k}{n}}. Call this statement the {\text{BH}_i} statement, and the statement that there are {k} many p-values {\leq \alpha\frac{k}{n}} the BH statement.

Claim. If {p_i} is rejected, then {\hat{k}=\hat{k}_i}.

Now, suppose {p_i} is rejected. Then {p_i \leq \alpha \frac{\hat{k}}{n}}, and BH holds for {k=\hat{k}}, so there are {\hat{k}-1} of the {p_j}, {j\neq i}, that are also {\leq \alpha \frac{\hat{k}}{n}}. Thus BH{_{\text{i}}} holds for {k=\hat{k}}, and so {\hat{k}_i\geq \hat{k}}.

On the other hand, BH{_{\text{i}}} holds at {k=\hat{k}_i} by definition, so there are {\hat{k}_i-1} of the {p_j}, {j\neq i}, that are {\leq \alpha \frac{\hat{k}_i}{n}}; since {p_i\leq \alpha \frac{\hat{k}}{n}\leq \alpha \frac{\hat{k}_i}{n}} as well, there are {\hat{k}_i} p-values in total that are {\leq \alpha \frac{\hat{k}_i}{n}}. So BH is true at {k=\hat{k}_i}, which means {\hat{k}\geq \hat{k}_i}. It follows that {\hat{k}=\hat{k}_i}.

With this claim, we can move onto the main result. We have

\displaystyle \text{FDR}=\mathop{\mathbb E}[\text{FDP}]=\mathop{\mathbb E}\left[\frac{\sum_{i\text{ null}} \mathbf{1}\{p_i\text{ is rejected}\}}{\max(1, \hat{k})}\right]=\sum_{i\text{ null}} \mathop{\mathbb E}\left[\frac{\mathbf{1}\{p_i\text{ is rejected}\}}{\max(1,\hat{k})}\right].

We can replace the denominator with {\hat{k}_i} without changing the sum, because when they differ, the numerator is 0, so it doesn’t matter. Then, this is

\displaystyle \sum_{i\text{ null}}\mathop{\mathbb E}\left[\frac{\mathbf{1}\{p_i\text{ rejected}\}}{\hat{k}_i}\right]\leq \sum_{i\text{ null}}\mathop{\mathbb E}\left[\frac{\mathbf{1}\{p_i\leq \alpha \frac{\hat{k}_i}{n}\}}{\hat{k}_i}\right].

This is an inequality because if the first numerator is 1, the second is definitely 1, but if the first is 0, the second might not be.

Now, we use the tower law to write this as

\displaystyle \sum_{i\text{ null}} \mathop{\mathbb E}\left[\mathop{\mathbb E}\left[\frac{\mathbf{1}\{p_i\leq \alpha\cdot \hat{k}_i/n\}}{\hat{k}_i}\Big| p_j, j\neq i\right]\right].

Since {\hat{k}_i} depends only on the {p_j} with {j\neq i}, and {p_i} is uniform and independent of them, that conditional expectation is just

\displaystyle \frac{\mathop{\mathbb P}\{p_i\leq \alpha\hat{k}_i/n\mid p_j, j\neq i\}}{\hat{k}_i}=\frac{\alpha \hat{k}_i/n}{\hat{k}_i}=\alpha/n.

This is a constant, and the expected value of that is just {\alpha/n}. When you add this over all the null terms, you get {\alpha\frac{n_0}{n}}, as desired. \Box

Here, we actually get a better bound than {\alpha}, which might seem like a good thing, but also means we’re not making as many discoveries as we could be. We can correct for this by estimating {n_0}, and then running BH with {\alpha\frac{n}{n_0}}, so that the FDR for this procedure will be {\leq \alpha}.

Although we used independence here, we can make do without it, albeit at the cost of a logarithmic factor.

Proposition 5. When running Benjamini-Hochberg, we have {\text{FDR}\leq \alpha\frac{n_0}{n}H_n}.

Proof: As before, we define a score {s_i} for each {p_i} that is 1 if {0\leq p_i\leq \alpha \frac{1}{n}}, {\frac{1}{2}} if {\alpha\frac{1}{n}<p_i\leq \alpha \frac{2}{n},} and so on, giving a score of 0 if {p_i>\alpha}.

We proved above that for a null {p}-value, the expected score is

\displaystyle \frac{\alpha}{n}\left(1+\cdots+\frac{1}{n}\right).

Now, we write

\displaystyle \text{FDR}=\sum_{i\text{ null}} \mathop{\mathbb E}\left[\frac{\mathbf{1}\{p_i\text{ rejected}\}}{\max(1, \hat{k})}\right].

If {p_i} is rejected, then {p_i\leq \alpha \hat{k}/n}, so its score is {\geq 1/\hat{k}}. Put another way,

\displaystyle \frac{\mathbf{1}\{p_i\text{ rejected}\}}{\max(1, \hat{k})}\leq s_i.

Thus,

\displaystyle \text{FDR}\leq \sum_{i\text{ null}} \mathop{\mathbb E}[s_i]=\alpha\frac{n_0}{n}\left(1+\cdots+\frac{1}{n}\right).

\Box

Thoughts on Live TeXing

I decided the summer before coming to college that I wanted to live-TeX notes for my math classes, and dutifully set out to seek advice from the Internet. The results were uncharacteristically disappointing. Only a handful of blog posts and articles discuss the subject, and that handful really only addresses the logistics involved. So this post is a more spiritual take on live-TeX’ing, based on my experiences.

(Caveat emptor: my experiences are probably not universal. Your mileage may vary.)

Before I say anything about TeX’ing, a quick summary of the alternatives. Taking notes on paper, although apparently effective for some people, doesn’t work for me because of my poor handwriting. Not taking notes, although certainly effective in some situations, doesn’t seem like a good idea in every situation. So for me, the choice is between TeX’ing, and sitting back and listening.

TeX’ing notes has the advantage of forcing you to pay attention to the superstructure of a lecture. There’s usually more to a lecture than the union of all the theorems, definitions, and proofs, and typing things like “We want this to be true in general, but…,” or “The punchline is that…,” or “With this example in mind, we define…” quickly draws your attention to them. You could certainly do this with a pen and paper too, but they’re usually too slow to give you time to jot down these longer ideas (unless that’s all you do). The speed of typing gives you extra time to spend on exposition, and crafting this exposition on the fly solidifies your ideas about the material.

This advantage is tempered, though, by the difficulties that can arise from TeX’ing long technical arguments. One flavor of such arguments, which involves repeated manipulations of a single expression, is very amenable to being TeX’ed – copy-pasting is significantly faster than rewriting the entire expression. But another flavor involves many different expressions that hardly resemble each other, and these are deeply problematic, especially if matrices rear their ugly heads or if subscripts are floating around. I’ve made the mistake before of getting caught up in transcription, only to realize that I have no clue what the math I’m transcribing means.

Another issue is that superstructure isn’t always what you should be paying attention to in a lecture. There’s a certain kind of wild, edge-of-your-seat lecture, where it’s all you can do to hang on and follow the speaker line by line. In these talks, trying to suss out those big ideas not only fails, but actually detracts from your ability to understand the basic material. On the other end of the spectrum, some lectures are on material that you’ve seen or are completely familiar with, and have already internalized the yoga of, so this advantage completely evaporates.

A great deal also depends on the lecturer. Obviously good lecturers are easy to take notes on, and bad lecturers hard, but many otherwise insignificant quirks can make a particular lecturer a pain to live-TeX.

Some lecturers are fond of nesting things. They’ll start a theorem, realize they need a definition, make a few remarks about the definition, maybe prove a claim. This is all fine at the board, but if you’re working (as I often do) with boxes around theorems and definitions on the page, it doesn’t translate well at all.

Other lecturers are prone to making corrections to their board work that are easy at the board, but hard on a computer. Swapping entries in a large matrix, adding forgotten tildes, or erasing and rewriting part of an expression while verbally explaining the manipulation all fall into this category.

And finally, there are lecturers who draw pictures. Simpler pictures, or complicated pictures drawn once and left alone are usually fine, and can be recreated afterward if they’re essential, but sometimes you have lectures that basically consist of progressively adding to a single image over the course of an hour, and these are really a lost cause. I have no idea how to handle them.

Of course, it’s very hard to synthesize all of these ideas into a hard and fast rule for deciding whether or not to TeX a talk. The system I’ve developed is that unless I’m very confident that it won’t work out, I’ll err on the side of trying to TeX, but am also very forgiving of myself for giving up halfway through a talk. It’s not a perfect system, but it gives reasonable results, especially as I become more aware of the red flags that tell me to stop TeX’ing.

Martingale Convergence

Last quarter, I did a DRP (directed reading program) on martingales, and at the start of this quarter, gave a short talk on what I learned. Here are the notes from that talk.

1. Conditional Expectation

Definition 1. Suppose {(\Omega, \mathcal{F}_o, \mathbb{P})} is a probability space, {\mathcal{F}\subset \mathcal{F}_o} is a sub {\sigma}-algebra, and {X} is an integrable random variable.

  1. The conditional expectation of {X} given {\mathcal{F}}, {E(X\mid \mathcal{F})}, is any {\mathcal{F}}-measurable random variable {Y} such that

    \displaystyle \int_A X\, d\mathbb{P}=\int_A Y\, d\mathbb{P}\text{ for all } A\in \mathcal{F}.

  2. If {Y} is another random variable, then {E(X\mid Y)} is defined as {E(X\mid \mathcal{F})} where {\mathcal{F}} is the {\sigma}-algebra generated by {Y}.

Fact. The conditional expectation exists, is unique, and is integrable.

Intuitively, we can think of {E(X\mid \mathcal{F})} as the best guess of the value of {X} given the information available in {\mathcal{F}}.

Example 1.

  1. If {X} is {\mathcal{F}}-measurable, then {E(X\mid \mathcal{F})=X}.
  2. If {X} and {Y} are independent, then {E(X\mid Y)=E(X)}.
  3. Let {X_1,\cdots} be independent random variables with mean {\mu}, and let {S_n=X_1+\cdots+X_n} be the partial sums. Then, if {m>n},

    \displaystyle E(S_m\mid S_n)=E(X_m+\cdots+X_{n+1}+S_n\mid S_n)=E(X_m\mid S_n)+\cdots+E(X_{n+1}\mid S_n)+E(S_n\mid S_n)=\mu(m-n)+S_n.

    The idea is that you know your position at {S_n}, and you take {m-n} steps whose sizes are, on average, {\mu}, so your best guess for your position is {S_n+\mu(m-n)}.

2. Martingales

A martingale is a model of a fair game.

Definition 2. Consider a filtration (increasing sequence of {\sigma}-algebras) {\mathcal{F}_n} and a sequence of random variables {X_n}, each measurable with respect to {\mathcal{F}_n} and integrable. Then, if {E(X_{n+1}\mid \mathcal{F}_n)=X_n} for all {n}, we say {X_n} is a martingale.

Example 2. Let {X_1,\cdots} be independent random variables that take only the values {-1} and {1} with probability {1/2} each. Then {S_n=X_1+\cdots+ X_n} is a martingale, because

\displaystyle E(S_{n+1}\mid S_n)=E(X_{n+1}+S_n\mid S_n)=E(X_{n+1}\mid S_n)+E(S_n\mid S_n)=E(X_{n+1})+S_n=S_n.

Theorem 3. If {X_n} is a martingale with {\sup E(X_n^+)<\infty}, then {X_n} converges a.s. to a limit {X} with {E(|X|)<\infty}.

I’ll only sketch this proof, because even though the idea is nice, the details are a little annoying. The idea is to set up a way of betting on a martingale, show that you can’t make money off such a betting system, and then use this to draw conclusions about the martingale’s behavior.

Definition 4. Let {\mathcal{F}_n} be a filtration. A sequence {H_m} of random variables is said to be predictable if {H_m} is {\mathcal{F}_{m-1}} measurable for all {m}.

That is, the value of {H_m} is already known with certainty by time {m-1}, before {X_m} is revealed. Then, if we “bet” an amount {H_m} on the martingale {X_m}, our total winnings will be

\displaystyle (H\cdot X)_n=\sum_{m=1}^n H_m(X_m-X_{m-1}).

If {X_n} is a supermartingale and {H_n} is non-negative and bounded, then {(H\cdot X)_n} is a supermartingale as well.

Now, fix an interval {(a,b)} and choose {H_n} in the following way: if {X_n<a}, then bet 1, and continue to do so until {X_n>b}. Then bet zero until you fall back below {a}, and repeat. Every time you go from below {a} to above {b}, you will make a profit of at least {(b-a)}. Let {U_n} be the number of these upcrossings that occur. Then, you can use the previous fact, together with the bound on {\sup E(X_n^+)}, to show that the number of these upcrossings is finite with probability 1. Since this is true for arbitrary {(a,b)}, the {X_n} must converge.

Finally, we give a brief application of the martingale convergence theorem.

Proposition 5. Let {X_n} be a sequence of independent random variables such that {P(X_n=1)=P(X_n=-1)=\frac{1}{2}.} Then {{\sum_{j=1}^{\infty} \frac{X_j}{j}}} converges with probability 1.

Proof: Let {{S_n=\sum_{j=1}^n\frac{X_j}{j}}.} It’s easy to check that {S_n} is a martingale, and that {\sup_n E(S_n^+)<\infty} (for instance, {E(S_n^2)=\sum_{j=1}^n \frac{1}{j^2}} is bounded, which bounds {E|S_n|} and hence {E(S_n^+)}), so the {S_n} converge almost surely. Thus the random harmonic series converges almost surely. \Box
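
As a purely optional sanity check, here is a quick simulation sketch: each sampled path of the partial sums {S_n} settles down to a finite limit, as the theorem predicts.

    import numpy as np

    rng = np.random.default_rng(1)
    n_paths, n_terms = 5, 10**5
    signs = rng.choice([-1.0, 1.0], size=(n_paths, n_terms))             # the X_j
    partial_sums = np.cumsum(signs / np.arange(1, n_terms + 1), axis=1)  # S_n along each row
    print(partial_sums[:, -1])   # approximate limits of five sample paths
    # The change over the second half of each path is tiny, consistent with a.s. convergence.
    print(np.abs(partial_sums[:, -1] - partial_sums[:, n_terms // 2]).max())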

Mercer’s Theorem and SVMs

In a funny coincidence, this post has the same basic structure as my previous one: proving some technical result, and then looking at an application to machine learning. This time it’s Mercer’s theorem from functional analysis, and the kernel trick for SVMs. The proof of Mercer’s theorem mostly follows Lax’s Functional Analysis.

1. Mercer’s Theorem

Consider a real-valued function {K(s,t)} on {[0,1]\times[0,1]}, and the corresponding integral operator {\mathbf{K}: L^2[0,1]\rightarrow L^2[0,1]} given by

\displaystyle (\mathbf{K} u)(s)=\int_0^1 K(s,t) u(t)\, dt.

We begin with two facts connecting the properties of {K} to the properties of {\mathbf{K} }.

Proposition 1. If {K} is continuous, then {\mathbf{K} } is compact.

Proof: Consider a bounded sequence {\{f_n\}_{n=1}^{\infty} \subset L^2[0,1]}. We wish to show that the image of this sequence, {\{\mathbf{K} f_n\}_{n=1}^{\infty}}, has a convergent subsequence. We show that {\{\mathbf{K} f_n\}} is equicontinuous (a similar estimate shows it is uniformly bounded), and Arzela-Ascoli then gives a uniformly convergent subsequence, which in turn gives that {\mathbf{K} } is compact.

Since {K(x,y)} is continuous, for every {\epsilon>0}, there exists {\delta>0} such that

\displaystyle |x-\tilde{x}|<\delta \implies |K(x,y)-K(\tilde{x}, y)|\leq \epsilon.

Then, if {|x-\tilde{x}|<\delta}, we have that

\displaystyle |\mathbf{K} f_n(x)- \mathbf{K} f_n(\tilde{x})| = \left| \int_0^1 \left(K(x, t)-K(\tilde{x}, t)\right) f_n(t)\, dt \right|,

which is bounded by

\displaystyle \int_0^1 |K(x,t)-K(\tilde{x},t)| |f_n(t)| \, dt \leq \epsilon \int_0^1 |f_n(t)|\, dt\leq \epsilon M,

where {M} is an upper bound on the {L^2} norms of the {f_n} (the last step is Cauchy-Schwarz on {[0,1]}). The equicontinuity of the {\mathbf{K} f_n}, and thus the compactness of {\mathbf{K} }, follows. \Box

Proposition 2. If {K} is symmetric, then {\mathbf{K} } is self-adjoint.

Proof: We directly compute that

\displaystyle \langle \mathbf{K} u, v\rangle = \int_0^1\int_0^1 K(s,t)\, u(s) v(t)\, ds\, dt

and

\displaystyle \langle u, \mathbf{K} v\rangle = \int_0^1\int_0^1 K(t,s)\, u(s) v(t)\, ds\, dt,

which will be equal if {K(s,t)=K(t,s)}, so {\mathbf{K} } is self-adjoint in this case. \Box

Thus, when {K} is continuous and symmetric, the operator {\mathbf{K} } is a compact self-adjoint operator, and so we can apply the spectral theorem to find a set of orthonormal eigenfunctions {e_j} with real eigenvalues {\kappa_j} that form a basis for {L^2[0,1]}. The following theorem connects {K} to these eigenfunctions and eigenvalues in the case where {\mathbf{K} } is positive.

Theorem 3 (Mercer). If {K} is symmetric and continuous, and the associated operator {\mathbf{K} } is positive i.e. {\langle \mathbf{K} u, u\rangle \geq 0} for all {u}, then we can write

\displaystyle K(s,t)=\sum_{j=1}^{\infty} \kappa_j e_j(s)e_j(t),

where the {\kappa_j} and {e_j} are the eigenvalues and eigenfunctions of {\mathbf{K} }.

To prove this, we’ll need a lemma.

Lemma 4. If {\mathbf{K} } is positive, then {K} is non-negative on the diagonal.

Proof: Suppose for the sake of contradiction that {K(r,r)<0} for some {r}. By continuity, {K(s,t)} is negative for all {(s,t)} in a small neighborhood of {(r,r)}, so choose {u} to be a non-negative bump function that is zero except in a small neighborhood around {r}. Then,

\displaystyle \langle \mathbf{K} u, u\rangle = \int_0^1\int_0^1 K(s,t) u(t) u(s)\, ds \, dt

will be negative, contradicting the positivity of {\mathbf{K} }. \Box

Proof of Mercer’s Theorem: Define a sequence of functions

\displaystyle K_N(s,t)=\sum_{j=1}^N \kappa_j e_j(s) e_j(t),

and let {\mathbf{K} _N} be the associated sequence of integral operators. We first show that the {K_N} converge uniformly.

The operator {\mathbf{K} -\mathbf{K} _N} has eigenfunctions {e_j} and eigenvalues {\kappa_j} for {j>N}, so it is a positive operator. But then {K-K_N} must be non-negative on the diagonal by our lemma, and so

\displaystyle K(s,s)-\sum_{j=1}^N \kappa_j e_j^2(s)\geq 0\implies K(s,s)\geq \sum_{j=1}^N \kappa_j e_j^2(s)

for all {N}. Since the sum on the right is always non-negative, it must converge for every {s}. This gives point-wise monotone convergence of the sequence of partial sums, which means they converge uniformly in {s}.

But now by Cauchy-Schwarz,

\displaystyle \left|\sum_{j=1}^N \kappa_j e_j(s) e_j(t)\right| \leq \left(\sum_{j=1}^N \kappa_j e_j^2(s)\right)^{1/2}\left(\sum_{j=1}^N \kappa_j e_j^2(t)\right)^{1/2},

and since both of the series on the right converge uniformly, the series on the left does as well.

Let {K_{\infty}} be the limit of the {K_N}. Then, by definition, {\mathbf{K} } and {\mathbf{K} _{\infty}} agree on all of the {e_j} and their linear combinations, and since the {e_j} span the space, this means {\mathbf{K} } and {\mathbf{K} _{\infty}} agree everywhere, and so they must be the same operator. But this means they have the same kernel, so {K= K_{\infty}}, completing the proof. \Box

2. The Kernel Trick

The basic idea of the kernel trick is this: let’s say you have some data points {\mathbf{x}_1, \mathbf{x}_2,\cdots, \mathbf{x}_N}. If you want to transform these data points by some map {\phi}, you have to apply {\phi} to all of them, and if you want to run an SVM, you have to compute all {N^2} dot products {\phi(\mathbf{x}_i)\cdot \phi(\mathbf{x}_j)}. If {\phi} maps into a very high-dimensional (or infinite-dimensional) space, this can be a huge pain, so the dream is to choose {\phi} so that you can compute these dot products without actually having to transform the data, i.e., find some {K} such that

\displaystyle \phi(\mathbf{x}_i)\cdot \phi(\mathbf{x}_j)= K(\mathbf{x}_i, \mathbf{x}_j).

Mercer’s theorem allows us to do exactly this, but in the other direction – if we choose a continuous, symmetric {K} so that the associated integral operator {\mathbf{K} } is positive, then we can write

\displaystyle K(\mathbf{x}_i, \mathbf{x}_j)=\sum_{m=1}^{\infty} \kappa_m e_m(\mathbf{x}_i) e_m(\mathbf{x}_j)= \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j)\rangle,

where

\displaystyle \phi: \mathbf{x}\mapsto \left[\begin{array}{ccc} \sqrt{\kappa_1}\, e_1(\mathbf{x})& \sqrt{\kappa_2}\, e_2(\mathbf{x})&\cdots \end{array}\right].

So instead of trying to choose {\phi} so that {K} exists, we just choose an appropriate {K}, and don’t even need to think about {\phi}. In particular, we can work with the data in another feature space without running directly into computational barriers.
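
As a tiny illustration of the trick itself (not of the Mercer machinery), here is a sketch using the polynomial kernel {K(\mathbf{x},\mathbf{y})=(\mathbf{x}\cdot\mathbf{y})^2}, for which the feature map {\phi} can be written down explicitly in two dimensions; the point is that the Gram matrix computed from {K} alone matches the one computed by explicitly transforming the data.

    import numpy as np

    def kernel_gram(X: np.ndarray) -> np.ndarray:
        """Gram matrix of K(x, y) = (x . y)^2, computed without any feature map."""
        return (X @ X.T) ** 2

    def explicit_features(X: np.ndarray) -> np.ndarray:
        """Explicit feature map for 2-D inputs: phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2)."""
        x1, x2 = X[:, 0], X[:, 1]
        return np.column_stack([x1**2, x2**2, np.sqrt(2) * x1 * x2])

    rng = np.random.default_rng(0)
    X = rng.standard_normal((6, 2))                     # a handful of 2-D data points
    phi = explicit_features(X)
    print(np.allclose(kernel_gram(X), phi @ phi.T))     # True: same dot products, phi never needed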

Conditional Expectation as Projection

This short note gives a geometric perspective on conditional expectation, and a nice application of this fact to regression.

1. Conditional Expectation as Projection

Theorem 1. Let {(\Omega, \mathcal{F}, \mathbb{P})} be a probability space, let {\tilde{\mathcal{F}}\subset \mathcal{F}} be a sub {\sigma}-algebra of {\mathcal{F}}, and let {X} be an {L^2} random variable. Then {\mathbb{E}(X\mid \tilde{\mathcal{F}})} is the projection of {X\in L^2(\mathcal{F})} onto {L^2(\tilde{\mathcal{F}})}.

Before we prove this, let’s set the measure theory aside for a second and think intuitively: a common way to introduce conditional expectation without measure theory is to call it “the best guess you would make for {X} given some information.” With this in mind, it makes sense that the conditional expectation would be the point, among everything you can build from the information you have, that is “closest” to {X}. But the nice thing is that this isn’t just a cute analogy – it’s a precise fact.

Proof: The squared norm in {L^2(\Omega)} is the expected square, so it’s sufficient to show that

\displaystyle \mathbb{E}(X\mid \tilde{\mathcal{F}})=\arg\min_{Y\in L^2(\tilde{\mathcal{F}})} \mathbb{E}\left[(X-Y)^2\right].

To do this, we’ll need a lemma.

Lemma 2. If {Z\in L^2(\tilde{\mathcal{F}})}, then {\mathbb{E}(Z(X-\mathbb{E}(X\mid \tilde{\mathcal{F}})))=0.}

Proof: Since {Z} is {\tilde{\mathcal{F}}}-measurable, it can be pulled in and out of conditional expectations given {\tilde{\mathcal{F}}}; by Cauchy-Schwarz, {ZX\in L^1(\Omega)}, so we have

\displaystyle Z\mathbb{E}(X\mid\tilde{\mathcal{F}})=\mathbb{E}(ZX\mid \tilde{\mathcal{F}})\implies \mathbb{E}(Z\mathbb{E}(X\mid \tilde{\mathcal{F}}))=\mathbb{E}(\mathbb{E}(ZX\mid\tilde{\mathcal{F}}))=\mathbb{E}(ZX),

and then subtracting gives the result. \Box

With this lemma in hand, take some {Y\in L^2(\tilde{\mathcal{F}})}. Then, we have

\displaystyle \mathbb{E}((X-Y)^2)=\mathbb{E}((X-\mathbb{E}(X\mid\tilde{\mathcal{F}})+\mathbb{E}(X\mid\tilde{\mathcal{F}})-Y)^2),

and expanding this gives

\displaystyle \mathbb{E}((X-\mathbb{E}(X\mid \tilde{\mathcal{F}}))^2)+ 2\mathbb{E}([X-\mathbb{E}(X\mid \tilde{\mathcal{F}})][\mathbb{E}(X\mid\tilde{\mathcal{F}})-Y])+\mathbb{E}((\mathbb{E}(X\mid\tilde{\mathcal{F}})-Y)^2).

The cross-term vanishes because of our lemma (applied with {Z=\mathbb{E}(X\mid\tilde{\mathcal{F}})-Y}). The first term does not depend on {Y}. The last term is clearly non-negative, and is zero when {Y=\mathbb{E}(X\mid \tilde{\mathcal{F}})}, so this choice of {Y} must be the minimizer. The result follows. \Box

2. Regression

This new paradigm has at least one nice application: understanding linear regression better.

Suppose you have some data points that are generated from jointly distributed random variables {X} and {Y}, which satisfy {Y=\alpha X+\beta+\epsilon}, where {\alpha, \beta} are constants and {\epsilon\sim N(0,1)} is Gaussian noise independent of {X}. We observe a bunch of data from this distribution, {(\mathbf{x}_1, y_1),\cdots, (\mathbf{x}_n, y_n)}, so that {y_i=\alpha \mathbf{x}_i+\beta+\epsilon_i}.

We call {\mathbb{E}(Y\mid X)=\alpha X+\beta} the true regression function, both because it is the true underlying function and because a direct computation (using the previous section) gives that it is the function of {X} that minimizes the expected squared error of a future prediction. In practice, we don’t know the distributions of {X, Y}, so we have to approximate the true regression function with an empirical regression function, {\widehat{\mathbb{E}(Y\mid X)}=aX+b}.

To approximate the true regression function with the data, we want to approximate the projection of {Y} onto {L^2} of the {\sigma}-algebra generated by {X}. But we don’t know the underlying distribution, so we can’t compute that projection exactly; instead, we replace the {L^2} norm by its empirical version over the observed data points, and restrict attention to linear functions {aX+b}. Then, the approximate projection is found by choosing {a, b} to minimize the empirical {L^2} distance between our empirical regression function, evaluated at the {\mathbf{x}_i}, and the observed {y}-values. This means choosing {a,b} to minimize

\displaystyle \sum_{i=1}^n (a\mathbf{x}_i+b-y_i)^2.

But this is exactly the same as linear least squares regression. The fact that taking this statistical perspective about regression gives the same result as just deciding to minimize the sum of the squares (which can be derived in other ways as well) is not a coincidence, but rather a consequence of the fact that conditional expectations are also projections.
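
For completeness, here is a sketch of that minimization in code (with made-up values of {\alpha} and {\beta}), using the closed-form least-squares solution; the fitted line is the empirical stand-in for the projection {\mathbb{E}(Y\mid X)}.

    import numpy as np

    rng = np.random.default_rng(0)
    alpha, beta, n = 2.0, -1.0, 200                     # arbitrary "true" coefficients
    x = rng.standard_normal(n)
    y = alpha * x + beta + rng.standard_normal(n)       # y_i = alpha*x_i + beta + noise

    # Minimize sum_i (a*x_i + b - y_i)^2, i.e. project y onto span{x, 1}.
    design = np.column_stack([x, np.ones(n)])
    (a_hat, b_hat), *_ = np.linalg.lstsq(design, y, rcond=None)
    print(a_hat, b_hat)                                 # should be close to alpha and beta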